[jira] [Closed] (SPARK-15183) Adding outputMode to structure Streaming Experimental Api

2016-06-05 Thread Sachin Aggarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sachin Aggarwal closed SPARK-15183.
---
Resolution: Duplicate

> Adding outputMode to structure Streaming Experimental Api
> -
>
> Key: SPARK-15183
> URL: https://issues.apache.org/jira/browse/SPARK-15183
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Reporter: Sachin Aggarwal
>Priority: Trivial
>
> While experimenting with Structured Streaming, I found that mode() is used for 
> non-continuous queries while outputMode() is used for continuous queries.
> outputMode() is not defined, so I have written a rough implementation and 
> test cases just to make sure the streaming app works.
> Note:
> /** Start a query */
>   private[sql] def startQuery(
>       name: String,
>       checkpointLocation: String,
>       df: DataFrame,
>       sink: Sink,
>       trigger: Trigger = ProcessingTime(0),
>       triggerClock: Clock = new SystemClock(),
>       outputMode: OutputMode = Append): ContinuousQuery = {
> In my opinion, outputMode should be declared before triggerClock, since the 
> overload with outputMode will be used more often than the one with triggerClock.
> I have also added a triggerClock() method.
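For illustration, a minimal self-contained sketch of the ordering being proposed (simplified types and parameter names, an assumption rather than Spark's actual API): the commonly supplied default parameter is placed before the rarely supplied one, so callers can pass outputMode positionally without ever mentioning the trigger clock.

{code}
// Simplified sketch: frequently-overridden defaults come before rarely-overridden ones.
object StartQuerySketch {
  def startQuery(
      name: String,
      outputMode: String = "append",     // frequently supplied: placed first
      triggerClockMillis: Long = 0L      // rarely supplied: placed last
    ): String = s"$name / $outputMode / $triggerClockMillis"

  def main(args: Array[String]): Unit = {
    println(startQuery("query1", "complete"))  // no need to mention triggerClockMillis
    println(startQuery("query2"))              // all defaults
  }
}
{code}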



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2016-06-05 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316222#comment-15316222
 ] 

Gaurav Shah commented on SPARK-2984:


Seeing similar errors with Spark 1.6:
{noformat}

App > Caused by: java.io.FileNotFoundException: File 
s3n://{bucket}.../_temporary/0/task_201606041650_0053_m_00/year=2016 does 
not exist.
App >   at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:1471)
App >   at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:366)
App >   at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:368)
App >   at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:315)
App >   at 
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
App >   at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
App >   at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)

{noformat}


> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on an executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist, so it throws an exception (see the sketch after 
> this list).
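> A minimal mitigation sketch (an assumption added for illustration, not the fix 
> adopted for this ticket): with speculative execution turned off for output-writing 
> jobs, only one attempt of each task ever commits or cleans up {{_temporary}}.
> {code}
> // Sketch: run an output-writing job with speculation disabled (configuration only).
> import org.apache.spark.{SparkConf, SparkContext}
>
> val conf = new SparkConf()
>   .setAppName("write-without-speculation")
>   .set("spark.speculation", "false")
> val sc = new SparkContext(conf)
> {code}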
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> 

[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-06-05 Thread menda venugopal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316187#comment-15316187
 ] 

menda venugopal commented on SPARK-13525:
-

Hi,

We have a similar issue where Spark ML doesn't work with SparkR in yarn-client 
mode. In local mode it works. Spark Core and Spark SQL work without problems 
with R in both local and yarn-client modes.
#

> head(filter(df01, df01$C3 < 014001))
C0 C1 C2 C3 C4 C5 C6 C7
1 01001A008W-1 1 Lotissement Bellevue 01400 L'Abergement-Cl\xe9menciat CAD 
46.134565 4.924122
2 01001A008W-2 2 Lotissement Bellevue 01400 L'Abergement-Cl\xe9menciat CAD 
46.134633 4.924168
3 01001A008W-4 4 Lotissement Bellevue 01400 L'Abergement-Cl\xe9menciat CAD 
46.134637 4.924353
4 01001A008W-5 5 Lotissement Bellevue 01400 L'Abergement-Cl\xe9menciat CAD 
46.134518 4.924333
5 01001A008W-6 6 Lotissement Bellevue 01400 L'Abergement-Cl\xe9menciat CAD 
46.134345 4.924229
6 01001A015D-1 1 Lotissement Les Charmilles 01400 L'Abergement-Cl\xe9menciat 
C+O 46.151573 4.921591
> groupByCode_Postal <- groupBy(df01, df01$C3)
> code_Postal_G <- summarize(groupByCode_Postal, count = count(df01$C3))
> df <- createDataFrame(sqlContext, iris)
Warning messages:
1: In FUN(X[[5L]], ...) :
Use Sepal_Length instead of Sepal.Length as column name
2: In FUN(X[[5L]], ...) :
Use Sepal_Width instead of Sepal.Width as column name
3: In FUN(X[[5L]], ...) :
Use Petal_Length instead of Petal.Length as column name
4: In FUN(X[[5L]], ...) :
Use Petal_Width instead of Petal.Width as column name
> model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = 
> "gaussian")
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = 
"gaussian")
16/05/27 12:33:59 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, 
host-172-30-125-248.openstacklocal): java.net.SocketTimeoutException: Accept 
timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)

#

16/05/27 12:34:29 ERROR TaskSetManager: Task 0 in stage 5.0 failed 4 times; 
aborting job
16/05/27 12:34:29 ERROR RBackendHandler: fitRModelFormula on 
org.apache.spark.ml.api.r.SparkRWrappers failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 
8, host-172-30-125-248.openstacklocal): java.net.SocketTimeoutException: Accept 
timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:530)
at java.net.ServerSocket.accept(ServerSocket.java:498)
at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(Ma


> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head, summary, and filter methods are not overridden by Spark, hence 
> I need to call them using the `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri 

[jira] [Commented] (SPARK-15757) Error occurs when using Spark sql "select" statement on orc file after hive sql "insert overwrite tb1 select * from sourcTb" has been executed on this orc file

2016-06-05 Thread marymwu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316170#comment-15316170
 ] 

marymwu commented on SPARK-15757:
-

Actually, the field "inv_date_sk" does exist! We have executed "desc inventory"; 
the result is attached.

> Error occurs when using Spark sql "select" statement on orc file after hive 
> sql "insert overwrite tb1 select * from sourcTb" has been executed on this 
> orc file
> ---
>
> Key: SPARK-15757
> URL: https://issues.apache.org/jira/browse/SPARK-15757
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
> Attachments: Result.png
>
>
> Error occurs when using Spark sql "select" statement on orc file after hive 
> sql "insert overwrite tb1 select * from sourcTb" has been executed
> 0: jdbc:hive2://172.19.200.158:40099/default> select * from inventory;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 7.0 failed 8 times, most recent failure: Lost task 0.7 in 
> stage 7.0 (TID 2532, smokeslave5.avatar.lenovomm.com): 
> java.lang.IllegalArgumentException: Field "inv_date_sk" does not exist.
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at 
> org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:251)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:361)
>   at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:123)
>   at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:112)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:278)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:262)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace: (state=,code=0)



--
This message was sent by Atlassian JIRA

[jira] [Issue Comment Deleted] (SPARK-15757) Error occurs when using Spark sql "select" statement on orc file after hive sql "insert overwrite tb1 select * from sourcTb" has been executed on this orc

2016-06-05 Thread marymwu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marymwu updated SPARK-15757:

Comment: was deleted

(was: Actually,  Field "inv_date_sk" does exist! We have executed "desc 
inventory", the result is as attached.
)

> Error occurs when using Spark sql "select" statement on orc file after hive 
> sql "insert overwrite tb1 select * from sourcTb" has been executed on this 
> orc file
> ---
>
> Key: SPARK-15757
> URL: https://issues.apache.org/jira/browse/SPARK-15757
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
> Attachments: Result.png
>
>
> Error occurs when using Spark sql "select" statement on orc file after hive 
> sql "insert overwrite tb1 select * from sourcTb" has been executed
> 0: jdbc:hive2://172.19.200.158:40099/default> select * from inventory;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 7.0 failed 8 times, most recent failure: Lost task 0.7 in 
> stage 7.0 (TID 2532, smokeslave5.avatar.lenovomm.com): 
> java.lang.IllegalArgumentException: Field "inv_date_sk" does not exist.
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at 
> org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:251)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:361)
>   at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:123)
>   at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:112)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:278)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:262)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace: (state=,code=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-15757) Error occurs when using Spark sql "select" statement on orc file after hive sql "insert overwrite tb1 select * from sourcTb" has been executed on this orc file

2016-06-05 Thread marymwu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

marymwu updated SPARK-15757:

Attachment: Result.png

> Error occurs when using Spark sql "select" statement on orc file after hive 
> sql "insert overwrite tb1 select * from sourcTb" has been executed on this 
> orc file
> ---
>
> Key: SPARK-15757
> URL: https://issues.apache.org/jira/browse/SPARK-15757
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
> Attachments: Result.png
>
>
> Error occurs when using Spark sql "select" statement on orc file after hive 
> sql "insert overwrite tb1 select * from sourcTb" has been executed
> 0: jdbc:hive2://172.19.200.158:40099/default> select * from inventory;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 7.0 failed 8 times, most recent failure: Lost task 0.7 in 
> stage 7.0 (TID 2532, smokeslave5.avatar.lenovomm.com): 
> java.lang.IllegalArgumentException: Field "inv_date_sk" does not exist.
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at 
> org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:251)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:361)
>   at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:123)
>   at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:112)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:278)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:262)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace: (state=,code=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-15757) Error occurs when using Spark sql "select" statement on orc file after hive sql "insert overwrite tb1 select * from sourcTb" has been executed on this orc file

2016-06-05 Thread marymwu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316168#comment-15316168
 ] 

marymwu commented on SPARK-15757:
-

Actually, the field "inv_date_sk" does exist! We have executed "desc inventory"; 
the result is attached.


> Error occurs when using Spark sql "select" statement on orc file after hive 
> sql "insert overwrite tb1 select * from sourcTb" has been executed on this 
> orc file
> ---
>
> Key: SPARK-15757
> URL: https://issues.apache.org/jira/browse/SPARK-15757
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: marymwu
>
> Error occurs when using Spark sql "select" statement on orc file after hive 
> sql "insert overwrite tb1 select * from sourcTb" has been executed
> 0: jdbc:hive2://172.19.200.158:40099/default> select * from inventory;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 7.0 failed 8 times, most recent failure: Lost task 0.7 in 
> stage 7.0 (TID 2532, smokeslave5.avatar.lenovomm.com): 
> java.lang.IllegalArgumentException: Field "inv_date_sk" does not exist.
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:252)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at 
> org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:251)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$10.apply(OrcRelation.scala:361)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:361)
>   at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:123)
>   at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:112)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:278)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(fileSourceInterfaces.scala:262)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:357)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:774)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace: (state=,code=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException

2016-06-05 Thread Jie Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316163#comment-15316163
 ] 

Jie Huang commented on SPARK-15046:
---

The interesting thing is that the bug does not reproduce when switching to 
version 1.6. According to my understanding, when we submit a new job we won't 
have the credential file yet (the first time), so this function won't be called 
at that point. That is why we did not notice the issue before. Actually, that 
code snippet has been there since 1.6.
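For illustration, a minimal sketch of the failure mode shown in the stack trace below 
(an assumed repro, not the patch): a duration carrying a unit suffix such as 
"86400079ms" cannot be read with Long.parseLong and needs a unit-aware parse instead.

{code}
// Assumed repro: parseLong rejects the "ms" suffix and throws NumberFormatException.
val raw = "86400079ms"
// java.lang.Long.parseLong(raw)           // java.lang.NumberFormatException: For input string: "86400079ms"

// A unit-aware read strips (or interprets) the suffix before converting.
val millis = raw.stripSuffix("ms").toLong  // 86400079
println(millis)
{code}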

> When running hive-thriftserver with yarn on a secure cluster the workers fail 
> with java.lang.NumberFormatException
> --
>
> Key: SPARK-15046
> URL: https://issues.apache.org/jira/browse/SPARK-15046
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Trystan Leftwich
>
> When running hive-thriftserver with yarn on a secure cluster 
> (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with 
> the following error.
> {code}
> 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
> java.lang.NumberFormatException: For input string: "86400079ms"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:441)
>   at java.lang.Long.parseLong(Long.java:483)
>   at 
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15046) When running hive-thriftserver with yarn on a secure cluster the workers fail with java.lang.NumberFormatException

2016-06-05 Thread Jie Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316162#comment-15316162
 ] 

Jie Huang commented on SPARK-15046:
---

Yes, I hit a similar issue, and after applying that patch it works.

> When running hive-thriftserver with yarn on a secure cluster the workers fail 
> with java.lang.NumberFormatException
> --
>
> Key: SPARK-15046
> URL: https://issues.apache.org/jira/browse/SPARK-15046
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Trystan Leftwich
>
> When running hive-thriftserver with yarn on a secure cluster 
> (spark.yarn.principal and spark.yarn.keytab are set) the workers fail with 
> the following error.
> {code}
> 16/04/30 22:40:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
> java.lang.NumberFormatException: For input string: "86400079ms"
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:441)
>   at java.lang.Long.parseLong(Long.java:483)
>   at 
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
>   at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at 
> org.apache.spark.SparkConf$$anonfun$getLong$2.apply(SparkConf.scala:380)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.SparkConf.getLong(SparkConf.scala:380)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.getTimeFromNowToRenewal(SparkHadoopUtil.scala:289)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.org$apache$spark$deploy$yarn$AMDelegationTokenRenewer$$scheduleRenewal$1(AMDelegationTokenRenewer.scala:89)
>   at 
> org.apache.spark.deploy.yarn.AMDelegationTokenRenewer.scheduleLoginFromKeytab(AMDelegationTokenRenewer.scala:121)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$3.apply(ApplicationMaster.scala:243)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:243)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:723)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:721)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:748)
>   at 
> org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15704) TungstenAggregate crashes

2016-06-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15704:

Assignee: (was: Koert Kuipers)

> TungstenAggregate crashes 
> --
>
> Key: SPARK-15704
> URL: https://issues.apache.org/jira/browse/SPARK-15704
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hiroshi Inoue
> Fix For: 2.0.0
>
>
> When I run DatasetBenchmark, the JVM crashes while executing the "Dataset complex 
> Aggregator" test case due to an IndexOutOfBoundsException.
> The error happens in TungstenAggregate; the mappings between bufferSerializer 
> and bufferDeserializer are broken due to an unresolved attribute.
> {quote}
> 16/06/02 01:41:05 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID 
> 232)
> java.lang.IndexOutOfBoundsException: -1
>   at 
> scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.List.apply(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate$RichAttribute.right(interfaces.scala:389)
>   at 
> org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:110)
>   at 
> org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:109)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:336)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:334)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at 

[jira] [Updated] (SPARK-15704) TungstenAggregate crashes

2016-06-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15704:

Assignee: Koert Kuipers

> TungstenAggregate crashes 
> --
>
> Key: SPARK-15704
> URL: https://issues.apache.org/jira/browse/SPARK-15704
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hiroshi Inoue
>Assignee: Koert Kuipers
> Fix For: 2.0.0
>
>
> When I run DatasetBenchmark, the JVM crashes while executing the "Dataset complex 
> Aggregator" test case due to an IndexOutOfBoundsException.
> The error happens in TungstenAggregate; the mappings between bufferSerializer 
> and bufferDeserializer are broken due to an unresolved attribute.
> {quote}
> 16/06/02 01:41:05 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID 
> 232)
> java.lang.IndexOutOfBoundsException: -1
>   at 
> scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.List.apply(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate$RichAttribute.right(interfaces.scala:389)
>   at 
> org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:110)
>   at 
> org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:109)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:336)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:334)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at 

[jira] [Resolved] (SPARK-15704) TungstenAggregate crashes

2016-06-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15704.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13446
[https://github.com/apache/spark/pull/13446]

> TungstenAggregate crashes 
> --
>
> Key: SPARK-15704
> URL: https://issues.apache.org/jira/browse/SPARK-15704
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hiroshi Inoue
> Fix For: 2.0.0
>
>
> When I run DatasetBenchmark, the JVM crashes while executing the "Dataset complex 
> Aggregator" test case due to an IndexOutOfBoundsException.
> The error happens in TungstenAggregate; the mappings between bufferSerializer 
> and bufferDeserializer are broken due to an unresolved attribute.
> {quote}
> 16/06/02 01:41:05 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID 
> 232)
> java.lang.IndexOutOfBoundsException: -1
>   at 
> scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.List.apply(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate$RichAttribute.right(interfaces.scala:389)
>   at 
> org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:110)
>   at 
> org.apache.spark.sql.execution.aggregate.TypedAggregateExpression$$anonfun$3.applyOrElse(TypedAggregateExpression.scala:109)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:265)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:307)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:356)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:270)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:336)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:334)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> 

[jira] [Resolved] (SPARK-15748) Replace inefficient foldLeft() call in PartitionStatistics

2016-06-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15748.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Replace inefficient foldLeft() call in PartitionStatistics
> --
>
> Key: SPARK-15748
> URL: https://issues.apache.org/jira/browse/SPARK-15748
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> PartitionStatistics uses foldLeft and list concatenation to flatten an 
> iterator of lists, but this is extremely inefficient compared to simply doing 
> flatMap/flatten because it performs many unnecessary object allocations. 
> Simply replacing this foldLeft by a flatMap results in fair performance gains 
> when constructing PartitionStatistics instances for tables with many columns.
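For illustration, a minimal standalone sketch of the difference (assuming a sequence of 
per-column lists; this is not the PartitionStatistics code itself): concatenating onto a 
growing accumulator copies it on every step, while flatten makes a single pass.

{code}
// foldLeft with ++ rebuilds the ever-growing accumulator on each iteration: quadratic allocations.
val slow = (1 to 1000).map(i => List(i, i, i)).foldLeft(List.empty[Int])(_ ++ _)

// flatten (or flatMap) walks each element exactly once: linear.
val fast = (1 to 1000).map(i => List(i, i, i)).flatten.toList
{code}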



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15657) RowEncoder should validate the data type of input object

2016-06-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-15657.

   Resolution: Fixed
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/13401

> RowEncoder should validate the data type of input object
> 
>
> Key: SPARK-15657
> URL: https://issues.apache.org/jira/browse/SPARK-15657
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12923) Optimize successive dapply() calls in SparkR

2016-06-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316062#comment-15316062
 ] 

Shivaram Venkataraman commented on SPARK-12923:
---

[~sunrui] Can we move this from the UDFs umbrella and retarget it to 2.1.0? 
If the change is small, it would also be cool to have it in 2.0.

> Optimize successive dapply() calls in SparkR
> 
>
> Key: SPARK-12923
> URL: https://issues.apache.org/jira/browse/SPARK-12923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> For consecutive dapply() calls on the same DataFrame, optimize them to launch the 
> R worker once instead of multiple times, for better performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15521) Add high level APIs based on dapply and gapply for easier usage

2016-06-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316060#comment-15316060
 ] 

Shivaram Venkataraman commented on SPARK-15521:
---

[~sunrui] Can we move this from the UDFs umbrella and retarget it to 2.1.0?

> Add high level APIs based on dapply and gapply for easier usage
> ---
>
> Key: SPARK-15521
> URL: https://issues.apache.org/jira/browse/SPARK-15521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Sun Rui
>
> dapply() and gapply() on SparkDataFrame are two basic functions. For easier 
> use by users in the R community, some higher-level functions can be added 
> on top of them.
> Candidates are:
> http://exposurescience.org/heR.doc/library/heR.Misc/html/dapply.html
> http://exposurescience.org/heR.doc/library/stats/html/aggregate.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316050#comment-15316050
 ] 

Yan Chen commented on SPARK-15716:
--

We think the behavior with a local directory is due to this piece of code in the 
RDD class:

{code:java}
def getCheckpointFile: Option[String] = {
  checkpointData match {
    case Some(reliable: ReliableRDDCheckpointData[T]) =>
      reliable.getCheckpointDir
    case _ => None
  }
}
{code}

Apparently a local directory is not a reliable one.
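A hedged illustration of the distinction that match makes (assumed semantics, using a 
spark-shell {{sc}}; not part of any fix here): only checkpoint() against the directory 
set via setCheckpointDir is backed by ReliableRDDCheckpointData, so getCheckpointFile 
returns a path for it, while localCheckpoint() falls into the None branch.

{code}
// Sketch: reliable vs. local checkpointing and what getCheckpointFile reports.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

val reliable = sc.parallelize(1 to 10)
reliable.checkpoint()                  // backed by ReliableRDDCheckpointData
reliable.count()                       // an action materializes the checkpoint
println(reliable.getCheckpointFile)    // Some(hdfs:///tmp/checkpoints/...)

val local = sc.parallelize(1 to 10)
local.localCheckpoint()                // backed by LocalRDDCheckpointData
local.count()
println(local.getCheckpointFile)       // None
{code}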

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I ran it, the memory usage of the driver 
> kept going up. There is no file input in our runs. The batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> and most batches are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> using the "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA

[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316045#comment-15316045
 ] 

Yan Chen edited comment on SPARK-15716 at 6/5/16 10:11 PM:
---

The same thing happens on 1.5.0, 1.6.0, 1.6.1, and the 2.0.0 preview. However, if I set 
the checkpointing directory to a local one when using only one node as both 
master and worker, the issue goes away. The behavior is the same across all of these 
versions, regardless of the environment it runs in. We have already tried 
standalone and yarn-client mode. 


was (Author: yani.chen):
The same thing happens on 1.6.0, 1.6.1, 2.0.0 preview. However, if I set the 
checkpointing directory to local one when only using one node as both master 
and worker, the issue goes aways. The behavior is the same for all the 
versions, regardless of which environment it's running. We already tried 
standalone and yarn-client mode. 

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I ran it, the memory usage of the driver 
> kept going up. There is no file input in our runs. Batch interval 

[jira] [Updated] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Chen updated SPARK-15716:
-
Environment: 
Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
SUSE Linux, CentOS 6 and CentOS 7

  was:Oracle Java 1.8.0_51, SUSE Linux


> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I ran it, the memory usage of the driver 
> kept going up. There is no file input in our runs. The batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> and most batches are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> using the "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Chen updated SPARK-15716:
-
Affects Version/s: 1.5.0

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.5.0, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, 1.8.0_85, 1.8.0_91 and 1.8.0_92
> SUSE Linux, CentOS 6 and CentOS 7
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I ran it, the memory usage of the driver 
> kept going up. There is no file input in our runs. The batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> and most batches are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> using the "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316045#comment-15316045
 ] 

Yan Chen edited comment on SPARK-15716 at 6/5/16 10:07 PM:
---

The same thing happens on 1.6.0, 1.6.1, and the 2.0.0 preview. However, if I set the 
checkpointing directory to a local one when using only one node as both master 
and worker, the issue goes away. The behavior is the same across all of these 
versions, regardless of the environment it runs in. We have already tried 
standalone and yarn-client mode. 


was (Author: yani.chen):
The same thing happens on 1.6.0, 1.6.1, 2.0.0 preview. However, if I set the 
checkpointing directory to local one when only using one node as both master 
and worker, the issue goes aways.

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I ran it, the memory usage of the driver 
> kept going up. There is no file input in our runs. The batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> and most batches are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are 

[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316047#comment-15316047
 ] 

Yan Chen commented on SPARK-15716:
--

The memory usage keeps going up and eventually makes the whole streaming process 
unresponsive. 

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I ran it, the memory usage of the driver 
> kept going up. There is no file input in our runs. The batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> and most batches are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> using the "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316045#comment-15316045
 ] 

Yan Chen commented on SPARK-15716:
--

The same thing happens on 1.6.0, 1.6.1, and the 2.0.0 preview. However, if I set the 
checkpointing directory to a local one when using only one node as both master 
and worker, the issue goes away.

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I ran it, the memory usage of the driver 
> kept going up. There is no file input in our runs. The batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> and most batches are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> using the "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Chen updated SPARK-15716:
-
Affects Version/s: 2.0.0
   1.6.0
   1.6.1

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1, 1.6.0, 1.6.1, 2.0.0
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code; when I ran it, the memory usage of the driver 
> kept going up. There is no file input in our runs. The batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> and most batches are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, which were triggered manually 
> using the "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15773) Avoid creating local variable `sc` in examples if possible

2016-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316040#comment-15316040
 ] 

Apache Spark commented on SPARK-15773:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13520

> Avoid creating local variable `sc` in examples if possible
> --
>
> Key: SPARK-15773
> URL: https://issues.apache.org/jira/browse/SPARK-15773
> Project: Spark
>  Issue Type: Task
>  Components: Examples
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Instead of using a local variable `sc` as in the following example, this issue 
> uses `spark.sparkContext`. This makes the examples more concise, and also fixes 
> some misleading usage, i.e., creating a SparkContext from a SparkSession.
> {code}
> -println("Creating SparkContext")
> -val sc = spark.sparkContext
> -
>  println("Writing local file to DFS")
>  val dfsFilename = dfsDirPath + "/dfs_read_write_test"
> -val fileRDD = sc.parallelize(fileContents)
> +val fileRDD = spark.sparkContext.parallelize(fileContents)
> {code}
> This will change 12 files (+30 lines, -52 lines).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15773) Avoid creating local variable `sc` in examples if possible

2016-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15773:


Assignee: (was: Apache Spark)

> Avoid creating local variable `sc` in examples if possible
> --
>
> Key: SPARK-15773
> URL: https://issues.apache.org/jira/browse/SPARK-15773
> Project: Spark
>  Issue Type: Task
>  Components: Examples
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Instead of using a local variable `sc` as in the following example, this issue 
> uses `spark.sparkContext`. This makes the examples more concise, and also fixes 
> some misleading usage, i.e., creating a SparkContext from a SparkSession.
> {code}
> -println("Creating SparkContext")
> -val sc = spark.sparkContext
> -
>  println("Writing local file to DFS")
>  val dfsFilename = dfsDirPath + "/dfs_read_write_test"
> -val fileRDD = sc.parallelize(fileContents)
> +val fileRDD = spark.sparkContext.parallelize(fileContents)
> {code}
> This will change 12 files (+30 lines, -52 lines).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15773) Avoid creating local variable `sc` in examples if possible

2016-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15773:


Assignee: Apache Spark

> Avoid creating local variable `sc` in examples if possible
> --
>
> Key: SPARK-15773
> URL: https://issues.apache.org/jira/browse/SPARK-15773
> Project: Spark
>  Issue Type: Task
>  Components: Examples
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Instead of using a local variable `sc` as in the following example, this issue 
> uses `spark.sparkContext`. This makes the examples more concise, and also fixes 
> some misleading usage, i.e., creating a SparkContext from a SparkSession.
> {code}
> -println("Creating SparkContext")
> -val sc = spark.sparkContext
> -
>  println("Writing local file to DFS")
>  val dfsFilename = dfsDirPath + "/dfs_read_write_test"
> -val fileRDD = sc.parallelize(fileContents)
> +val fileRDD = spark.sparkContext.parallelize(fileContents)
> {code}
> This will change 12 files (+30 lines, -52 lines).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13480) Regression with percentile() + function in GROUP BY

2016-06-05 Thread Abhinit Kumar Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316037#comment-15316037
 ] 

Abhinit Kumar Sinha commented on SPARK-13480:
-

Can I work on this issue?

> Regression with percentile() + function in GROUP BY
> ---
>
> Key: SPARK-13480
> URL: https://issues.apache.org/jira/browse/SPARK-13480
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jaka Jancar
>
> {code}
> SELECT
>   percentile(load_time, 0.50)
> FROM
>   (
> select '2000-01-01' queued_at, 100 load_time
> union all
> select '2000-01-01' queued_at, 110 load_time
> union all
> select '2000-01-01' queued_at, 120 load_time
>   ) t
> GROUP BY
>   year(queued_at)
> {code}
> fails with
> {code}
> Error in SQL statement: SparkException: Job aborted due to stage failure: 
> Task 0 in stage 6067.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 6067.0 (TID 268774, ip-10-0-163-203.ec2.internal): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: year(cast(queued_at#78201 as date))#78209
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:86)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:85)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:62)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(Projection.scala:62)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:234)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$newMutableProjection$1.apply(SparkPlan.scala:234)
>   at 
> org.apache.spark.sql.execution.Exchange.org$apache$spark$sql$execution$Exchange$$getPartitionKeyExtractor$1(Exchange.scala:197)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$3.apply(Exchange.scala:209)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$3.apply(Exchange.scala:208)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Couldn't find 
> year(cast(queued_at#78201 as date))#78209 in [queued_at#78201,load_time#78202]
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:92)
>   at 
> 
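
For what it's worth, here is a hedged sketch (my own, written against the 2.x 
SparkSession API rather than the 1.6 context used in the report; the object name is 
made up) of how one might reproduce this. The query is the one quoted above, and Hive 
support is enabled because percentile() resolves to the Hive UDAF here.

{code}
import org.apache.spark.sql.SparkSession

object PercentileGroupByRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("percentile-groupby-repro")
      .enableHiveSupport()  // percentile() comes from the Hive function registry
      .getOrCreate()

    // percentile() combined with a function call in GROUP BY, as in the report.
    spark.sql(
      """SELECT percentile(load_time, 0.50)
        |FROM (
        |  SELECT '2000-01-01' queued_at, 100 load_time
        |  UNION ALL
        |  SELECT '2000-01-01' queued_at, 110 load_time
        |  UNION ALL
        |  SELECT '2000-01-01' queued_at, 120 load_time
        |) t
        |GROUP BY year(queued_at)""".stripMargin).show()

    spark.stop()
  }
}
{code}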

[jira] [Updated] (SPARK-15773) Avoid creating local variable `sc` in examples if possible

2016-06-05 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15773:
--
Summary: Avoid creating local variable `sc` in examples if possible  (was: 
Avoid creating local variable `sc` in examples)

> Avoid creating local variable `sc` in examples if possible
> --
>
> Key: SPARK-15773
> URL: https://issues.apache.org/jira/browse/SPARK-15773
> Project: Spark
>  Issue Type: Task
>  Components: Examples
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Instead of using a local variable `sc` as in the following example, this issue 
> uses `spark.sparkContext`. This makes the examples more concise, and also fixes 
> some misleading usage, i.e., creating a SparkContext from a SparkSession.
> {code}
> -println("Creating SparkContext")
> -val sc = spark.sparkContext
> -
>  println("Writing local file to DFS")
>  val dfsFilename = dfsDirPath + "/dfs_read_write_test"
> -val fileRDD = sc.parallelize(fileContents)
> +val fileRDD = spark.sparkContext.parallelize(fileContents)
> {code}
> This will change 12 files (+30 lines, -52 lines).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15773) Avoid creating local variable `sc` in examples

2016-06-05 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-15773:
-

 Summary: Avoid creating local variable `sc` in examples
 Key: SPARK-15773
 URL: https://issues.apache.org/jira/browse/SPARK-15773
 Project: Spark
  Issue Type: Task
  Components: Examples
Reporter: Dongjoon Hyun
Priority: Minor


Instead of using a local variable `sc` as in the following example, this issue 
uses `spark.sparkContext`. This makes the examples more concise, and also fixes 
some misleading usage, i.e., creating a SparkContext from a SparkSession.

{code}
-println("Creating SparkContext")
-val sc = spark.sparkContext
-
 println("Writing local file to DFS")
 val dfsFilename = dfsDirPath + "/dfs_read_write_test"
-val fileRDD = sc.parallelize(fileContents)
+val fileRDD = spark.sparkContext.parallelize(fileContents)
{code}

This will change 12 files (+30 lines, -52 lines).
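
For illustration, a minimal, self-contained sketch of the target pattern (the object 
name, app name, and sample contents below are made up, not taken from the actual 
examples being changed):

{code}
import org.apache.spark.sql.SparkSession

object NoLocalScSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NoLocalScSketch").getOrCreate()

    val fileContents = Seq("line 1", "line 2", "line 3")
    // Reach the SparkContext through the session instead of binding it to `sc`.
    val fileRDD = spark.sparkContext.parallelize(fileContents)
    println(s"fileRDD has ${fileRDD.count()} elements")

    spark.stop()
  }
}
{code}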



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15159) SparkSession R API

2016-06-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316031#comment-15316031
 ] 

Shivaram Venkataraman commented on SPARK-15159:
---

[~felixcheung] Are you working on this? Just checking, as this would be great 
to have for the 2.0 RCs.

cc [~rxin]

> SparkSession R API
> --
>
> Key: SPARK-15159
> URL: https://issues.apache.org/jira/browse/SPARK-15159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>Priority: Blocker
>
> HiveContext is to be deprecated in 2.0. Replace it with 
> SparkSession.builder.enableHiveSupport in SparkR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15772) Improve Scala API docs

2016-06-05 Thread nirav patel (JIRA)
nirav patel created SPARK-15772:
---

 Summary: Improve Scala API docs 
 Key: SPARK-15772
 URL: https://issues.apache.org/jira/browse/SPARK-15772
 Project: Spark
  Issue Type: Improvement
  Components: docs, Documentation
Reporter: nirav patel


Hi, I just found out that the Spark Python API docs are much more elaborate than 
their Scala counterparts, e.g. 

https://spark.apache.org/docs/1.4.1/api/python/pyspark.html?highlight=treereduce#pyspark.RDD.treeReduce
https://spark.apache.org/docs/1.5.2/api/python/pyspark.html?highlight=treereduce#pyspark.RDD

There are clear explanations of parameters, and there are examples as well. I 
think this would be a great improvement for the Scala API docs too. It would make 
the API friendlier in the first place.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316017#comment-15316017
 ] 

Xiao Li commented on SPARK-15691:
-

Sure, will follow your suggestions. Thanks!

> Refactor and improve Hive support
> -
>
> Key: SPARK-15691
> URL: https://issues.apache.org/jira/browse/SPARK-15691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Move the Hive specific catalog logic into HiveExternalCatalog.
>   -- Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
>   -- Move using properties to store data source options into 
> HiveExternalCatalog.
>   -- Potentially more.
> - Remove Hive's specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-05 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15316010#comment-15316010
 ] 

Reynold Xin commented on SPARK-15691:
-

[~smilegator] -- do you mind using google docs for the design doc? It'd be 
easier to leave comments in line.

One piece of high-level feedback I have is that the current doc is very bottom-up: it 
talks about functions, APIs, and code to move from one place to another. These are 
useful, but it'd be great to start the doc with something high level, e.g. what 
components/classes we should have in an ideal end state.

Another high-level comment (you might be thinking about some of it already but 
it is not super clear from looking at your current doc): often there might be 
multiple alternatives and it is good to discuss their tradeoffs (or if one 
dominates the other). For example, parquet/orc conversion -- I can think of two 
ways to do it. One is to put it in HiveExternalCatalog, and another is to move 
all of those into the more general handling in SessionCatalog. The two have 
their own pros and cons.


> Refactor and improve Hive support
> -
>
> Key: SPARK-15691
> URL: https://issues.apache.org/jira/browse/SPARK-15691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Move the Hive specific catalog logic into HiveExternalCatalog.
>   -- Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
>   -- Move using properties to store data source options into 
> HiveExternalCatalog.
>   -- Potentially more.
> - Remove Hive's specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14761) PySpark DataFrame.join should reject invalid join methods even when join columns are not specified

2016-06-05 Thread Bijay Kumar Pathak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bijay Kumar Pathak updated SPARK-14761:
---
Comment: was deleted

(was: Hi [~joshrosen], how do we handle the on=None while passing to JVM api, 
since passing None throws {{java.lang.NullPointerException}} in my regression 
test.)

> PySpark DataFrame.join should reject invalid join methods even when join 
> columns are not specified
> --
>
> Key: SPARK-14761
> URL: https://issues.apache.org/jira/browse/SPARK-14761
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Josh Rosen
>Priority: Minor
>  Labels: starter
>
> In PySpark, the following invalid DataFrame join will not result in an error:
> {code}
> df1.join(df2, how='not-a-valid-join-type')
> {code}
> The signature for `join` is
> {code}
> def join(self, other, on=None, how=None):
> {code}
> and its code ends up completely skipping handling of the `how` parameter when 
> `on` is `None`:
> {code}
> if on is not None and not isinstance(on, list):
>     on = [on]
> if on is None or len(on) == 0:
>     jdf = self._jdf.join(other._jdf)
> elif isinstance(on[0], basestring):
>     if how is None:
>         jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
>     else:
>         assert isinstance(how, basestring), "how should be basestring"
>         jdf = self._jdf.join(other._jdf, self._jseq(on), how)
> else:
> {code}
> Given that this behavior can mask user errors (as in the above example), I 
> think that we should refactor this to first process all arguments and then 
> call the three-argument {{_.jdf.join}}. This would handle the above invalid 
> example by passing all arguments to the JVM DataFrame for analysis.
> I'm not planning to work on this myself, so this bugfix (+ regression test!) 
> is up for grabs in case someone else wants to do it.
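
As a side note, a small sketch (my own, in Scala; the object and column names are made 
up) showing that the JVM-side three-argument join already rejects a bogus join type 
eagerly, which is exactly the error the proposed refactoring would surface to PySpark 
users:

{code}
import org.apache.spark.sql.SparkSession

object InvalidJoinTypeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("invalid-join-type-sketch")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "a"), (2, "b")).toDF("id", "left_val")
    val df2 = Seq((1, "x"), (3, "y")).toDF("id", "right_val")

    // In the versions I've checked, the join type string is validated when the
    // join is constructed, so the bad value fails fast instead of being ignored.
    try {
      df1.join(df2, Seq("id"), "not-a-valid-join-type").show()
    } catch {
      case e: IllegalArgumentException =>
        println(s"Rejected as expected: ${e.getMessage}")
    }

    spark.stop()
  }
}
{code}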



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15545) R remove non-exported unused methods, like jsonRDD

2016-06-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315991#comment-15315991
 ] 

Shivaram Venkataraman commented on SPARK-15545:
---

That discussion was before we had UDFs implemented in dapply, gapply etc. I 
think it's fine to start working on this. But we might not pull this into 2.0 and 
will probably include the removals in 2.1.0 (it'll also give us a chance to 
announce their deprecation with the 2.0.0 release)

> R remove non-exported unused methods, like jsonRDD
> --
>
> Key: SPARK-15545
> URL: https://issues.apache.org/jira/browse/SPARK-15545
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Minor
>
> Need to review what should be removed.
> One reason not to remove this right away is that we have been talking 
> about calling internal methods via `SparkR:::jsonRDD` for this and other RDD 
> methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15770) annotation audit for Experimental and DeveloperApi

2016-06-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15770.
-
   Resolution: Fixed
 Assignee: zhengruifeng
Fix Version/s: 2.0.0

> annotation audit for Experimental and DeveloperApi
> --
>
> Key: SPARK-15770
> URL: https://issues.apache.org/jira/browse/SPARK-15770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.0.0
>
>
> 1, remove comments {{:: Experimental ::}} for non-experimental API
> 2, add comments {{:: Experimental ::}} for experimental API
> 3, add comments {{:: DeveloperApi ::}} for developerApi API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Chen updated SPARK-15716:
-
Comment: was deleted

(was: Actually I tried in another environment with version 1.6.1, and the memory 
leak showed up again. We are still investigating it. The 1.6.0 run was done on 
the Hortonworks sandbox image version 2.4 in VirtualBox; the 1.6.1 run was done 
in both YARN and standalone mode on a real cluster. I also tried to run it on a 
single machine with master local[2] on version 1.4.1, and the memory leak did not 
appear.)

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code, when I ran it, the memory usage of driver 
> keeps going up. There is no file input in our runs. Batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> while most of which are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The right most four red triangles are full GC's which are triggered manually 
> by using "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Chen updated SPARK-15716:
-
Comment: was deleted

(was: We already tried 1.6.0; the leak does not happen there. Could I ask what 
happens if the bug only occurs on an old version? Do we fix it?)

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code, when I ran it, the memory usage of driver 
> keeps going up. There is no file input in our runs. Batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> while most of which are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The right most four red triangles are full GC's which are triggered manually 
> by using "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315972#comment-15315972
 ] 

Yan Chen edited comment on SPARK-15716 at 6/5/16 6:29 PM:
--

Actually I tried in another environment with version 1.6.1, and the memory leak 
showed up again. We are still investigating it. The 1.6.0 run was done on the 
Hortonworks sandbox image version 2.4 in VirtualBox; the 1.6.1 run was done in 
both YARN and standalone mode on a real cluster. I also tried to run it on a 
single machine with master local[2] on version 1.4.1, and the memory leak did not 
appear.


was (Author: yani.chen):
Actually I tried in another environment with version 1.6.1, and the memory leak 
showed up again. We are still investigating it. The 1.6.0 run was done on the 
Hortonworks sandbox image version 2.4 in VirtualBox; the 1.6.1 run was done in 
both YARN and standalone mode on a real cluster.

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code, when I ran it, the memory usage of driver 
> keeps going up. There is no file input in our runs. Batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 

[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315972#comment-15315972
 ] 

Yan Chen edited comment on SPARK-15716 at 6/5/16 6:25 PM:
--

Actually I tried in another environment with version 1.6.1, and the memory leak 
showed up again. We are still investigating it. The 1.6.0 run was done on the 
Hortonworks sandbox image version 2.4 in VirtualBox; the 1.6.1 run was done in 
both YARN and standalone mode on a real cluster.


was (Author: yani.chen):
Actually I tried in another environment with version 1.6.1, and the memory leak 
showed up again. We are still investigating it.

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code, when I ran it, the memory usage of driver 
> keeps going up. There is no file input in our runs. Batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> while most of which are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The right most four red triangles are full GC's which are triggered manually 
> by using "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I 

[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315972#comment-15315972
 ] 

Yan Chen edited comment on SPARK-15716 at 6/5/16 6:19 PM:
--

Actually I tried in another environment with version 1.6.1, and the memory leak 
showed up again. We are still investigating it.


was (Author: yani.chen):
Actually I tried in another environment with version 1.6.1, and the same 
behaviour showed up. We are still investigating it.

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code, when I ran it, the memory usage of driver 
> keeps going up. There is no file input in our runs. Batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> while most of which are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The right most four red triangles are full GC's which are triggered manually 
> by using "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, 

[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315972#comment-15315972
 ] 

Yan Chen commented on SPARK-15716:
--

Actually I tried in another environment with version 1.6.1, and the same 
behaviour showed up. We are still investigating it.

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code, when I ran it, the memory usage of driver 
> keeps going up. There is no file input in our runs. Batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> while most of which are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The right most four red triangles are full GC's which are triggered manually 
> by using "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315964#comment-15315964
 ] 

Xiao Li commented on SPARK-15691:
-

In the PDF document, all the underlined text consists of hyperlinks that point to 
the related content.

> Refactor and improve Hive support
> -
>
> Key: SPARK-15691
> URL: https://issues.apache.org/jira/browse/SPARK-15691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Move the Hive specific catalog logic into HiveExternalCatalog.
>   -- Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
>   -- Move using properties to store data source options into 
> HiveExternalCatalog.
>   -- Potentially more.
> - Remove HIve's specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15691) Refactor and improve Hive support

2016-06-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315960#comment-15315960
 ] 

Xiao Li commented on SPARK-15691:
-

For refactoring {{HiveMetastoreCatalog.scala}}, I just finished the design doc. 
The PDF version is available at the following link: 
https://www.dropbox.com/s/tsaoq2joegkdh1h/2016.06.05.HiveMetastoreCatalog.scala%20Refactoring.pdf?dl=0
 The original Markdown file can be downloaded via 
https://www.dropbox.com/s/uita63wkdrmuqr2/2016.06.05.HiveMetastoreCatalog.scala%20Refactoring.md?dl=0

Please let me know if this is the right direction, and correct me if anything is 
not appropriate.  [~rxin]

Thank you very much!

> Refactor and improve Hive support
> -
>
> Key: SPARK-15691
> URL: https://issues.apache.org/jira/browse/SPARK-15691
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> Hive support is important to Spark SQL, as many Spark users use it to read 
> from Hive. The current architecture is very difficult to maintain, and this 
> ticket tracks progress towards getting us to a sane state.
> A number of things we want to accomplish are:
> - Move the Hive specific catalog logic into HiveExternalCatalog.
>   -- Remove HiveSessionCatalog. All Hive-related stuff should go into 
> HiveExternalCatalog. This would require moving caching either into 
> HiveExternalCatalog, or just into SessionCatalog.
>   -- Move using properties to store data source options into 
> HiveExternalCatalog.
>   -- Potentially more.
> - Remove HIve's specific ScriptTransform implementation and make it more 
> general so we can put it in sql/core.
> - Implement HiveTableScan (and write path) as a data source, so we don't need 
> a special planner rule for HiveTableScan.
> - Remove HiveSharedState and HiveSessionState.
> One thing that is still unclear to me is how to work with Hive UDF support. 
> We might still need a special planner rule there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15723) SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ name

2016-06-05 Thread Brett Randall (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315938#comment-15315938
 ] 

Brett Randall commented on SPARK-15723:
---

Thanks for merging.  And thanks for the Scala REPL test - I can confirm that 
this is driven by a combination of *both* the default TimeZone and the default 
Locale: the default Locale impacts the interpretation of the short TZ code, which 
makes sense.

{{Australia/Sydney/en_AU}} -> {color:red}*false*{color}
{noformat}
scala -J-Duser.timezone="Australia/Sydney" -J-Duser.country=AU
scala> val time = (new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz")).parse("2015-02-20T17:21:17.190EST").getTime
time: Long = 1424413277190

scala> time == 1424470877190L
res0: Boolean = false
{noformat}

{{Australia/Sydney/en_US}} -> {color:red}*false*{color}
{noformat}
scala -J-Duser.timezone="Australia/Sydney" -J-Duser.country=US
scala> val time = (new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz")).parse("2015-02-20T17:21:17.190EST").getTime
time: Long = 1424413277190

scala> time == 1424470877190L
res0: Boolean = false
{noformat}

{{America/New_York/en_US}} -> {color:green}*true*{color}
{noformat}
scala -J-Duser.timezone="America/New_York" -J-Duser.country=US
scala> val time = (new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz")).parse("2015-02-20T17:21:17.190EST").getTime
time: Long = 1424470877190

scala> time == 1424470877190L
res0: Boolean = true
{noformat}

So you were correct - this _can_ be disambiguated by applying a bias to the SDF 
in the code, but it would necessarily be a fixed bias, and it has to be done 
with a {{Calendar}}, not a {{TimeZone}}:

{code}
sdf.setCalendar(Calendar.getInstance(TimeZone.getTimeZone("America/New_York"), Locale.US))
{code}

I'm not certain this is better or more correct, though, but it would remove any 
ambiguity in the short TZ codes - it could be documented that all short TZ codes 
are evaluated as if they were in this fixed TZ/Locale.  That might upset someone 
deploying who wants {{MST}} = Malaysia Standard Time and not Mountain Time.  
Make a note here if you think it is worth pursuing further, but I suspect we 
just have to honour the local environment defaults and discourage abbreviated 
TZs.  And the test fix is merged now, so all good, thanks.
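
For completeness, a minimal runnable sketch of that suggestion (an illustration only, not the merged fix; {{America/New_York}} and {{Locale.US}} are just one choice of fixed bias):

{code}
import java.text.SimpleDateFormat
import java.util.{Calendar, Locale, TimeZone}

val sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz", Locale.US)
// Pin the calendar so short zone names are resolved against a fixed TimeZone/Locale
// rather than the machine defaults.
sdf.setCalendar(Calendar.getInstance(TimeZone.getTimeZone("America/New_York"), Locale.US))
val time = sdf.parse("2015-02-20T17:21:17.190EST").getTime
// expected to be 1424470877190L when EST resolves to US Eastern Standard Time
{code}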

> SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ 
> name
> --
>
> Key: SPARK-15723
> URL: https://issues.apache.org/jira/browse/SPARK-15723
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Brett Randall
>Assignee: Brett Randall
>Priority: Minor
>  Labels: test
> Fix For: 1.6.2, 2.0.0
>
>
> {{org.apache.spark.status.api.v1.SimpleDateParamSuite}} has this assertion:
> {code}
> new SimpleDateParam("2015-02-20T17:21:17.190EST").timestamp should be 
> (1424470877190L)
> {code}
> This test is fragile and fails when executing in an environment where the 
> local default timezone causes {{EST}} to be interpreted as something other 
> than US Eastern Standard Time.  If your local timezone is 
> {{Australia/Sydney}}, then {{EST}} equates to {{GMT+10}} and you will get:
> {noformat}
> date parsing *** FAILED ***
> 1424413277190 was not equal to 1424470877190 (SimpleDateParamSuite.scala:29)
> {noformat}
> In short, {{SimpleDateFormat}} is sensitive to the local default {{TimeZone}} 
> when interpreting short zone names.  According to the {{TimeZone}} javadoc, 
> they ought not be used:
> {quote}
> Three-letter time zone IDs
> For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such 
> as "PST", "CTT", "AST") are also supported. However, their use is deprecated 
> because the same abbreviation is often used for multiple time zones (for 
> example, "CST" could be U.S. "Central Standard Time" and "China Standard 
> Time"), and the Java platform can then only recognize one of them.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator

2016-06-05 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-15771:

Description: 
Since SPARK-15617 deprecated {{precision}} in 
{{MulticlassClassificationEvaluator}}, many ML examples are broken.
{code}
pyspark.sql.utils.IllegalArgumentException: 
u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName 
given invalid value precision.'
{code}
We should use {{accuracy}} to replace {{precision}} in these examples.

  was:Since SPARK-15617 deprecated {{precision}} in 
{{MulticlassClassificationEvaluator}}, many ML examples are broken. We should use 
{{accuracy}} to replace {{precision}} in these examples. 
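
For reference, a minimal sketch of the change in the Scala examples ({{predictions}}, {{indexedLabel}} and {{prediction}} are placeholder names from a typical pipeline):

{code}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")  // previously "precision", deprecated by SPARK-15617
val accuracy = evaluator.evaluate(predictions)
{code}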


> Many ML examples broken since we deprecated `precision` in 
> MulticlassClassificationEvaluator
> 
>
> Key: SPARK-15771
> URL: https://issues.apache.org/jira/browse/SPARK-15771
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML
>Reporter: Yanbo Liang
>
> Since SPARK-15617 deprecated {{precision}} in 
> {{MulticlassClassificationEvaluator}}, many ML examples broken.
> {code}
> pyspark.sql.utils.IllegalArgumentException: 
> u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName 
> given invalid value precision.'
> {code}
> We should use {{accuracy}} to replace {{precision}} in these examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9834) Normal equation solver for ordinary least squares

2016-06-05 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315935#comment-15315935
 ] 

Debasish Das edited comment on SPARK-9834 at 6/5/16 4:49 PM:
-

Do you have runtime comparisons showing that, when features <= 4096, OLS using 
normal equations is faster than BFGS? I am extending OLS to sparse features, and 
it would be great if you could point to the runtime experiments you have done...
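
For anyone reproducing such a comparison, a hedged sketch of how the two solvers are selected through the ML API (the {{training}} DataFrame is a placeholder):

{code}
import org.apache.spark.ml.regression.LinearRegression

val normal = new LinearRegression().setSolver("normal")  // one pass to collect AtA/Atb, then solve on the driver
val lbfgs  = new LinearRegression().setSolver("l-bfgs")  // iterative baseline for the comparison
val normalModel = normal.fit(training)
val lbfgsModel  = lbfgs.fit(training)
{code}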


was (Author: debasish83):
Do you have runtime comparisons showing that, when features <= 4096, OLS using 
normal equations is faster than BFGS?

> Normal equation solver for ordinary least squares
> -
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.6.0
>
>
> Add normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solve the problem 
> on driver. It works well when the problem is not very ill-conditioned and not 
> having many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator

2016-06-05 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-15771:

Description: Since SPARK-15617 deprecated {{precision}} in 
{{MulticlassClassificationEvaluator}}, many ML examples are broken. We should use 
{{accuracy}} to replace {{precision}} in these examples.   (was: Many ML 
examples broken since we deprecated `precision` in 
MulticlassClassificationEvaluator.)

> Many ML examples broken since we deprecated `precision` in 
> MulticlassClassificationEvaluator
> 
>
> Key: SPARK-15771
> URL: https://issues.apache.org/jira/browse/SPARK-15771
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML
>Reporter: Yanbo Liang
>
> Since SPARK-15617 deprecated {{precision}} in 
> {{MulticlassClassificationEvaluator}}, many ML examples broken. We should use 
> {{accuracy}} to replace {{precision}} in these examples. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator

2016-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15771:


Assignee: Apache Spark

> Many ML examples broken since we deprecated `precision` in 
> MulticlassClassificationEvaluator
> 
>
> Key: SPARK-15771
> URL: https://issues.apache.org/jira/browse/SPARK-15771
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Many ML examples broken since we deprecated `precision` in 
> MulticlassClassificationEvaluator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator

2016-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315936#comment-15315936
 ] 

Apache Spark commented on SPARK-15771:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/13519

> Many ML examples broken since we deprecated `precision` in 
> MulticlassClassificationEvaluator
> 
>
> Key: SPARK-15771
> URL: https://issues.apache.org/jira/browse/SPARK-15771
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML
>Reporter: Yanbo Liang
>
> Many ML examples broken since we deprecated `precision` in 
> MulticlassClassificationEvaluator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator

2016-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15771:


Assignee: (was: Apache Spark)

> Many ML examples broken since we deprecated `precision` in 
> MulticlassClassificationEvaluator
> 
>
> Key: SPARK-15771
> URL: https://issues.apache.org/jira/browse/SPARK-15771
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML
>Reporter: Yanbo Liang
>
> Many ML examples broken since we deprecated `precision` in 
> MulticlassClassificationEvaluator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9834) Normal equation solver for ordinary least squares

2016-06-05 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315935#comment-15315935
 ] 

Debasish Das commented on SPARK-9834:
-

Do you have runtime comparisons showing that, when features <= 4096, OLS using 
normal equations is faster than BFGS?

> Normal equation solver for ordinary least squares
> -
>
> Key: SPARK-9834
> URL: https://issues.apache.org/jira/browse/SPARK-9834
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.6.0
>
>
> Add normal equation solver for ordinary least squares with not many features. 
> The approach requires one pass to collect AtA and Atb, then solve the problem 
> on driver. It works well when the problem is not very ill-conditioned and not 
> having many columns. It also provides R-like summary statistics.
> We can hide this implementation under LinearRegression. It is triggered when 
> there are no more than, e.g., 4096 features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15771) Many ML examples broken since we deprecated `precision` in MulticlassClassificationEvaluator

2016-06-05 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-15771:
---

 Summary: Many ML examples broken since we deprecated `precision` 
in MulticlassClassificationEvaluator
 Key: SPARK-15771
 URL: https://issues.apache.org/jira/browse/SPARK-15771
 Project: Spark
  Issue Type: Bug
  Components: Examples, ML
Reporter: Yanbo Liang


Many ML examples broken since we deprecated `precision` in 
MulticlassClassificationEvaluator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15723) SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ name

2016-06-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15723.
---
   Resolution: Fixed
 Assignee: Brett Randall
Fix Version/s: 2.0.0
   1.6.2

Resolved by https://github.com/apache/spark/pull/13462

> SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ 
> name
> --
>
> Key: SPARK-15723
> URL: https://issues.apache.org/jira/browse/SPARK-15723
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Brett Randall
>Assignee: Brett Randall
>Priority: Minor
>  Labels: test
> Fix For: 1.6.2, 2.0.0
>
>
> {{org.apache.spark.status.api.v1.SimpleDateParamSuite}} has this assertion:
> {code}
> new SimpleDateParam("2015-02-20T17:21:17.190EST").timestamp should be 
> (1424470877190L)
> {code}
> This test is fragile and fails when executing in an environment where the 
> local default timezone causes {{EST}} to be interpreted as something other 
> than US Eastern Standard Time.  If your local timezone is 
> {{Australia/Sydney}}, then {{EST}} equates to {{GMT+10}} and you will get:
> {noformat}
> date parsing *** FAILED ***
> 1424413277190 was not equal to 1424470877190 (SimpleDateParamSuite.scala:29)
> {noformat}
> In short, {{SimpleDateFormat}} is sensitive to the local default {{TimeZone}} 
> when interpreting short zone names.  According to the {{TimeZone}} javadoc, 
> they ought not be used:
> {quote}
> Three-letter time zone IDs
> For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such 
> as "PST", "CTT", "AST") are also supported. However, their use is deprecated 
> because the same abbreviation is often used for multiple time zones (for 
> example, "CST" could be U.S. "Central Standard Time" and "China Standard 
> Time"), and the Java platform can then only recognize one of them.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15723) SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ name

2016-06-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315898#comment-15315898
 ] 

Sean Owen commented on SPARK-15723:
---

Yeah I get the ambiguity problem. OK, so you're saying it doesn't help to set 
the time zone on the SimpleDateFormat, that the disambiguation is always 
relative to the machine's time zone and not what is configured for the 
SimpleDateFormat?

I tried this FWIW and it gave the 'right' answer according to the current test, 
not sure what's different:
{code}
$ scala -J-Duser.timezone="Australia/Sydney"
scala> System.getProperty("user.timezone")
res0: String = Australia/Sydney
...
scala> val sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz")
sdf: java.text.SimpleDateFormat = java.text.SimpleDateFormat@8a9df61b
scala> sdf.parse("2015-02-20T17:21:17.190EST").getTime
res1: Long = 1424470877190
{code}

Anyway I think just patching up the test is OK for now.

> SimpleDateParamSuite test is locale-fragile and relies on deprecated short TZ 
> name
> --
>
> Key: SPARK-15723
> URL: https://issues.apache.org/jira/browse/SPARK-15723
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Brett Randall
>Priority: Minor
>  Labels: test
>
> {{org.apache.spark.status.api.v1.SimpleDateParamSuite}} has this assertion:
> {code}
> new SimpleDateParam("2015-02-20T17:21:17.190EST").timestamp should be 
> (1424470877190L)
> {code}
> This test is fragile and fails when executing in an environment where the 
> local default timezone causes {{EST}} to be interpreted as something other 
> than US Eastern Standard Time.  If your local timezone is 
> {{Australia/Sydney}}, then {{EST}} equates to {{GMT+10}} and you will get:
> {noformat}
> date parsing *** FAILED ***
> 1424413277190 was not equal to 1424470877190 (SimpleDateParamSuite.scala:29)
> {noformat}
> In short, {{SimpleDateFormat}} is sensitive to the local default {{TimeZone}} 
> when interpreting short zone names.  According to the {{TimeZone}} javadoc, 
> they ought not be used:
> {quote}
> Three-letter time zone IDs
> For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such 
> as "PST", "CTT", "AST") are also supported. However, their use is deprecated 
> because the same abbreviation is often used for multiple time zones (for 
> example, "CST" could be U.S. "Central Standard Time" and "China Standard 
> Time"), and the Java platform can then only recognize one of them.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315895#comment-15315895
 ] 

Sean Owen commented on SPARK-15716:
---

[~HalfVim] are you experiencing out-of-memory errors? It's normal for the JVM 
to keep using more memory even after GCs; it's of course not normal if it uses a 
lot of memory and is then unable to free it. For example: run with a smaller 
heap. Is it still fine? Then this isn't a problem.
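
One way to check whether the growth is actually unreclaimable (an illustrative sketch, not something from this thread; {{stream}} is a placeholder for any DStream in the job) is to log the used heap right after a forced GC on the driver, mirroring the manual {{jcmd <pid> GC.run}} runs described in the issue:

{code}
import java.lang.management.ManagementFactory
import org.apache.spark.streaming.dstream.DStream

def logHeapAfterGC[T](stream: DStream[T]): Unit = stream.foreachRDD { rdd =>
  rdd.count()
  System.gc()  // debugging only; forces a full GC
  val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
  println(s"driver heap used after GC: ${heap.getUsed >> 20} MB")
}
{code}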

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream<LongWritable, Text> inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream<LongWritable, Text> usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code, when I ran it, the memory usage of driver 
> keeps going up. There is no file input in our runs. Batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> while most of which are below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The right most four red triangles are full GC's which are triggered manually 
> by using "jcmd pid GC.run" command.
> I also did more experiments in the second and third comment I posted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Comment Edited] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314190#comment-15314190
 ] 

Yan Chen edited comment on SPARK-15716 at 6/5/16 2:05 PM:
--

I also tried to run Scala code adapted from the example code in 
https://github.com/apache/spark/blob/branch-1.4/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala.

The code:

{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming._

import org.apache.log4j.{Level, Logger}

object StatefulNetworkWordCount {
  def main(args: Array[String]) {

val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
if (!log4jInitialized) {
  Logger.getRootLogger.setLevel(Level.WARN)
}

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.sum

  val previousCount = state.getOrElse(0)

  Some(currentCount + previousCount)
}

val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) 
=> {
  iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
}

val sparkConf = new SparkConf()
  .setAppName("StatefulNetworkWordCount")
sparkConf.set("spark.streaming.minRememberDuration", "180s")
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
sparkConf.set("spark.streaming.unpersist", "true")
sparkConf.set("spark.streaming.ui.retainedBatches", "10")
sparkConf.set("spark.ui.retainedJobs", "10")
sparkConf.set("spark.ui.retainedStages", "10")
sparkConf.set("spark.worker.ui.retainedExecutors", "10")
sparkConf.set("spark.worker.ui.retainedDrivers", "10")
sparkConf.set("spark.sql.ui.retainedExecutions", "10")
sparkConf.set("spark.cleaner.ttl", "240s")

// Create the context with a 1 second batch size
val ssc = new StreamingContext(sparkConf, Milliseconds(args(2).toLong))
ssc.checkpoint(args(1))

// Initial RDD input to updateStateByKey
val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 
1)))

// Create a ReceiverInputDStream on target ip:port and count the
// words in input stream of \n delimited test (eg. generated by 'nc')
val lines = ssc.textFileStream(args(0))//ssc.socketTextStream(args(0), 
args(1).toInt)
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))

// Update the cumulative count using updateStateByKey
// This will give a Dstream made of state (which is the cumulative count of 
the words)
val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
  new HashPartitioner (ssc.sparkContext.defaultParallelism), true, 
initialRDD)
stateDstream.print()
ssc.start()
ssc.awaitTermination()
  }
}
{code}

The same behavior happened.



was (Author: yani.chen):
I also tried to run Scala code adapted from the example code in 
https://github.com/apache/spark/blob/branch-1.4/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala.

The code:

{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming._

import org.apache.log4j.{Level, Logger}

object StatefulNetworkWordCount {
  def main(args: Array[String]) {

val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
if (!log4jInitialized) {
  Logger.getRootLogger.setLevel(Level.WARN)
}

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.sum

  val previousCount = state.getOrElse(0)

  Some(currentCount + previousCount)
}

val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) 
=> {
  iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
}

val sparkConf = new SparkConf()
  .setAppName("StatefulNetworkWordCount")
sparkConf.set("spark.streaming.minRememberDuration", "180s")
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
sparkConf.set("spark.streaming.unpersist", "true")
sparkConf.set("spark.streaming.ui.retainedBatches", "10")
sparkConf.set("spark.ui.retainedJobs", "10")
sparkConf.set("spark.ui.retainedStages", "10")
sparkConf.set("spark.worker.ui.retainedExecutors", "10")
sparkConf.set("spark.worker.ui.retainedDrivers", "10")
sparkConf.set("spark.sql.ui.retainedExecutions", "10")
sparkConf.set("spark.cleaner.ttl", "240s")

// Create the context with a 1 second batch size
val ssc = new StreamingContext(sparkConf, Milliseconds(args(2).toLong))
ssc.checkpoint(args(1))

// Initial RDD input to updateStateByKey
val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 
1)))

// Create a ReceiverInputDStream on target ip:port and count the
// words in input stream of \n 

[jira] [Commented] (SPARK-15716) Memory usage of driver keeps growing up in Spark Streaming

2016-06-05 Thread Yan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315890#comment-15315890
 ] 

Yan Chen commented on SPARK-15716:
--

We already tried 1.6.0; the issue does not happen there. What happens when a bug 
only occurs on an old version? Will it be fixed?

> Memory usage of driver keeps growing up in Spark Streaming
> --
>
> Key: SPARK-15716
> URL: https://issues.apache.org/jira/browse/SPARK-15716
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
> Environment: Oracle Java 1.8.0_51, SUSE Linux
>Reporter: Yan Chen
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Code:
> {code:java}
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.streaming.Durations;
> import org.apache.spark.streaming.StreamingContext;
> import org.apache.spark.streaming.api.java.JavaPairDStream;
> import org.apache.spark.streaming.api.java.JavaStreamingContext;
> import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
> public class App {
>   public static void main(String[] args) {
> final String input = args[0];
> final String check = args[1];
> final long interval = Long.parseLong(args[2]);
> final SparkConf conf = new SparkConf();
> conf.set("spark.streaming.minRememberDuration", "180s");
> conf.set("spark.streaming.receiver.writeAheadLog.enable", "true");
> conf.set("spark.streaming.unpersist", "true");
> conf.set("spark.streaming.ui.retainedBatches", "10");
> conf.set("spark.ui.retainedJobs", "10");
> conf.set("spark.ui.retainedStages", "10");
> conf.set("spark.worker.ui.retainedExecutors", "10");
> conf.set("spark.worker.ui.retainedDrivers", "10");
> conf.set("spark.sql.ui.retainedExecutions", "10");
> JavaStreamingContextFactory jscf = () -> {
>   SparkContext sc = new SparkContext(conf);
>   sc.setCheckpointDir(check);
>   StreamingContext ssc = new StreamingContext(sc, 
> Durations.milliseconds(interval));
>   JavaStreamingContext jssc = new JavaStreamingContext(ssc);
>   jssc.checkpoint(check);
>   // setup pipeline here
>   JavaPairDStream inputStream =
>   jssc.fileStream(
>   input,
>   LongWritable.class,
>   Text.class,
>   TextInputFormat.class,
>   (filepath) -> Boolean.TRUE,
>   false
>   );
>   JavaPairDStream usbk = inputStream
>   .updateStateByKey((current, state) -> state);
>   usbk.checkpoint(Durations.seconds(10));
>   usbk.foreachRDD(rdd -> {
> rdd.count();
> System.out.println("usbk: " + rdd.toDebugString().split("\n").length);
> return null;
>   });
>   return jssc;
> };
> JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(check, jscf);
> jssc.start();
> jssc.awaitTermination();
>   }
> }
> {code}
> Command used to run the code
> {code:none}
> spark-submit --keytab [keytab] --principal [principal] --class [package].App 
> --master yarn --driver-memory 1g --executor-memory 1G --conf 
> "spark.driver.maxResultSize=0" --conf "spark.logConf=true" --conf 
> "spark.executor.instances=2" --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintFlagsFinal -XX:+PrintReferenceGC 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions" --conf 
> "spark.driver.extraJavaOptions=-Xloggc:/[dir]/memory-gc.log 
> -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy 
> -XX:+UnlockDiagnosticVMOptions" [jar-file-path] file:///[dir-on-nas-drive] 
> [dir-on-hdfs] 200
> {code}
> It's a very simple piece of code, but when I run it, the driver's memory usage 
> keeps going up. There is no file input in our runs. The batch interval is set to 
> 200 milliseconds; processing time for each batch is below 150 milliseconds, 
> and for most batches it is below 70 milliseconds.
> !http://i.imgur.com/uSzUui6.png!
> The rightmost four red triangles are full GCs, triggered manually 
> with the "jcmd pid GC.run" command.
> I also did more experiments in the second and third comments I posted.
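
To correlate the growth with batches, a hypothetical helper like the one below (it relies only on java.lang.Runtime; the helper name is a placeholder, not part of the report) could be called from inside the foreachRDD body above to log approximate driver heap usage:

{code:scala}
// Hypothetical helper: logs approximate driver heap usage so growth
// can be correlated with batch processing.
object DriverHeap {
  def log(tag: String): Unit = {
    val rt = Runtime.getRuntime
    val usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L)
    println(s"[$tag] driver heap used: ~$usedMb MB")
  }
}

// e.g. inside the foreachRDD body shown above:
//   DriverHeap.log(s"after count, rdd lineage depth = ${rdd.toDebugString.split("\n").length}")
{code}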



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15731) orc writer directory permissions

2016-06-05 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315887#comment-15315887
 ] 

Ran Haim commented on SPARK-15731:
--

Fine, you can close it - if it works.
I will use a workaround.
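
One possible workaround, sketched here purely as an assumption (it is not necessarily the approach referred to above), is to re-apply directory permissions with the Hadoop FileSystem API after the ORC write completes; the output path, permission bits, and helper name are placeholders:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

// Walk the written output directory and give every partition directory
// execute permission so other users can traverse it (here rwxrwxr-x).
def fixPartitionDirPermissions(output: String, hadoopConf: Configuration): Unit = {
  val root = new Path(output)
  val fs = FileSystem.get(root.toUri, hadoopConf)
  val dirPerm = new FsPermission(Integer.parseInt("775", 8).toShort)

  def walk(dir: Path): Unit = {
    fs.setPermission(dir, dirPerm)
    fs.listStatus(dir).filter(_.isDirectory).foreach(s => walk(s.getPath))
  }
  walk(root)
}

// e.g. after the write finishes:
//   fixPartitionDirPermissions("/path", sparkContext.hadoopConfiguration)
{code}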

> orc writer directory permissions
> 
>
> Key: SPARK-15731
> URL: https://issues.apache.org/jira/browse/SPARK-15731
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Ran Haim
>
> When saving ORC files with partitions, the partition directories created do 
> not have the execute (x) permission (even though umask is 002), so no other 
> users can get inside those directories to read the ORC files.
> When writing Parquet files there is no such issue.
> Code example:
> dataframe.write.format("orc").mode("append").partitionBy("date").save("/path")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15731) orc writer directory permissions

2016-06-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315885#comment-15315885
 ] 

Sean Owen commented on SPARK-15731:
---

[~ran.h...@optimalplus.com] no, I removed "Target Version". Only committers set 
that, and it doesn't make sense to target a version that has already been released. 
Given that [~kevinyu98] sees different behavior on what I presume is a later version, 
I think it's probably fixed in a more up-to-date version.

> orc writer directory permissions
> 
>
> Key: SPARK-15731
> URL: https://issues.apache.org/jira/browse/SPARK-15731
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Ran Haim
>
> When saving ORC files with partitions, the partition directories created do 
> not have the execute (x) permission (even though umask is 002), so no other 
> users can get inside those directories to read the ORC files.
> When writing Parquet files there is no such issue.
> Code example:
> dataframe.write.format("orc").mode("append").partitionBy("date").save("/path")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15731) orc writer directory permissions

2016-06-05 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315884#comment-15315884
 ] 

Ran Haim commented on SPARK-15731:
--

I am using the MapR distribution, and it ships Spark 1.5.1 :/ - so not really.
I did specify the version, but you removed it a few days ago :)

> orc writer directory permissions
> 
>
> Key: SPARK-15731
> URL: https://issues.apache.org/jira/browse/SPARK-15731
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Ran Haim
>
> When saving ORC files with partitions, the partition directories created do 
> not have the execute (x) permission (even though umask is 002), so no other 
> users can get inside those directories to read the ORC files.
> When writing Parquet files there is no such issue.
> Code example:
> dataframe.write.format("orc").mode("append").partitionBy("date").save("/path")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15731) orc writer directory permissions

2016-06-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15731:
--
Affects Version/s: 1.5.1

[~ran.h...@optimalplus.com] best to indicate that in the JIRA, then. Can you try 
master or at least the 2.0 preview? 1.5.1 is pretty old now, and there's a good 
chance this has been fixed along the way.

> orc writer directory permissions
> 
>
> Key: SPARK-15731
> URL: https://issues.apache.org/jira/browse/SPARK-15731
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Ran Haim
>
> When saving ORC files with partitions, the partition directories created do 
> not have the execute (x) permission (even though umask is 002), so no other 
> users can get inside those directories to read the ORC files.
> When writing Parquet files there is no such issue.
> Code example:
> dataframe.write.format("orc").mode("append").partitionBy("date").save("/path")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15472) Add partitioned `csv`, `json`, `text` format support for FileStreamSink

2016-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315842#comment-15315842
 ] 

Apache Spark commented on SPARK-15472:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13518

> Add partitioned `csv`, `json`, `text` format support for FileStreamSink
> ---
>
> Key: SPARK-15472
> URL: https://issues.apache.org/jira/browse/SPARK-15472
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>
> Support for the partitioned `parquet` format in FileStreamSink was added in 
> SPARK-14716; now let's add support for the partitioned `csv`, `json`, and `text` 
> formats.
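
For reference, a minimal usage sketch of what this could look like once supported, assuming the Spark 2.0 {{DataStreamWriter}} API; the schema, paths, and partition columns below are placeholders:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("partitioned-file-sink").getOrCreate()

// Placeholder schema and input path for a hypothetical streaming source.
val schema = new StructType()
  .add("year", IntegerType)
  .add("month", IntegerType)
  .add("value", StringType)

val events = spark.readStream.schema(schema).json("/input/events")

val query = events.writeStream
  .format("csv")                                       // likewise "json" or "text"
  .option("path", "/output/events")
  .option("checkpointLocation", "/checkpoints/events")
  .partitionBy("year", "month")                        // partitioned output per this ticket
  .start()

query.awaitTermination()
{code}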



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15433) PySpark core test should not use SerDe from PythonMLLibAPI

2016-06-05 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315797#comment-15315797
 ] 

Liang-Chi Hsieh commented on SPARK-15433:
-

[~davies] Can you set the assignee to me? Thanks.

> PySpark core test should not use SerDe from PythonMLLibAPI
> --
>
> Key: SPARK-15433
> URL: https://issues.apache.org/jira/browse/SPARK-15433
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Reporter: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently the PySpark core tests use the SerDe from PythonMLLibAPI, which pulls in 
> many MLlib dependencies. They should use SerDeUtil instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14839) Support for other types as option in OPTIONS clause

2016-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14839:


Assignee: (was: Apache Spark)

> Support for other types as option in OPTIONS clause
> ---
>
> Key: SPARK-14839
> URL: https://issues.apache.org/jira/browse/SPARK-14839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This was found in https://github.com/apache/spark/pull/12494.
> Currently, Spark SQL does not support non-string types or {{null}} as option 
> values.
> For example, 
> {code}
> CREATE ...
> USING csv
> OPTIONS (path "your-path", quote null)
> {code}
> throws the exception below:
> {code}
> Unsupported SQL statement
> == SQL ==
>  CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, 
> modelName string, comments string, grp string) USING csv OPTIONS (path 
> "your-path", quote null)   
> org.apache.spark.sql.catalyst.parser.ParseException: 
> Unsupported SQL statement
> == SQL ==
>  CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, 
> modelName string, comments string, grp string) USING csv OPTIONS (path 
> "your-path", quote null)   
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:195)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:764)
> ...
> {code}
> Currently, the Scala API supports options of type {{String}}, {{Long}}, 
> {{Double}}, and {{Boolean}}, and the Python API also supports other types. 
> I think supporting them in SQL as well would keep data source options consistent across APIs.
> It looks okay to provide other types as arguments, just like 
> [Microsoft SQL|https://msdn.microsoft.com/en-us/library/ms190322.aspx], 
> because the [SQL-1992|http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] 
> standard mentions options as below:
> {quote}
> An implementation remains conforming even if it provides user op-
> tions to process nonconforming SQL language or to process conform-
> ing SQL language in a nonconforming manner.
> {quote}
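
For comparison, a small sketch of the typed option overloads already available on the Scala side; the csv options and path below are placeholders:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("typed-options").getOrCreate()

// DataFrameReader.option has String, Boolean, Long and Double overloads,
// so non-string values already work in the Scala API:
val df = spark.read
  .format("csv")
  .option("header", true)        // Boolean overload
  .option("inferSchema", true)   // Boolean overload
  .load("your-path")

// The SQL OPTIONS clause quoted above currently accepts only string literals,
// which is what this ticket proposes to relax.
{code}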



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14839) Support for other types as option in OPTIONS clause

2016-06-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315783#comment-15315783
 ] 

Apache Spark commented on SPARK-14839:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/13517

> Support for other types as option in OPTIONS clause
> ---
>
> Key: SPARK-14839
> URL: https://issues.apache.org/jira/browse/SPARK-14839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This was found in https://github.com/apache/spark/pull/12494.
> Currently, Spark SQL does not support non-string types or {{null}} as option 
> values.
> For example, 
> {code}
> CREATE ...
> USING csv
> OPTIONS (path "your-path", quote null)
> {code}
> throws the exception below:
> {code}
> Unsupported SQL statement
> == SQL ==
>  CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, 
> modelName string, comments string, grp string) USING csv OPTIONS (path 
> "your-path", quote null)   
> org.apache.spark.sql.catalyst.parser.ParseException: 
> Unsupported SQL statement
> == SQL ==
>  CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, 
> modelName string, comments string, grp string) USING csv OPTIONS (path 
> "your-path", quote null)   
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:195)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:764)
> ...
> {code}
> Currently, the Scala API supports options of type {{String}}, {{Long}}, 
> {{Double}}, and {{Boolean}}, and the Python API also supports other types. 
> I think supporting them in SQL as well would keep data source options consistent across APIs.
> It looks okay to provide other types as arguments, just like 
> [Microsoft SQL|https://msdn.microsoft.com/en-us/library/ms190322.aspx], 
> because the [SQL-1992|http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] 
> standard mentions options as below:
> {quote}
> An implementation remains conforming even if it provides user op-
> tions to process nonconforming SQL language or to process conform-
> ing SQL language in a nonconforming manner.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14839) Support for other types as option in OPTIONS clause

2016-06-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14839:


Assignee: Apache Spark

> Support for other types as option in OPTIONS clause
> ---
>
> Key: SPARK-14839
> URL: https://issues.apache.org/jira/browse/SPARK-14839
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> This was found in https://github.com/apache/spark/pull/12494.
> Currently, Spark SQL does not support non-string types or {{null}} as option 
> values.
> For example, 
> {code}
> CREATE ...
> USING csv
> OPTIONS (path "your-path", quote null)
> {code}
> throws the exception below:
> {code}
> Unsupported SQL statement
> == SQL ==
>  CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, 
> modelName string, comments string, grp string) USING csv OPTIONS (path 
> "your-path", quote null)   
> org.apache.spark.sql.catalyst.parser.ParseException: 
> Unsupported SQL statement
> == SQL ==
>  CREATE TEMPORARY TABLE carsTable (yearMade double, makeName string, 
> modelName string, comments string, grp string) USING csv OPTIONS (path 
> "your-path", quote null)   
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:195)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:764)
> ...
> {code}
> Currently, the Scala API supports options of type {{String}}, {{Long}}, 
> {{Double}}, and {{Boolean}}, and the Python API also supports other types. 
> I think supporting them in SQL as well would keep data source options consistent across APIs.
> It looks okay to provide other types as arguments, just like 
> [Microsoft SQL|https://msdn.microsoft.com/en-us/library/ms190322.aspx], 
> because the [SQL-1992|http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt] 
> standard mentions options as below:
> {quote}
> An implementation remains conforming even if it provides user op-
> tions to process nonconforming SQL language or to process conform-
> ing SQL language in a nonconforming manner.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15731) orc writer directory permissions

2016-06-05 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315782#comment-15315782
 ] 

Ran Haim commented on SPARK-15731:
--

Hi Kevin.

It is pretty much the same; the only difference I see is that I created the 
DataFrame with a given schema using createDataFrame(Rdd, StructType).
I am using Spark 1.5.1 - what version did you use?

> orc writer directory permissions
> 
>
> Key: SPARK-15731
> URL: https://issues.apache.org/jira/browse/SPARK-15731
> Project: Spark
>  Issue Type: Bug
>Reporter: Ran Haim
>
> When saving ORC files with partitions, the partition directories created do 
> not have the execute (x) permission (even though umask is 002), so no other 
> users can get inside those directories to read the ORC files.
> When writing Parquet files there is no such issue.
> Code example:
> dataframe.write.format("orc").mode("append").partitionBy("date").save("/path")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15720) MLLIB Word2Vec loading large number of vectors in the model results in java.lang.NegativeArraySizeException

2016-06-05 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315776#comment-15315776
 ] 

yuhao yang commented on SPARK-15720:


Thanks for the feedback. I'll try to investigate possible solutions and post an 
update here.

> MLLIB Word2Vec loading large number of vectors in the model results in 
> java.lang.NegativeArraySizeException
> ---
>
> Key: SPARK-15720
> URL: https://issues.apache.org/jira/browse/SPARK-15720
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
> Environment: Linux
>Reporter: Rohan G Patil
>
> Loading a large number of pre-trained vectors into Spark MLlib's 
> Word2Vec model results in java.lang.NegativeArraySizeException.
> Code - 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L597
> Test with a number of vectors greater than 16777215 and a vector size of 
> 128 or more.
> There is an integer overflow happening here. It should be an easy fix.
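
A standalone sketch of the arithmetic, assuming (as the linked line suggests) that the model keeps all vectors in one flat {{Array[Float]}} whose length is the number of words times the vector size:

{code:scala}
val vectorSize = 128
val numWords   = 16777216                      // 2^24, i.e. "greater than 16777215"

val intSize  = numWords * vectorSize           // Int arithmetic overflows to -2147483648
val longSize = numWords.toLong * vectorSize    // 2147483648, which exceeds Int.MaxValue

println(s"intSize = $intSize, longSize = $longSize")
// new Array[Float](intSize) would throw java.lang.NegativeArraySizeException,
// which matches the reported failure.
{code}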



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org