[jira] [Commented] (SPARK-14597) Streaming Listener timing metrics should include time spent in JobGenerator's graph.generateJobs

2016-04-24 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255980#comment-15255980
 ] 

Prashant Sharma commented on SPARK-14597:
-


{quote}
 On digging further, we found that processingDelay is only clocking time spent 
in the ForEachRDD closure of the Streaming application and that JobGenerator's 
graph.generateJobs method takes significantly more time.
{quote}

Correct me if I am wrong, but generateJobs also includes the time spent scheduling and 
running the actual tasks and waiting for them to finish. The major portion of the time 
should be spent there.

> Streaming Listener timing metrics should include time spent in JobGenerator's 
> graph.generateJobs
> 
>
> Key: SPARK-14597
> URL: https://issues.apache.org/jira/browse/SPARK-14597
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Sachin Aggarwal
>Priority: Minor
>
> While looking to tune our streaming application, the piece of information we were 
> looking for was the actual processing time per batch. The 
> StreamingListener.onBatchCompleted event provides a BatchInfo object that 
> carries this information. It provides the following data:
>  - processingDelay
>  - schedulingDelay
>  - totalDelay
>  - Submission Time
>  The above are essentially calculated from the streaming JobScheduler 
> clocking the processingStartTime and processingEndTime for each JobSet. 
> Another metric available is submissionTime, which is when a JobSet was put on 
> the streaming scheduler's queue. 
>  
> So we took processingDelay as our actual processing time per batch. However, 
> to maintain a stable streaming application, we found that our batch 
> interval had to be a little less than DOUBLE of the reported processingDelay 
> metric. (We are using a DirectKafkaInputStream.) On digging further, we 
> found that processingDelay is only clocking time spent in the ForEachRDD 
> closure of the Streaming application and that JobGenerator's 
> graph.generateJobs 
> (https://github.com/apache/spark/blob/branch-1.6/streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala#L248)
>  method takes significantly more time.
>  Thus a true reflection of processing time is:
>  a - Time spent in JobGenerator's Job Queue (JobGeneratorQueueDelay)
>  b - Time spent in JobGenerator's graph.generateJobs (JobSetCreationDelay)
>  c - Time spent in JobScheduler Queue for a Jobset (existing schedulingDelay 
> metric)
>  d - Time spent in Jobset's job run (existing processingDelay metric)
>  
>  Additionally, a JobGenerator queue delay (#a) could be due to either 
> graph.generateJobs taking longer than the batch interval or other JobGenerator 
> events like checkpointing adding up time. Thus it would be beneficial to 
> report the time taken by the checkpointing job as well.
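
For reference, a minimal listener sketch (an illustration only, assuming an existing StreamingContext named {{ssc}}) that prints the BatchInfo metrics listed above for every completed batch:

{code}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Sketch: print the per-batch timing metrics exposed via BatchInfo.
// Assumes an already-created StreamingContext named `ssc`.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"batch=${info.batchTime} " +
      s"submissionTime=${info.submissionTime} " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)} ms " +
      s"processingDelay=${info.processingDelay.getOrElse(-1L)} ms " +
      s"totalDelay=${info.totalDelay.getOrElse(-1L)} ms")
  }
})
{code}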






[jira] [Updated] (SPARK-14193) Skip unnecessary sorts if input data have been already ordered in InMemoryRelation

2016-04-24 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-14193:
-
Description: 
This ticket describes an opportunity to skip unnecessary sorts if input data 
have been already ordered in InMemoryTable.
Let's say we have a cached table with column 'a' sorted;

{code}
val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
val df2 = df1.sort("a").cache
df2.show // just cache data
{code}

If you say `df2.sort("a")`, the current Spark generates a plan like:
{code}
== Physical Plan ==
Sort [a#13 ASC], true, 0
+- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, None

{code}
This happens because the current implementation cannot tell the difference between globally 
sorted columns and partition-locally sorted ones from `SparkPlan#outputOrdering`.
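
A minimal way to observe this (a sketch, assuming the `df1`/`df2` definitions above; the exact plan text varies by version):
{code}
// Re-sorting the cached, already-sorted DataFrame still produces a Sort node.
df2.sort("a").explain()

// Inspect the ordering reported by the cached plan itself.
println(df2.queryExecution.executedPlan.outputOrdering)
{code}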


  was:
This ticket is to skip unnecessary sorts if input data have been already 
ordered in InMemoryTable.
Let's say we have a cached table with column 'a' sorted;

{code}
val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
val df2 = df1.sort("a").cache
df2.show // just cache data
{code}

If you say `df2.sort("a")`, the current Spark generates a plan like:
{code}
== Physical Plan ==
Sort [a#13 ASC], true, 0
+- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, None

{code}
This ticket targets removing this unnecessary sort.



> Skip unnecessary sorts if input data have been already ordered in 
> InMemoryRelation
> --
>
> Key: SPARK-14193
> URL: https://issues.apache.org/jira/browse/SPARK-14193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Takeshi Yamamuro
>
> This ticket describes an opportunity to skip unnecessary sorts if input data 
> have been already ordered in InMemoryTable.
> Let's say we have a cached table with column 'a' sorted;
> {code}
> val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
> val df2 = df1.sort("a").cache
> df2.show // just cache data
> {code}
> If you say `df2.sort("a")`, the current Spark generates a plan like:
> {code}
> == Physical Plan ==
> Sort [a#13 ASC], true, 0
> +- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
> 1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, 
> None
> {code}
> This happens because the current implementation cannot tell the difference between 
> globally sorted columns and partition-locally sorted ones from 
> `SparkPlan#outputOrdering`.






[jira] [Commented] (SPARK-14751) SparkR fails on Cassandra map with numeric key

2016-04-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-14751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255959#comment-15255959
 ] 

Michał Matłoka commented on SPARK-14751:


Even toString with a warning would be much better than the current ERROR with a message 
that does not explain why it does not work :)

> SparkR fails on Cassandra map with numeric key
> --
>
> Key: SPARK-14751
> URL: https://issues.apache.org/jira/browse/SPARK-14751
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Michał Matłoka
>
> Hi,
> I have created an issue for spark  cassandra connector ( 
> https://datastax-oss.atlassian.net/projects/SPARKC/issues/SPARKC-366 ) but 
> after a bit of digging it seems this is a better place for this issue:
> {code}
> CREATE TABLE test.map (
> id text,
> somemap map,
> PRIMARY KEY (id)
> );
> insert into test.map(id, somemap) values ('a', { 0 : 12 }); 
> {code}
> {code}
>   sqlContext <- sparkRSQL.init(sc)
>   test <-read.df(sqlContext,  source = "org.apache.spark.sql.cassandra",  
> keyspace = "test", table = "map")
>   head(test)
> {code}
> Results in:
> {code}
> 16/04/19 14:47:02 ERROR RBackendHandler: dfToCols on 
> org.apache.spark.sql.api.r.SQLUtils failed
> Error in readBin(con, raw(), stringLen, endian = "big") :
>   invalid 'n' argument
> {code}
> The problem occurs even for an int key. For a text key it works. Every scenario works 
> under Scala & Python.
>  






[jira] [Commented] (SPARK-14759) After join one cannot drop dynamically added column

2016-04-24 Thread praveen dareddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255935#comment-15255935
 ] 

praveen dareddy commented on SPARK-14759:
-

Hi Tomasz,

I am new to Spark but would like to help on this issue. I have the environment 
set up on my local system. I have quite recently started going through the code 
base and am eager to contribute. Can you point me towards the specific module I 
need to understand to solve this issue?

Thanks,
Red

> After join one cannot drop dynamically added column
> ---
>
> Key: SPARK-14759
> URL: https://issues.apache.org/jira/browse/SPARK-14759
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Running the following code:
> {code}
> from pyspark.sql.functions import *
> df1 = sqlContext.createDataFrame([(1,10,)], ['any','hour'])
> df2 = sqlContext.createDataFrame([(1,)], ['any']).withColumn('hour',lit(10))
> j = df1.join(df2,[df1.hour == df2.hour],how='left')
> print("columns after join:{0}".format(j.columns))
> jj = j.drop(df2.hour)
> print("columns after removing 'hour':{0}".format(jj.columns))
> {code}
> should show that, after the join and removing df2.hour, I end up with only one 'hour' 
> column in the dataframe.
> Unfortunately this column is not dropped.
> {code}
> columns after join:['any', 'hour', 'any', 'hour']
> columns after removing 'hour': ['any', 'hour', 'any', 'hour']
> {code}
> I found out that it behaves like that only when the column is added 
> dynamically before the join.






[jira] [Created] (SPARK-14887) Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits

2016-04-24 Thread fang fang chen (JIRA)
fang fang chen created SPARK-14887:
--

 Summary: Generated SpecificUnsafeProjection Exceeds JVM Code Size 
Limits
 Key: SPARK-14887
 URL: https://issues.apache.org/jira/browse/SPARK-14887
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: fang fang chen


Similar issue to SPARK-14138 and SPARK-8443:
With a large SQL statement (673K), the following error happened:

Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Resolved] (SPARK-14870) NPE in generate aggregate

2016-04-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14870.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12651
[https://github.com/apache/spark/pull/12651]

> NPE in generate aggregate
> -
>
> Key: SPARK-14870
> URL: https://issues.apache.org/jira/browse/SPARK-14870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> When running TPCDS Q14a
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 
> (TID 234, localhost): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576)
>   at 
> org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1780)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1806)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:879)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewE

[jira] [Resolved] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala

2016-04-24 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14881.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12648
[https://github.com/apache/spark/pull/12648]

> pyspark and sparkR shell default log level should match spark-shell/Scala
> -
>
> Key: SPARK-14881
> URL: https://issues.apache.org/jira/browse/SPARK-14881
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>Priority: Minor
> Fix For: 2.0.0
>
>
> Scala spark-shell defaults to log level WARN. pyspark and sparkR should match 
> that by default (the user can change it later).
> # ./bin/spark-shell
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
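
Until the defaults match, a minimal workaround sketch (assuming the shell's active context is bound to {{sc}}) is to set the level explicitly after startup; pyspark's SparkContext exposes the same setLogLevel call:
{code}
// spark-shell (Scala): quiet the session down to WARN, matching the Scala default.
sc.setLogLevel("WARN")
{code}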






[jira] [Resolved] (SPARK-14883) Fix wrong R examples and make them up-to-date

2016-04-24 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-14883.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12649
[https://github.com/apache/spark/pull/12649]

> Fix wrong R examples and make them up-to-date
> -
>
> Key: SPARK-14883
> URL: https://issues.apache.org/jira/browse/SPARK-14883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Examples
>Reporter: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This issue aims to fix some errors in R examples and make them up-to-date in 
> docs and example modules.
> - Remove the wrong usage of map. We need to use `lapply` in `SparkR` if 
> needed. However, `lapply` is private now. The correct usage will be added 
> later.
> {code}
> -teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
> ...
> {code}
> - Fix the wrong example in Section `Generic Load/Save Functions` of 
> `docs/sql-programming-guide.md` for consistency.
> {code}
> -df <- loadDF(sqlContext, "people.parquet")
> -saveDF(select(df, "name", "age"), "namesAndAges.parquet")
> +df <- read.df(sqlContext, "examples/src/main/resources/users.parquet")
> +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet")
> {code}
> - Fix datatypes in `sparkr.md`.
> {code}
> -#  |-- age: integer (nullable = true)
> +#  |-- age: long (nullable = true)
> {code}
> {code}
> -## DataFrame[eruptions:double, waiting:double]
> +## SparkDataFrame[eruptions:double, waiting:double]
> {code}
> - Update data results
> {code}
>  head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
>  ##  waiting count
> -##1  81 13
> -##2  60 6
> -##3  68 1
> +##1  70 4
> +##2  67 1
> +##3  69 2
> {code}
> - Replace deprecated functions: jsonFile -> read.json, parquetFile -> 
> read.parquet
> {code}
> df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
> Warning message:
> 'jsonFile' is deprecated.
> Use 'read.json' instead.
> See help("Deprecated") 
> {code}
> - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, 
> saveAsParquetFile -> write.parquet
> - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and 
> `data-manipulation.R`.
> - Other minor syntax fixes and typos.






[jira] [Updated] (SPARK-14883) Fix wrong R examples and make them up-to-date

2016-04-24 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-14883:
--
Assignee: Dongjoon Hyun

> Fix wrong R examples and make them up-to-date
> -
>
> Key: SPARK-14883
> URL: https://issues.apache.org/jira/browse/SPARK-14883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Examples
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This issue aims to fix some errors in R examples and make them up-to-date in 
> docs and example modules.
> - Remove the wrong usage of map. We need to use `lapply` in `SparkR` if 
> needed. However, `lapply` is private now. The correct usage will be added 
> later.
> {code}
> -teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
> ...
> {code}
> - Fix the wrong example in Section `Generic Load/Save Functions` of 
> `docs/sql-programming-guide.md` for consistency.
> {code}
> -df <- loadDF(sqlContext, "people.parquet")
> -saveDF(select(df, "name", "age"), "namesAndAges.parquet")
> +df <- read.df(sqlContext, "examples/src/main/resources/users.parquet")
> +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet")
> {code}
> - Fix datatypes in `sparkr.md`.
> {code}
> -#  |-- age: integer (nullable = true)
> +#  |-- age: long (nullable = true)
> {code}
> {code}
> -## DataFrame[eruptions:double, waiting:double]
> +## SparkDataFrame[eruptions:double, waiting:double]
> {code}
> - Update data results
> {code}
>  head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
>  ##  waiting count
> -##1  81 13
> -##2  60 6
> -##3  68 1
> +##1  70 4
> +##2  67 1
> +##3  69 2
> {code}
> - Replace deprecated functions: jsonFile -> read.json, parquetFile -> 
> read.parquet
> {code}
> df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
> Warning message:
> 'jsonFile' is deprecated.
> Use 'read.json' instead.
> See help("Deprecated") 
> {code}
> - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, 
> saveAsParquetFile -> write.parquet
> - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and 
> `data-manipulation.R`.
> - Other minor syntax fixes and typos.






[jira] [Updated] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.

2016-04-24 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-13902:
--
Description: 
{{DAGScheduler}} sometimes generates an incorrect stage graph.
Some stages are generated for the same shuffleId twice or more, and they are 
referenced by the child stages, because the building order of the graph is not 
correct.

Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see this 
in {{monospaced}} font):

{noformat}
  <
/   \
[A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
   \   /
 <
{noformat}

{{DAGScheduler}} generates the following stages and their parents for each 
shuffle id:

| shuffle id | stage | parents |
| 0 | ShuffleMapStage 2 | List() |
| 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) |
| 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
| 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
| \- | ResultStage 6 | List(ShuffleMapStage 5) |

The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage 
for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}}, so 
{{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage 
{{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.


  was:
{{DAGScheduler}} sometimes generates an incorrect stage graph.
Some stages are generated for the same shuffleId twice or more, and they are 
referenced by the child stages, because the building order of the graph is not 
correct.

Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see this 
in {{monospace}} font):

{noformat}
  <
/   \
[A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
   \   /
 <
{noformat}

{{DAGScheduler}} generates the following stages and their parents for each 
shuffle id:

| shuffle id | stage | parents |
| 0 | ShuffleMapStage 2 | List() |
| 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) |
| 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
| 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
| \- | ResultStage 6 | List(ShuffleMapStage 5) |

The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage 
for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}}, so 
{{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage 
{{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.



> Make DAGScheduler.getAncestorShuffleDependencies() return in topological 
> order to ensure building ancestor stages first.
> 
>
> Key: SPARK-13902
> URL: https://issues.apache.org/jira/browse/SPARK-13902
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generates an incorrect stage graph.
> Some stages are generated for the same shuffleId twice or more, and they are 
> referenced by the child stages, because the building order of the graph is not 
> correct.
> Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see 
> this in {{monospaced}} font):
> {noformat}
>   <
> /   \
> [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
>\   /
>  <
> {noformat}
> {{DAGScheduler}} generates the following stages and their parents for each 
> shuffle id:
> | shuffle id | stage | parents |
> | 0 | ShuffleMapStage 2 | List() |
> | 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
> | 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) |
> | 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
> | 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
> | \- | ResultStage 6 | List(ShuffleMapStage 5) |
> The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage 
> for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}}, so 
> {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage 
> {{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.




[jira] [Updated] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.

2016-04-24 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-13902:
--
Description: 
{{DAGScheduler}} sometimes generates an incorrect stage graph.
Some stages are generated for the same shuffleId twice or more, and they are 
referenced by the child stages, because the building order of the graph is not 
correct.

Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see this 
in {{monospace}} font):

{noformat}
  <
/   \
[A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
   \   /
 <
{noformat}

{{DAGScheduler}} generates the following stages and their parents for each 
shuffle id:

| shuffle id | stage | parents |
| 0 | ShuffleMapStage 2 | List() |
| 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) |
| 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
| 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
| \- | ResultStage 6 | List(ShuffleMapStage 5) |

The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage 
for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}}, so 
{{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage 
{{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.


  was:
{{DAGScheduler}} sometimes generates an incorrect stage graph.
Some stages are generated for the same shuffleId twice or more, and they are 
referenced by the child stages, because the building order of the graph is not 
correct.

Here, we submit an RDD\[F\] having a lineage of RDDs as follows:

{code}
  <
 /   \
 [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
\   /
  <
{code}

I added the sample RDD graph to show the illegal stage graph to 
{{DAGSchedulerSuite}} and then fixed it.


> Make DAGScheduler.getAncestorShuffleDependencies() return in topological 
> order to ensure building ancestor stages first.
> 
>
> Key: SPARK-13902
> URL: https://issues.apache.org/jira/browse/SPARK-13902
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generates an incorrect stage graph.
> Some stages are generated for the same shuffleId twice or more, and they are 
> referenced by the child stages, because the building order of the graph is not 
> correct.
> Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see 
> this in {{monospace}} font):
> {noformat}
>   <
> /   \
> [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
>\   /
>  <
> {noformat}
> {{DAGScheduler}} generates the following stages and their parents for each 
> shuffle id:
> | shuffle id | stage | parents |
> | 0 | ShuffleMapStage 2 | List() |
> | 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
> | 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) |
> | 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
> | 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
> | \- | ResultStage 6 | List(ShuffleMapStage 5) |
> The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage 
> for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}}, so 
> {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage 
> {{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.






[jira] [Reopened] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-24 Thread Hemant Bhanawat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemant Bhanawat reopened SPARK-13693:
-

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}






[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-24 Thread Hemant Bhanawat (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255881#comment-15255881
 ] 

Hemant Bhanawat commented on SPARK-13693:
-

Latest Jenkins builds are failing with this issue. See:
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2863
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2865

> Flaky test: o.a.s.streaming.MapWithStateSuite
> -
>
> Key: SPARK-13693
> URL: https://issues.apache.org/jira/browse/SPARK-13693
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> Fixed the following flaky test:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/256/testReport/junit/org.apache.spark.streaming/MapWithStateSuite/_It_is_not_a_test_/
> {code}
> sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/streaming/checkpoint/spark-e97794a8-b940-4b21-8685-bf1221f9444d
>   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:934)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply$mcV$sp(MapWithStateSuite.scala:47)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
>   at 
> org.apache.spark.streaming.MapWithStateSuite$$anonfun$2.apply(MapWithStateSuite.scala:45)
> {code}






[jira] [Updated] (SPARK-14883) Fix wrong R examples and make them up-to-date

2016-04-24 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14883:
--
Description: 
This issue aims to fix some errors in R examples and make them up-to-date in 
docs and example modules.

- Remove the wrong usage of map. We need to use `lapply` in `SparkR` if needed. 
However, `lapply` is private now. The correct usage will be added later.
{code}
-teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
...
{code}

- Fix the wrong example in Section `Generic Load/Save Functions` of 
`docs/sql-programming-guide.md` for consistency.
{code}
-df <- loadDF(sqlContext, "people.parquet")
-saveDF(select(df, "name", "age"), "namesAndAges.parquet")
+df <- read.df(sqlContext, "examples/src/main/resources/users.parquet")
+write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet")
{code}

- Fix datatypes in `sparkr.md`.
{code}
-#  |-- age: integer (nullable = true)
+#  |-- age: long (nullable = true)
{code}
{code}
-## DataFrame[eruptions:double, waiting:double]
+## SparkDataFrame[eruptions:double, waiting:double]
{code}

- Update data results
{code}
 head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
 ##  waiting count
-##1  81 13
-##2  60 6
-##3  68 1
+##1  70 4
+##2  67 1
+##3  69 2
{code}

- Replace deprecated functions: jsonFile -> read.json, parquetFile -> 
read.parquet
{code}
df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
Warning message:
'jsonFile' is deprecated.
Use 'read.json' instead.
See help("Deprecated") 
{code}

- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, 
saveAsParquetFile -> write.parquet

- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and 
`data-manipulation.R`.

- Other minor syntax fixes and typos.

  was:
This issue aims to fix some errors in R examples and make them up-to-date in 
docs and example modules.

- Fix the wrong usage of map. We need to use `lapply` if needed. However, the 
usage of `lapply` also needs to be reviewed since it's private.
{code}
-teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
+teenNames <- SparkR:::lapply(teenagers, function(p) { paste("Name:", p$name) })
{code}

- Fix the wrong example in Section `Generic Load/Save Functions` of 
`docs/sql-programming-guide.md` for consistency.
{code}
-df <- loadDF(sqlContext, "people.parquet")
-saveDF(select(df, "name", "age"), "namesAndAges.parquet")
+df <- read.df(sqlContext, "examples/src/main/resources/users.parquet")
+write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet")
{code}

- Fix datatypes in `sparkr.md`.
{code}
-#  |-- age: integer (nullable = true)
+#  |-- age: long (nullable = true)
{code}
{code}
-## DataFrame[eruptions:double, waiting:double]
+## SparkDataFrame[eruptions:double, waiting:double]
{code}

- Update data results
{code}
 head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
 ##  waiting count
-##1  81 13
-##2  60 6
-##3  68 1
+##1  70 4
+##2  67 1
+##3  69 2
{code}

- Replace deprecated functions: jsonFile -> read.json, parquetFile -> 
read.parquet
{code}
df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
Warning message:
'jsonFile' is deprecated.
Use 'read.json' instead.
See help("Deprecated") 
{code}

- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, 
saveAsParquetFile -> write.parquet

- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and 
`data-manipulation.R`.

- Other minor syntax fixes and typos.


> Fix wrong R examples and make them up-to-date
> -
>
> Key: SPARK-14883
> URL: https://issues.apache.org/jira/browse/SPARK-14883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Examples
>Reporter: Dongjoon Hyun
>
> This issue aims to fix some errors in R examples and make them up-to-date in 
> docs and example modules.
> - Remove the wrong usage of map. We need to use `lapply` in `SparkR` if 
> needed. However, `lapply` is private now. The correct usage will be added 
> later.
> {code}
> -teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
> ...
> {code}
> - Fix the wrong example in Section `Generic Load/Save Functions` of 
> `docs/sql-programming-guide.md` for consistency.
> {code}
> -df <- loadDF(sqlContext, "people.parquet")
> -saveDF(select(df, "name", "age"), "namesAndAges.parquet")
> +df <- read.df(sqlContext, "examples/src/main/resources/users.parquet")
> +write.df(select(df, "name", "favorite_color"), "namesAndFavColors.parquet")
> {code}
> - Fix datatypes in `sparkr.md`.
> {code}
> -#  |-- age: integer (nullable = true)
> +#  |-- age: long (nullable = true)
> {code}
> {code}
> -## DataFrame[eruptions:double, w

[jira] [Resolved] (SPARK-14885) When creating a CatalogColumn, we should use catalogString of a DataType

2016-04-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14885.
-
   Resolution: Fixed
 Assignee: Yin Huai
Fix Version/s: 2.0.0

> When creating a CatalogColumn, we should use catalogString of a DataType
> 
>
> Key: SPARK-14885
> URL: https://issues.apache.org/jira/browse/SPARK-14885
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Minor
> Fix For: 2.0.0
>
>
> Right now, the data type field of a CatalogColumn uses the string 
> representation. When we create this string from a DataType object, there are 
> places where we use simpleString. Although catalogString is the same as 
> simpleString right now, it is better to use catalogString, so that we do not 
> introduce issues when we change the semantics of simpleString or the 
> implementation of catalogString.
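
For illustration, a small sketch (assuming both methods are available on DataType, as the description above implies) comparing the two string forms:
{code}
import org.apache.spark.sql.types._

// Sketch: the two string forms currently agree; the point of the change is to
// depend on catalogString when writing types into the catalog.
val dt: DataType = ArrayType(DecimalType(10, 2))
println(dt.simpleString)   // e.g. array<decimal(10,2)>
println(dt.catalogString)  // currently the same string
{code}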






[jira] [Resolved] (SPARK-14868) Enable NewLineAtEofChecker in checkstyle and fix lint-java errors

2016-04-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14868.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Enable NewLineAtEofChecker in checkstyle and fix lint-java errors
> -
>
> Key: SPARK-14868
> URL: https://issues.apache.org/jira/browse/SPARK-14868
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.0
>
>
> Spark uses the `NewLineAtEofChecker` rule for Scala via ScalaStyle, and most Java 
> code also complies with the rule. This issue aims to enforce the same rule, 
> `NewlineAtEndOfFile`, via Checkstyle explicitly. It also fixes the lint-java 
> errors introduced since SPARK-14465.
> This issue does the following items.
> * Adds a new line at the end of the files (19 files)
> * Fixes 25 lint-java errors (12 RedundantModifier, 6 ArrayTypeStyle, 2 
> LineLength, 2 UnusedImports, 2 RegexpSingleline, 1 ModifierOrder)






[jira] [Updated] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.

2016-04-24 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-13902:
--
Description: 
{{DAGScheduler}} sometimes generates an incorrect stage graph.
Some stages are generated for the same shuffleId twice or more, and they are 
referenced by the child stages, because the building order of the graph is not 
correct.

Here, we submit an RDD\[F\] having a lineage of RDDs as follows:

{code}
  <
 /   \
 [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
\   /
  <
{code}

I added the sample RDD graph to show the illegal stage graph to 
{{DAGSchedulerSuite}} and then fixed it.

  was:
{{DAGScheduler}} sometimes generates an incorrect stage graph.
Some stages are generated for the same shuffleId twice or more, and they are 
referenced by the child stages, because the building order of the graph is not 
correct.

I added the sample RDD graph to show the illegal stage graph to 
{{DAGSchedulerSuite}} and then fixed it.


> Make DAGScheduler.getAncestorShuffleDependencies() return in topological 
> order to ensure building ancestor stages first.
> 
>
> Key: SPARK-13902
> URL: https://issues.apache.org/jira/browse/SPARK-13902
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generates an incorrect stage graph.
> Some stages are generated for the same shuffleId twice or more, and they are 
> referenced by the child stages, because the building order of the graph is not 
> correct.
> Here, we submit an RDD\[F\] having a lineage of RDDs as follows:
> {code}
>   <
>  /   \
>  [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
> \   /
>   <
> {code}
> I added the sample RDD graph to show the illegal stage graph to 
> {{DAGSchedulerSuite}} and then fixed it.






[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-24 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255840#comment-15255840
 ] 

holdenk commented on SPARK-14654:
-

Ah yes, in DAGScheduler right now we cast it to [Any, Any] when we get it from 
the map, so I suppose the type safety would be lost and we would just hide the cast 
from the developer, which probably isn't worth it.
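
To illustrate the point under discussion (a generic sketch, not Spark's actual DAGScheduler code): once accumulators sit in a heterogeneous collection, the element type is existential and a cast happens on retrieval, whether the developer writes it or the API hides it:

{code}
// Hypothetical sketch of the erasure issue being discussed (not DAGScheduler code).
trait Acc[IN, OUT] { def merge(other: Acc[IN, OUT]): Unit }

val registry = scala.collection.mutable.Map.empty[Long, Acc[_, _]]

def mergeInto(id: Long, update: Acc[_, _]): Unit = {
  // The map only knows Acc[_, _], so IN/OUT cannot be checked here;
  // the cast still happens, it is just hidden from the caller.
  registry(id).asInstanceOf[Acc[Any, Any]].merge(update.asInstanceOf[Acc[Any, Any]])
}
{code}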

> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide their own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: String): LongAccumulator
>   def newLongAccumulator(name: String, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor






[jira] [Updated] (SPARK-14886) RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException

2016-04-24 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-14886:
---
Description: 
 @Since("1.2.0")
  def ndcgAt(k: Int): Double = {
require(k > 0, "ranking position k should be positive")
predictionAndLabels.map { case (pred, lab) =>
  val labSet = lab.toSet

  if (labSet.nonEmpty) {
val labSetSize = labSet.size
val n = math.min(math.max(pred.length, labSetSize), k)
var maxDcg = 0.0
var dcg = 0.0
var i = 0
while (i < n) {
  val gain = 1.0 / math.log(i + 2)
  if (labSet.contains(pred(i))) {
dcg += gain
  }
  if (i < labSetSize) {
maxDcg += gain
  }
  i += 1
}
dcg / maxDcg
  } else {
logWarning("Empty ground truth set, check input data")
0.0
  }
}.mean()
  }

"if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException when 
pred's size less then k.
That meas the true relevant documents has less size then the param k.
just try this with sample_movielens_data.txt
precisionAt is ok just because it has 
val n = math.min(pred.length, k)




  was:
 @Since("1.2.0")
  def ndcgAt(k: Int): Double = {
require(k > 0, "ranking position k should be positive")
predictionAndLabels.map { case (pred, lab) =>
  val labSet = lab.toSet

  if (labSet.nonEmpty) {
val labSetSize = labSet.size
val n = math.min(math.max(pred.length, labSetSize), k)
var maxDcg = 0.0
var dcg = 0.0
var i = 0
while (i < n) {
  val gain = 1.0 / math.log(i + 2)
  if (labSet.contains(pred(i))) {
dcg += gain
  }
  if (i < labSetSize) {
maxDcg += gain
  }
  i += 1
}
dcg / maxDcg
  } else {
logWarning("Empty ground truth set, check input data")
0.0
  }
}.mean()
  }

   if (labSet.contains(pred(i))) will throw an ArrayIndexOutOfBoundsException when 
the true relevant documents have a smaller size than the param k.
Just try this with sample_movielens_data.txt.
precisionAt is OK only because it has 
val n = math.min(pred.length, k)





> RankingMetrics.ndcgAt  throw  java.lang.ArrayIndexOutOfBoundsException
> --
>
> Key: SPARK-14886
> URL: https://issues.apache.org/jira/browse/SPARK-14886
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
>  @Since("1.2.0")
>   def ndcgAt(k: Int): Double = {
> require(k > 0, "ranking position k should be positive")
> predictionAndLabels.map { case (pred, lab) =>
>   val labSet = lab.toSet
>   if (labSet.nonEmpty) {
> val labSetSize = labSet.size
> val n = math.min(math.max(pred.length, labSetSize), k)
> var maxDcg = 0.0
> var dcg = 0.0
> var i = 0
> while (i < n) {
>   val gain = 1.0 / math.log(i + 2)
>   if (labSet.contains(pred(i))) {
> dcg += gain
>   }
>   if (i < labSetSize) {
> maxDcg += gain
>   }
>   i += 1
> }
> dcg / maxDcg
>   } else {
> logWarning("Empty ground truth set, check input data")
> 0.0
>   }
> }.mean()
>   }
> "if (labSet.contains(pred(i)))" will throw ArrayIndexOutOfBoundsException 
> when pred's size less then k.
> That meas the true relevant documents has less size then the param k.
> just try this with sample_movielens_data.txt
> precisionAt is ok just because it has 
> val n = math.min(pred.length, k)
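
For comparison, a hypothetical standalone sketch (an assumption about one possible guard, not the actual Spark patch) that bounds the pred(i) access the same way precisionAt bounds n:

{code}
// Hypothetical sketch (not the Spark patch): guard the pred(i) access so a
// prediction array shorter than k cannot overflow.
def guardedNdcgAt(pred: Array[Int], labSet: Set[Int], k: Int): Double = {
  val labSetSize = labSet.size
  val n = math.min(math.max(pred.length, labSetSize), k)
  var maxDcg = 0.0
  var dcg = 0.0
  var i = 0
  while (i < n) {
    val gain = 1.0 / math.log(i + 2)
    if (i < pred.length && labSet.contains(pred(i))) {  // added bound check
      dcg += gain
    }
    if (i < labSetSize) {
      maxDcg += gain
    }
    i += 1
  }
  if (labSet.nonEmpty) dcg / maxDcg else 0.0
}
{code}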






[jira] [Created] (SPARK-14886) RankingMetrics.ndcgAt throw java.lang.ArrayIndexOutOfBoundsException

2016-04-24 Thread lichenglin (JIRA)
lichenglin created SPARK-14886:
--

 Summary: RankingMetrics.ndcgAt  throw  
java.lang.ArrayIndexOutOfBoundsException
 Key: SPARK-14886
 URL: https://issues.apache.org/jira/browse/SPARK-14886
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.6.1
Reporter: lichenglin


 @Since("1.2.0")
  def ndcgAt(k: Int): Double = {
require(k > 0, "ranking position k should be positive")
predictionAndLabels.map { case (pred, lab) =>
  val labSet = lab.toSet

  if (labSet.nonEmpty) {
val labSetSize = labSet.size
val n = math.min(math.max(pred.length, labSetSize), k)
var maxDcg = 0.0
var dcg = 0.0
var i = 0
while (i < n) {
  val gain = 1.0 / math.log(i + 2)
  if (labSet.contains(pred(i))) {
dcg += gain
  }
  if (i < labSetSize) {
maxDcg += gain
  }
  i += 1
}
dcg / maxDcg
  } else {
logWarning("Empty ground truth set, check input data")
0.0
  }
}.mean()
  }

   if (labSet.contains(pred(i))) will throw an ArrayIndexOutOfBoundsException when 
the true relevant documents have a smaller size than the param k.
Just try this with sample_movielens_data.txt.
precisionAt is OK only because it has 
val n = math.min(pred.length, k)









[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255820#comment-15255820
 ] 

Reynold Xin commented on SPARK-14654:
-

But Holden, the call sites in Spark are already erasing the types, because the 
accumulators are put into a collection.


> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide their own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulator()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: String): LongAccumulator
>   def newLongAccumulator(name: String, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor






[jira] [Commented] (SPARK-14654) New accumulator API

2016-04-24 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255819#comment-15255819
 ] 

holdenk commented on SPARK-14654:
-

Even if merge isn't supposed to be called by the end user, it would be nice to 
have it be type safe - we would also replace the conditional (which relies on 
reflection or a cast that we expect developers to remember) with a type 
parameter. By adding this, I'd argue we reduce the complexity/cost of the API.
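For illustration, a minimal sketch of the self-typed variant being suggested here 
(the names and exact shape are hypothetical, not taken from the proposal or any PR):

{code}
// Sketch only: thread the concrete accumulator type through as a type
// parameter so merge() accepts that type directly, removing the runtime
// match/cast shown in LongAccumulator.merge above.
abstract class TypedAccumulator[IN, OUT, SELF <: TypedAccumulator[IN, OUT, SELF]]
  extends Serializable {
  def reset(): Unit
  def add(v: IN): Unit
  def merge(other: SELF): Unit   // type safe: no UnsupportedOperationException path
  def value: OUT
}

class TypedLongAccumulator
  extends TypedAccumulator[java.lang.Long, java.lang.Long, TypedLongAccumulator] {
  private[this] var _sum = 0L
  override def reset(): Unit = _sum = 0L
  override def add(v: java.lang.Long): Unit = _sum += v
  override def merge(other: TypedLongAccumulator): Unit = _sum += other.sum
  override def value: java.lang.Long = _sum
  def sum: Long = _sum
}
{code}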

> New accumulator API
> ---
>
> Key: SPARK-14654
> URL: https://issues.apache.org/jira/browse/SPARK-14654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The current accumulator API has a few problems:
> 1. Its type hierarchy is very complicated, with Accumulator, Accumulable, 
> AccumulatorParam, AccumulableParam, etc.
> 2. The intermediate buffer type must be the same as the output type, so there 
> is no way to define an accumulator that computes averages.
> 3. It is very difficult to specialize the methods, leading to excessive 
> boxing and making accumulators bad for metrics that change for each record.
> 4. There is not a single coherent API that works for both Java and Scala.
> This is a proposed new API that addresses all of the above. In this new API:
> 1. There is only a single class (Accumulator) that is user facing
> 2. The intermediate value is stored in the accumulator itself and can be 
> different from the output type.
> 3. Concrete implementations can provide their own specialized methods.
> 4. Designed to work for both Java and Scala.
> {code}
> abstract class Accumulator[IN, OUT] extends Serializable {
>   def isRegistered: Boolean = ...
>   def register(metadata: AccumulatorMetadata): Unit = ...
>   def metadata: AccumulatorMetadata = ...
>   def reset(): Unit
>   def add(v: IN): Unit
>   def merge(other: Accumulator[IN, OUT]): Unit
>   def value: OUT
>   def localValue: OUT = value
>   final def registerAccumulatorOnExecutor(): Unit = {
> // Automatically register the accumulator when it is deserialized with 
> the task closure.
> // This is for external accumulators and internal ones that do not 
> represent task level
> // metrics, e.g. internal SQL metrics, which are per-operator.
> val taskContext = TaskContext.get()
> if (taskContext != null) {
>   taskContext.registerAccumulator(this)
> }
>   }
>   // Called by Java when deserializing an object
>   private def readObject(in: ObjectInputStream): Unit = 
> Utils.tryOrIOException {
> in.defaultReadObject()
> registerAccumulatorOnExecutor()
>   }
> }
> {code}
> Metadata, provided by Spark after registration:
> {code}
> class AccumulatorMetadata(
>   val id: Long,
>   val name: Option[String],
>   val countFailedValues: Boolean
> ) extends Serializable
> {code}
> and an implementation that also offers specialized getters and setters
> {code}
> class LongAccumulator extends Accumulator[jl.Long, jl.Long] {
>   private[this] var _sum = 0L
>   override def reset(): Unit = _sum = 0L
>   override def add(v: jl.Long): Unit = {
> _sum += v
>   }
>   override def merge(other: Accumulator[jl.Long, jl.Long]): Unit = other 
> match {
> case o: LongAccumulator => _sum += o.sum
> case _ => throw new UnsupportedOperationException(
>   s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
>   }
>   override def value: jl.Long = _sum
>   def sum: Long = _sum
> }
> {code}
> and SparkContext...
> {code}
> class SparkContext {
>   ...
>   def newLongAccumulator(): LongAccumulator
>   def newLongAccumulator(name: String): LongAccumulator
>   def newLongAccumulator(name: String, dedup: Boolean): LongAccumulator
>   def registerAccumulator[IN, OUT](acc: Accumulator[IN, OUT]): 
> Accumulator[IN, OUT]
>   ...
> }
> {code}
> To use it ...
> {code}
> val acc = sc.newLongAccumulator()
> sc.parallelize(1 to 1000).map { i =>
>   acc.add(1)
>   i
> }
> {code}
> A work-in-progress prototype here: 
> https://github.com/rxin/spark/tree/accumulator-refactor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14845) spark.files in properties file is not distributed to driver in yarn-cluster mode

2016-04-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14845:


Assignee: Apache Spark

> spark.files in properties file is not distributed to driver in yarn-cluster 
> mode
> 
>
> Key: SPARK-14845
> URL: https://issues.apache.org/jira/browse/SPARK-14845
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> I use the following command to run the SparkPi example, with spark.files defined 
> in the properties file. It turns out that in yarn-cluster mode this README.md is 
> only distributed to the executors but not to the driver.
> content of properties file
> {noformat}
> spark.files=/Users/jzhang/github/spark/README.md
> {noformat}
> launch command
> {noformat}
> bin/spark-submit --master yarn-cluster --properties-file a.properties --class 
> org.apache.spark.examples.SparkPi 
> examples/target/original-spark-examples_2.11-2.0.0-SNAPSHOT.jar
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14845) spark.files in properties file is not distributed to driver in yarn-cluster mode

2016-04-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255810#comment-15255810
 ] 

Apache Spark commented on SPARK-14845:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/12656

> spark.files in properties file is not distributed to driver in yarn-cluster 
> mode
> 
>
> Key: SPARK-14845
> URL: https://issues.apache.org/jira/browse/SPARK-14845
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Reporter: Jeff Zhang
>Priority: Minor
>
> I use the following command to run the SparkPi example, with spark.files defined 
> in the properties file. It turns out that in yarn-cluster mode this README.md is 
> only distributed to the executors but not to the driver.
> content of properties file
> {noformat}
> spark.files=/Users/jzhang/github/spark/README.md
> {noformat}
> launch command
> {noformat}
> bin/spark-submit --master yarn-cluster --properties-file a.properties --class 
> org.apache.spark.examples.SparkPi 
> examples/target/original-spark-examples_2.11-2.0.0-SNAPSHOT.jar
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14845) spark.files in properties file is not distributed to driver in yarn-cluster mode

2016-04-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14845:


Assignee: (was: Apache Spark)

> spark.files in properties file is not distributed to driver in yarn-cluster 
> mode
> 
>
> Key: SPARK-14845
> URL: https://issues.apache.org/jira/browse/SPARK-14845
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Reporter: Jeff Zhang
>Priority: Minor
>
> I use the following command to run the SparkPi example, with spark.files defined 
> in the properties file. It turns out that in yarn-cluster mode this README.md is 
> only distributed to the executors but not to the driver.
> content of properties file
> {noformat}
> spark.files=/Users/jzhang/github/spark/README.md
> {noformat}
> launch command
> {noformat}
> bin/spark-submit --master yarn-cluster --properties-file a.properties --class 
> org.apache.spark.examples.SparkPi 
> examples/target/original-spark-examples_2.11-2.0.0-SNAPSHOT.jar
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14845) spark.files in properties file is not distributed to driver in yarn-cluster mode

2016-04-24 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-14845:
---
Component/s: Spark Submit

> spark.files in properties file is not distributed to driver in yarn-cluster 
> mode
> 
>
> Key: SPARK-14845
> URL: https://issues.apache.org/jira/browse/SPARK-14845
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Reporter: Jeff Zhang
>Priority: Minor
>
> I use the following command to run the SparkPi example, with spark.files defined 
> in the properties file. It turns out that in yarn-cluster mode this README.md is 
> only distributed to the executors but not to the driver.
> content of properties file
> {noformat}
> spark.files=/Users/jzhang/github/spark/README.md
> {noformat}
> launch command
> {noformat}
> bin/spark-submit --master yarn-cluster --properties-file a.properties --class 
> org.apache.spark.examples.SparkPi 
> examples/target/original-spark-examples_2.11-2.0.0-SNAPSHOT.jar
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14876) SparkSession should be case insensitive by default

2016-04-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14876.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> SparkSession should be case insensitive by default
> --
>
> Key: SPARK-14876
> URL: https://issues.apache.org/jira/browse/SPARK-14876
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This would match most database systems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.

2016-04-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255798#comment-15255798
 ] 

Apache Spark commented on SPARK-13902:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12655

> Make DAGScheduler.getAncestorShuffleDependencies() return in topological 
> order to ensure building ancestor stages first.
> 
>
> Key: SPARK-13902
> URL: https://issues.apache.org/jira/browse/SPARK-13902
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generates an incorrect stage graph.
> Some stages are generated two or more times for the same shuffleId and are 
> referenced by the child stages, because the graph is not built in the correct 
> order.
> I added a sample RDD graph that exhibits the illegal stage graph to 
> {{DAGSchedulerSuite}} and then fixed it.
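For readers following along, a hypothetical example (not necessarily the RDD graph 
added to {{DAGSchedulerSuite}}) of the kind of shape involved, where one shuffle 
dependency is itself an ancestor of another and the ancestor stages must be built 
before the stages that depend on them:

{code}
// Illustrative only: rddD's shuffle depends on rddB's shuffle, and the final
// join depends on both, so ancestor shuffle dependencies must be handled in
// topological order to avoid building the same shuffle stage twice.
val rddA = sc.parallelize(1 to 100).map(i => (i % 10, i))
val rddB = rddA.reduceByKey(_ + _)                 // shuffle 1
val rddC = rddB.map { case (k, v) => (v % 5, k) }
val rddD = rddC.reduceByKey(_ + _)                 // shuffle 2 (depends on shuffle 1)
val rddE = rddA.join(rddD)                         // depends on both shuffles
rddE.count()
{code}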



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation

2016-04-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255765#comment-15255765
 ] 

Jeff Zhang edited comment on SPARK-14834 at 4/25/16 1:09 AM:
-

This is for enforcing users to add Python docs when adding a new Python API with 
the @since annotation. But thinking about it again, this is only suitable when 
adding a new API to an existing Python module. If it is a new Python module being 
migrated from the Scala API, the Python doc is not mandatory. 


was (Author: zjffdu):
This is for enforcing users to add Python docs when adding a new Python API with 
the @since annotation. 

> Force adding doc for new api in pyspark with @since annotation
> --
>
> Key: SPARK-14834
> URL: https://issues.apache.org/jira/browse/SPARK-14834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14834) Force adding doc for new api in pyspark with @since annotation

2016-04-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255765#comment-15255765
 ] 

Jeff Zhang commented on SPARK-14834:


This is for enforcing users to add Python docs when adding a new Python API with 
the @since annotation. 

> Force adding doc for new api in pyspark with @since annotation
> --
>
> Key: SPARK-14834
> URL: https://issues.apache.org/jira/browse/SPARK-14834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14693) Spark Streaming Context Hangs on Start

2016-04-24 Thread Evan Oman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255734#comment-15255734
 ] 

Evan Oman commented on SPARK-14693:
---

I do get the following error after waiting for two hours:

{code}
java.rmi.RemoteException: java.util.concurrent.TimeoutException: Timed out 
retrying send to http://10.210.224.74:7070: 2 hours; nested exception is: 
java.util.concurrent.TimeoutException: Timed out retrying send to 
http://10.210.224.74:7070: 2 hours
at 
com.databricks.backend.daemon.data.client.DbfsClient.send0(DbfsClient.scala:71)
at 
com.databricks.backend.daemon.data.client.DbfsClient.sendIdempotent(DbfsClient.scala:40)
at 
com.databricks.backend.daemon.data.client.DatabricksFileSystem.listStatus(DatabricksFileSystem.scala:189)
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLog.initializeOrRecover(FileBasedWriteAheadLog.scala:228)
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLog.<init>(FileBasedWriteAheadLog.scala:72)
at 
org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:141)
at 
org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:141)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.streaming.util.WriteAheadLogUtils$.createLog(WriteAheadLogUtils.scala:140)
at 
org.apache.spark.streaming.util.WriteAheadLogUtils$.createLogForDriver(WriteAheadLogUtils.scala:98)
at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:254)
at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:252)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker.createWriteAheadLog(ReceivedBlockTracker.scala:252)
at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker.<init>(ReceivedBlockTracker.scala:75)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker.<init>(ReceiverTracker.scala:106)
at 
org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:80)
at 
org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:610)
at 
org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:606)
at 
org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:606)
at ... run in separate thread using org.apache.spark.util.ThreadUtils 
... ()
at 
org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:606)
at 
org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
Caused by: java.util.concurrent.TimeoutException: Timed out retrying send to 
http://10.210.224.74:7070: 2 hours
at 
com.databricks.rpc.ReliableJettyClient.retryOnNetworkError(ReliableJettyClient.scala:138)
at 
com.databricks.rpc.ReliableJettyClient.sendIdempotent(ReliableJettyClient.scala:46)
at 
com.databricks.backend.daemon.data.client.DbfsClient.doSend(DbfsClient.scala:83)
at 
com.databricks.backend.daemon.data.client.DbfsClient.send0(DbfsClient.scala:60)
at 
com.databricks.backend.daemon.data.client.DbfsClient.sendIdempotent(DbfsClient.scala:40)
at 
com.databricks.backend.daemon.data.client.DatabricksFileSystem.listStatus(DatabricksFileSystem.scala:189)
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLog.initializeOrRecover(FileBasedWriteAheadLog.scala:228)
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLog.<init>(FileBasedWriteAheadLog.scala:72)
at 
org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:141)
at 
org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:141)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.streaming.util.WriteAheadLogUtils$.createLog(WriteAheadLogUtils.scala:140)
at 
org.apache.spark.streaming.util.WriteAheadLogUtils$.createLogForDriver(WriteAheadLogUtils.scala:98)
at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:254)
at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:252)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker.createWriteAheadLog(ReceivedBlockTracker.scala:252)
at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker.<init>(ReceivedBlockTracker.scala:75)
at 
org.apache.spark.streaming.scheduler.ReceiverTracker.(Recei

[jira] [Assigned] (SPARK-14885) When creating a CatalogColumn, we should use catalogString of a DataType

2016-04-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14885:


Assignee: Apache Spark

> When creating a CatalogColumn, we should use catalogString of a DataType
> 
>
> Key: SPARK-14885
> URL: https://issues.apache.org/jira/browse/SPARK-14885
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Minor
>
> Right now, the data type field of a CatalogColumn uses the string 
> representation. When we create this string from a DataType object, there are 
> places where we use simpleString. Although catalogString is the same as 
> simpleString right now, it is better to use catalogString, so that we do not 
> introduce issues when we change the semantics of simpleString or the 
> implementation of catalogString.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14885) When creating a CatalogColumn, we should use catalogString of a DataType

2016-04-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14885:


Assignee: (was: Apache Spark)

> When creating a CatalogColumn, we should use catalogString of a DataType
> 
>
> Key: SPARK-14885
> URL: https://issues.apache.org/jira/browse/SPARK-14885
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> Right now, the data type field of a CatalogColumn uses the string 
> representation. When we create this string from a DataType object, there are 
> places where we use simpleString. Although catalogString is the same as 
> simpleString right now, it is better to use catalogString, so that we do not 
> introduce issues when we change the semantics of simpleString or the 
> implementation of catalogString.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14885) When creating a CatalogColumn, we should use catalogString of a DataType

2016-04-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255708#comment-15255708
 ] 

Apache Spark commented on SPARK-14885:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/12654

> When creating a CatalogColumn, we should use catalogString of a DataType
> 
>
> Key: SPARK-14885
> URL: https://issues.apache.org/jira/browse/SPARK-14885
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> Right now, the data type field of a CatalogColumn uses the string 
> representation. When we create this string from a DataType object, there are 
> places where we use simpleString. Although catalogString is the same as 
> simpleString right now, it is better to use catalogString, so that we do not 
> introduce issues when we change the semantics of simpleString or the 
> implementation of catalogString.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14885) When creating a CatalogColumn, we should use catalogString of a DataType

2016-04-24 Thread Yin Huai (JIRA)
Yin Huai created SPARK-14885:


 Summary: When creating a CatalogColumn, we should use 
catalogString of a DataType
 Key: SPARK-14885
 URL: https://issues.apache.org/jira/browse/SPARK-14885
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Minor


Right now, the data type field of a CatalogColumn uses the string 
representation. When we create this string from a DataType object, there are 
places where we use simpleString. Although catalogString is the same as 
simpleString right now, it is better to use catalogString, so that we do not 
introduce issues when we change the semantics of simpleString or the 
implementation of catalogString.
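As an illustration of the intent (the {{CatalogColumnSketch}} class below is a 
made-up stand-in, not the real CatalogColumn API):

{code}
import org.apache.spark.sql.types._

// Made-up stand-in for a catalog column that stores its data type as a string.
case class CatalogColumnSketch(name: String, dataType: String)

val dt: DataType = DecimalType(10, 2)
// Prefer catalogString over simpleString when the string is persisted to the
// catalog, so future changes to simpleString do not leak into stored metadata.
val col = CatalogColumnSketch("price", dt.catalogString)
{code}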



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14761) PySpark DataFrame.join should reject invalid join methods even when join columns are not specified

2016-04-24 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255697#comment-15255697
 ] 

Josh Rosen commented on SPARK-14761:


There's already logic for handling this in the JVM Spark SQL's analyzer, so we 
don't need to add any specific error-handling logic in PySpark. Instead, we 
need to make sure that we always pass the complete set of arguments to the JVM 
API so that we can rely on it to perform the validation for us.

> PySpark DataFrame.join should reject invalid join methods even when join 
> columns are not specified
> --
>
> Key: SPARK-14761
> URL: https://issues.apache.org/jira/browse/SPARK-14761
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Josh Rosen
>Priority: Minor
>  Labels: starter
>
> In PySpark, the following invalid DataFrame join will not result in an error:
> {code}
> df1.join(df2, how='not-a-valid-join-type')
> {code}
> The signature for `join` is
> {code}
> def join(self, other, on=None, how=None):
> {code}
> and its code ends up completely skipping handling of the `how` parameter when 
> `on` is `None`:
> {code}
>  if on is not None and not isinstance(on, list):
> on = [on]
> if on is None or len(on) == 0:
> jdf = self._jdf.join(other._jdf)
> elif isinstance(on[0], basestring):
> if how is None:
> jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
> else:
> assert isinstance(how, basestring), "how should be basestring"
> jdf = self._jdf.join(other._jdf, self._jseq(on), how)
> else:
> {code}
> Given that this behavior can mask user errors (as in the above example), I 
> think that we should refactor this to first process all arguments and then 
> call the three-argument {{_.jdf.join}}. This would handle the above invalid 
> example by passing all arguments to the JVM DataFrame for analysis.
> I'm not planning to work on this myself, so this bugfix (+ regression test!) 
> is up for grabs in case someone else wants to do it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14708) Repl Serialization Issue

2016-04-24 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255690#comment-15255690
 ] 

Josh Rosen commented on SPARK-14708:


[~srowen], I helped [~bill_chambers] to look into this.

My best guess as to what's happening here is that the {{IntWrapper}} case class 
defined in the REPL cell is capturing a reference to the enclosing REPL cell, 
which itself references all previous REPL cells. The ClosureCleaner only runs 
on closures, not arbitrary Java objects passed to {{parallelize}}, so I think 
that serializing the entries of the {{pairs}} RDD ends up serializing a huge 
object graph.

If I'm right, this may not be easily fixable short of fixing the REPL so that 
it omits the hidden outer pointer from the case class when it can be proven to 
be unnecessary.
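One way to test that guess, sketched below under the assumption that the 
{{IntWrapper}} class from the description has already been defined in the same 
REPL session, is to serialize a single instance directly and look at its size; a 
tiny wrapper that serializes to many kilobytes (and keeps growing across cells) 
would support the enclosing-cell-capture theory:

{code}
// Diagnostic sketch only: measure how large one IntWrapper really is when
// serialized with Java serialization, outside of any Spark job.
import org.apache.spark.SparkConf
import org.apache.spark.serializer.JavaSerializer

val ser = new JavaSerializer(new SparkConf()).newInstance()
val bytes = ser.serialize(new IntWrapper(0))
println(s"serialized IntWrapper size: ${bytes.limit()} bytes")
{code}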

> Repl Serialization Issue
> 
>
> Key: SPARK-14708
> URL: https://issues.apache.org/jira/browse/SPARK-14708
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Bill Chambers
>Priority: Minor
>
> Run this code 6 times with the :paste command in Spark. You'll see 
> exponential slowdowns.
> class IntWrapper(val i: Int) extends Serializable {  }
> var pairs = sc.parallelize(Array((0, new IntWrapper(0))))
> for (_ <- 0 until 3) {
>   val wrapper = pairs.values.reduce((x,_) => x)
>   pairs = pairs.mapValues(_ => wrapper)
> }
> val result = pairs.collect()
> https://forums.databricks.com/questions/7729/delays-when-running-program-multiple-times-in-note.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14548) Support !> and !< operator in Spark SQL

2016-04-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14548.
-
   Resolution: Fixed
 Assignee: Jia Li
Fix Version/s: 2.0.0

> Support !> and !< operator in Spark SQL
> ---
>
> Key: SPARK-14548
> URL: https://issues.apache.org/jira/browse/SPARK-14548
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jia Li
>Assignee: Jia Li
>Priority: Minor
> Fix For: 2.0.0
>
>
> !< means not less than which is equivalent to >=
> !> means not greater than which is equivalent to <= 
> I'd like to create a PR to support these two operators.
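For illustration, the intended semantics would look roughly like this once the 
operators are supported (table and column names are made up):

{code}
// Hypothetical usage:
spark.sql("SELECT * FROM people WHERE age !> 30")   // equivalent to: age <= 30
spark.sql("SELECT * FROM people WHERE age !< 18")   // equivalent to: age >= 18
{code}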



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4452) Shuffle data structures can starve others on the same thread for memory

2016-04-24 Thread Romi Kuntsman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255658#comment-15255658
 ] 

Romi Kuntsman commented on SPARK-4452:
--

Hi, what's the reason this will only be available in Spark 2.0.0, and not 1.6.4 
or 1.7.0?

> Shuffle data structures can starve others on the same thread for memory 
> 
>
> Key: SPARK-4452
> URL: https://issues.apache.org/jira/browse/SPARK-4452
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Tianshuo Deng
>Assignee: Tianshuo Deng
> Fix For: 2.0.0
>
>
> When an Aggregator is used with ExternalSorter in a task, Spark will create 
> many small files, which can cause a "too many open files" error during merging.
> Currently, ShuffleMemoryManager does not work well when there are two spillable 
> objects in a thread, which in this case are ExternalSorter and 
> ExternalAppendOnlyMap (used by Aggregator). Here is an example: due to the use 
> of map-side aggregation, ExternalAppendOnlyMap is created first to read the 
> RDD. It may ask for as much memory as it can, which is 
> totalMem/numberOfThreads. Later, when ExternalSorter is created in the same 
> thread, the ShuffleMemoryManager may refuse to allocate more memory to it, 
> since the memory has already been given to the previously requesting object 
> (ExternalAppendOnlyMap). That causes the ExternalSorter to keep spilling small 
> files (due to the lack of memory).
> I'm currently working on a PR to address these two issues. It will include the 
> following changes:
> 1. The ShuffleMemoryManager should track not only the memory usage of each 
> thread, but also which object holds the memory.
> 2. The ShuffleMemoryManager should be able to trigger the spilling of a 
> spillable object. In this way, if a new object in a thread requests memory, 
> the old occupant can be evicted/spilled. Previously the spillable objects 
> triggered spilling by themselves, so one might not trigger spilling even if 
> another object in the same thread needed more memory. After this change the 
> ShuffleMemoryManager can trigger the spilling of an object if it needs to.
> 3. Make the iterator of ExternalAppendOnlyMap spillable. Previously 
> ExternalAppendOnlyMap returned a destructive iterator and could not be spilled 
> after the iterator was returned. This should be changed so that even after the 
> iterator is returned, the ShuffleMemoryManager can still spill it.
> Currently, I have a working branch in progress: 
> https://github.com/tsdeng/spark/tree/enhance_memory_manager. I have already 
> made change 3 and have a prototype of changes 1 and 2 to evict spillables from 
> the memory manager; still in progress. I will send a PR when it's done.
> Any feedback or thoughts on this change are highly appreciated!
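To make points 1 and 2 above concrete, here is a minimal sketch of the idea 
(illustrative only, not Spark's actual ShuffleMemoryManager): memory is tracked 
per consumer object rather than per thread, and when a new consumer cannot get 
memory, an existing consumer can be asked to spill.

{code}
// Sketch: per-consumer bookkeeping plus manager-triggered spilling.
trait Spillable {
  def spill(): Long // returns the number of bytes released
}

class SimpleTaskMemoryManager(maxBytes: Long) {
  private val used = scala.collection.mutable.Map.empty[Spillable, Long]

  def acquire(consumer: Spillable, bytes: Long): Boolean = synchronized {
    var free = maxBytes - used.values.sum
    if (free < bytes) {
      // Ask other consumers (largest first) to spill until enough memory is free.
      for (victim <- used.keys.toSeq.sortBy(-used(_)) if victim != consumer && free < bytes) {
        val released = victim.spill()
        used(victim) = math.max(0L, used(victim) - released)
        free += released
      }
    }
    if (free >= bytes) {
      used(consumer) = used.getOrElse(consumer, 0L) + bytes
      true
    } else {
      false
    }
  }

  def release(consumer: Spillable, bytes: Long): Unit = synchronized {
    used(consumer) = math.max(0L, used.getOrElse(consumer, 0L) - bytes)
  }
}
{code}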



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14691) Simplify and Unify Error Generation for Unsupported Alter Table DDL

2016-04-24 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-14691.
---
   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.0.0

> Simplify and Unify Error Generation for Unsupported Alter Table DDL
> ---
>
> Key: SPARK-14691
> URL: https://issues.apache.org/jira/browse/SPARK-14691
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> So far, we capture each unsupported ALTER TABLE variant in a separate visit 
> function. These should be unified to issue a ParseException instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]

2016-04-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14875:


Assignee: Apache Spark  (was: Cheng Lian)

> OutputWriterFactory.newInstance shouldn't be private[sql]
> -
>
> Key: SPARK-14875
> URL: https://issues.apache.org/jira/browse/SPARK-14875
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> Existing packages like spark-avro need to access 
> {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in 
> Spark 2.0. We should make it public again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]

2016-04-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255633#comment-15255633
 ] 

Apache Spark commented on SPARK-14875:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12652

> OutputWriterFactory.newInstance shouldn't be private[sql]
> -
>
> Key: SPARK-14875
> URL: https://issues.apache.org/jira/browse/SPARK-14875
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Existing packages like spark-avro need to access 
> {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in 
> Spark 2.0. We should make it public again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14875) OutputWriterFactory.newInstance shouldn't be private[sql]

2016-04-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14875:


Assignee: Cheng Lian  (was: Apache Spark)

> OutputWriterFactory.newInstance shouldn't be private[sql]
> -
>
> Key: SPARK-14875
> URL: https://issues.apache.org/jira/browse/SPARK-14875
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Existing packages like spark-avro need to access 
> {{OutputWriterFactory.newInstance}}, but it's marked as {{private\[sql\]}} in 
> Spark 2.0. We should make it public again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14761) PySpark DataFrame.join should reject invalid join methods even when join columns are not specified

2016-04-24 Thread Krishna Kalyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255595#comment-15255595
 ] 

Krishna Kalyan edited comment on SPARK-14761 at 4/24/16 2:04 PM:
-

Could someone please advise on how the error should be handled if it's an 
invalid join type? I am assuming it should return an error message like 
"Incorrect join parameter"; is that okay?

Thanks,
Krishna


was (Author: krishnakalyan3):
Could someone please advise on how the error should be handled if it's an 
invalid join type? I am assuming it should return an error message like 
"Please specify a correct join parameter"; is that okay?

Thanks,
Krishna

> PySpark DataFrame.join should reject invalid join methods even when join 
> columns are not specified
> --
>
> Key: SPARK-14761
> URL: https://issues.apache.org/jira/browse/SPARK-14761
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Josh Rosen
>Priority: Minor
>  Labels: starter
>
> In PySpark, the following invalid DataFrame join will not result in an error:
> {code}
> df1.join(df2, how='not-a-valid-join-type')
> {code}
> The signature for `join` is
> {code}
> def join(self, other, on=None, how=None):
> {code}
> and its code ends up completely skipping handling of the `how` parameter when 
> `on` is `None`:
> {code}
>  if on is not None and not isinstance(on, list):
> on = [on]
> if on is None or len(on) == 0:
> jdf = self._jdf.join(other._jdf)
> elif isinstance(on[0], basestring):
> if how is None:
> jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
> else:
> assert isinstance(how, basestring), "how should be basestring"
> jdf = self._jdf.join(other._jdf, self._jseq(on), how)
> else:
> {code}
> Given that this behavior can mask user errors (as in the above example), I 
> think that we should refactor this to first process all arguments and then 
> call the three-argument {{_.jdf.join}}. This would handle the above invalid 
> example by passing all arguments to the JVM DataFrame for analysis.
> I'm not planning to work on this myself, so this bugfix (+ regression test!) 
> is up for grabs in case someone else wants to do it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14761) PySpark DataFrame.join should reject invalid join methods even when join columns are not specified

2016-04-24 Thread Krishna Kalyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255595#comment-15255595
 ] 

Krishna Kalyan commented on SPARK-14761:


Could someone please advise on how the error should be handled if it's an 
invalid join type? I am assuming it should return an error message like 
"Please specify a correct join parameter"; is that okay?

Thanks,
Krishna

> PySpark DataFrame.join should reject invalid join methods even when join 
> columns are not specified
> --
>
> Key: SPARK-14761
> URL: https://issues.apache.org/jira/browse/SPARK-14761
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Josh Rosen
>Priority: Minor
>  Labels: starter
>
> In PySpark, the following invalid DataFrame join will not result in an error:
> {code}
> df1.join(df2, how='not-a-valid-join-type')
> {code}
> The signature for `join` is
> {code}
> def join(self, other, on=None, how=None):
> {code}
> and its code ends up completely skipping handling of the `how` parameter when 
> `on` is `None`:
> {code}
>  if on is not None and not isinstance(on, list):
> on = [on]
> if on is None or len(on) == 0:
> jdf = self._jdf.join(other._jdf)
> elif isinstance(on[0], basestring):
> if how is None:
> jdf = self._jdf.join(other._jdf, self._jseq(on), "inner")
> else:
> assert isinstance(how, basestring), "how should be basestring"
> jdf = self._jdf.join(other._jdf, self._jseq(on), how)
> else:
> {code}
> Given that this behavior can mask user errors (as in the above example), I 
> think that we should refactor this to first process all arguments and then 
> call the three-argument {{_.jdf.join}}. This would handle the above invalid 
> example by passing all arguments to the JVM DataFrame for analysis.
> I'm not planning to work on this myself, so this bugfix (+ regression test!) 
> is up for grabs in case someone else wants to do it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser

2016-04-24 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255587#comment-15255587
 ] 

Herman van Hovell edited comment on SPARK-14591 at 4/24/16 1:42 PM:


[~yhuai] We define the reserved keywords here: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4.

Any keyword that is not a part of the {{nonReserved}} rule is a reserved 
keyword. The current reserved keywords can be divided into three groups:
- Symbols: {{+, -, /, !=, <>, ...}}
- Keywords that cannot be made non-reserved (confirmed): {{ANTI, FULL, INNER, 
LEFT, SEMI, RIGHT, NATURAL, UNION, INTERSECT, EXCEPT, DATABASE, SCHEMA}}
- Keywords that can probably be made non-reserved: {{AND, CASE, CAST, CROSS, 
DISTINCT, DIV, ELSE, END, FROM, FUNCTION, HAVING, INTERVAL, JOIN, MACRO, NOT, 
ON, OR, SELECT, STRATIFY, THEN, UNBOUNDED, WHEN, WHERE}}

We could add all reserved keywords to the {{looseNonReserved}} rule and use 
that rule for all names except table aliases (which is where we run into 
trouble).


was (Author: hvanhovell):
[~yhuai] We define the reserved keywords here: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4.

Any keyword that is not a part of the {{nonReserved}} rule is a reserved 
keyword. The current reserved keywords can be divided into three groups:
- Symbols: {{+, -, /, !=, <>, ...}}
- Keywords that cannot be made non-reserved (confirmed): {{ANTI, FULL, INNER, 
LEFT, SEMI, RIGHT, NATURAL, UNION, INTERSECT, EXCEPT, DATABASE, SCHEMA}}
- Keywords that can probably be made non-reserved: {{AND, CASE, CAST, CROSS, 
DISTINCT, DIV, ELSE, END, FROM, FUNCTION, HAVING, INTERVAL, JOIN, MACRO, NOT, 
ON, OR, SELECT, STRATIFY, THEN, UNBOUNDED, WHEN, WHERE}}

We could add all reserved keywords to the {{looseNonReserved}} rule and use 
that rule for all names except aliases (which is where we run into trouble).

> Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> Since our parser, which is based on ANTLR 4, can parse data types (see 
> CatalystSqlParser), we can remove 
> org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
> parser's functionality is a superset of DataTypeParser; then we can remove 
> DataTypeParser. For the DataTypeParser object, we can keep it and let it just 
> call the parseDataType method of CatalystSqlParser.
> *The original description is shown below*
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14591) Remove org.apache.spark.sql.catalyst.parser.DataTypeParser

2016-04-24 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255587#comment-15255587
 ] 

Herman van Hovell commented on SPARK-14591:
---

[~yhuai] We define the reserved keywords here: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4.

Any keyword that is not a part of the {{nonReserved}} rule is a reserved 
keyword. The current reserved keywords can be divided into three groups:
- Symbols: {{+, -, /, !=, <>, ...}}
- Keywords that cannot be made non-reserved (confirmed): {{ANTI, FULL, INNER, 
LEFT, SEMI, RIGHT, NATURAL, UNION, INTERSECT, EXCEPT, DATABASE, SCHEMA}}
- Keywords that can probably be made non-reserved: {{AND, CASE, CAST, CROSS, 
DISTINCT, DIV, ELSE, END, FROM, FUNCTION, HAVING, INTERVAL, JOIN, MACRO, NOT, 
ON, OR, SELECT, STRATIFY, THEN, UNBOUNDED, WHEN, WHERE}}

We could add all reserved keywords to the {{looseNonReserved}} rule and use 
that rule for all names except aliases (which is where we run into trouble).
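As a rough illustration of what "reserved" means in practice here (behavior 
sketched for the current ANTLR-based parser; the alias names are made up):

{code}
// FROM is currently reserved, so it cannot be used as a bare identifier:
spark.sql("SELECT 1 AS from")     // parse error today
// Quoting with backticks works regardless of whether the keyword is reserved:
spark.sql("SELECT 1 AS `from`")
{code}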

> Remove org.apache.spark.sql.catalyst.parser.DataTypeParser
> --
>
> Key: SPARK-14591
> URL: https://issues.apache.org/jira/browse/SPARK-14591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> Since our parser, which is based on ANTLR 4, can parse data types (see 
> CatalystSqlParser), we can remove 
> org.apache.spark.sql.catalyst.parser.DataTypeParser. Let's make sure the new 
> parser's functionality is a superset of DataTypeParser; then we can remove 
> DataTypeParser. For the DataTypeParser object, we can keep it and let it just 
> call the parseDataType method of CatalystSqlParser.
> *The original description is shown below*
> Right now, our DDLParser does not support {{decimal(precision)}} (the scale 
> will be set to 0). We should support it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13267) Document ?params for the v1 REST API

2016-04-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13267:
--
Assignee: Steve Loughran

> Document ?params for the v1 REST API
> 
>
> Key: SPARK-13267
> URL: https://issues.apache.org/jira/browse/SPARK-13267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.0.0
>
>
> There are various ?param options in the v1 REST API that aren't mentioned 
> anywhere except in the HistoryServerSuite. They should be documented in 
> monitoring.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13267) Document ?params for the v1 REST API

2016-04-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13267.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11152
[https://github.com/apache/spark/pull/11152]

> Document ?params for the v1 REST API
> 
>
> Key: SPARK-13267
> URL: https://issues.apache.org/jira/browse/SPARK-13267
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>Priority: Minor
> Fix For: 2.0.0
>
>
> There are various ?param options in the v1 REST API that aren't mentioned 
> anywhere except in the HistoryServerSuite. They should be documented in 
> monitoring.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14870) NPE in generate aggregate

2016-04-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14870:


Assignee: Apache Spark  (was: Sameer Agarwal)

> NPE in generate aggregate
> -
>
> Key: SPARK-14870
> URL: https://issues.apache.org/jira/browse/SPARK-14870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> When running TPCDS Q14a:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 
> (TID 234, localhost): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576)
>   at 
> org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1780)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1806)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:879)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:

[jira] [Assigned] (SPARK-14870) NPE in generate aggregate

2016-04-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14870:


Assignee: Sameer Agarwal  (was: Apache Spark)

> NPE in generate aggregate
> -
>
> Key: SPARK-14870
> URL: https://issues.apache.org/jira/browse/SPARK-14870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Sameer Agarwal
>
> When running TPCDS Q14a:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 
> (TID 234, localhost): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576)
>   at 
> org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1780)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1806)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:879)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scal

[jira] [Commented] (SPARK-14870) NPE in generate aggregate

2016-04-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255528#comment-15255528
 ] 

Apache Spark commented on SPARK-14870:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12651

> NPE in generate aggregate
> -
>
> Key: SPARK-14870
> URL: https://issues.apache.org/jira/browse/SPARK-14870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Sameer Agarwal
>
> When running TPCDS Q14a
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 
> (TID 234, localhost): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576)
>   at 
> org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1780)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1806)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:879)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withN
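
The query referenced above is TPCDS Q14a, which is not reproduced here. As a 
rough, hypothetical sketch of the kind of plan that exercises the failing path 
(a grouped hash aggregate over a nullable DecimalType column, where the 
generated agg_doAggregateWithKeys code writes the aggregation buffer through 
ColumnarBatch$Row.setDecimal and ColumnVector.putDecimal), something like the 
following may serve. The names, data and session setup are illustrative 
assumptions, and it may or may not trigger the NPE on a given build:

{code}
// Hypothetical minimal sketch, NOT TPCDS Q14a: a hash aggregate with grouping
// keys over a nullable decimal column, i.e. the code path named in the trace.
import org.apache.spark.sql.SparkSession

object DecimalAggSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-14870-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // A nullable decimal column: the null value is what we suspect reaches the
    // generated setDecimal/putDecimal call without a null check.
    val df = Seq(
      ("a", Some(BigDecimal("1.10"))),
      ("a", None),
      ("b", Some(BigDecimal("2.20")))
    ).toDF("key", "amount")
      .selectExpr("key", "CAST(amount AS DECIMAL(20, 2)) AS amount")

    // Aggregate with grouping keys, which maps to agg_doAggregateWithKeys
    // in the whole-stage-codegen generated iterator.
    df.groupBy("key").sum("amount").show()

    spark.stop()
  }
}
{code}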

[jira] [Updated] (SPARK-14884) Fix call site for continuous queries

2016-04-24 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-14884:
--
Description: 
Since we've been processing continuous queries in separate threads, the call 
sites are then +run at <unknown>:0+. It's not wrong, but it provides very 
little information; in addition, we cannot distinguish two queries from their 
call sites alone.

!https://cloud.githubusercontent.com/assets/15843379/14766101/a47246b2-0a30-11e6-8d81-06a9a600113b.png!

!https://cloud.githubusercontent.com/assets/15843379/14766102/a4750226-0a30-11e6-9ada-773d977d902b.png!

  was:Since we've been processing continuous queries in separate threads, the 
call sites are then +run at <unknown>:0+. It's not wrong, but it provides very 
little information; in addition, we cannot distinguish two queries from their 
call sites alone.


> Fix call site for continuous queries
> 
>
> Key: SPARK-14884
> URL: https://issues.apache.org/jira/browse/SPARK-14884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> Since we've been processing continuous queries in separate threads, the call 
> sites are then +run at <unknown>:0+. It's not wrong, but it provides very 
> little information; in addition, we cannot distinguish two queries from their 
> call sites alone.
> !https://cloud.githubusercontent.com/assets/15843379/14766101/a47246b2-0a30-11e6-8d81-06a9a600113b.png!
> !https://cloud.githubusercontent.com/assets/15843379/14766102/a4750226-0a30-11e6-9ada-773d977d902b.png!
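
For context, a sketch of the general idea (this is not the actual patch for 
this issue; the names and the label format are illustrative assumptions): 
capture a descriptive call-site label on the thread that defines the query, 
then re-apply it on the background thread that submits the jobs, using the 
public SparkContext.setCallSite / clearCallSite APIs.

{code}
// Hypothetical sketch, assuming the fix amounts to propagating a call-site
// label from the defining thread to the query's execution thread.
import org.apache.spark.SparkContext

object CallSitePropagationSketch {
  // `queryName` is a hypothetical label; the real fix would derive it from the
  // user's call site where the continuous query is defined.
  def runInBackground(sc: SparkContext, queryName: String)(body: => Unit): Thread = {
    // Captured on the caller's thread, where the query is defined.
    val callSiteLabel = s"start of query '$queryName'"
    val t = new Thread(s"stream-query-$queryName") {
      override def run(): Unit = {
        // setCallSite stores the label in thread-local job properties, so jobs
        // submitted from this thread show it in the Web UI instead of
        // "run at <unknown>:0".
        sc.setCallSite(callSiteLabel)
        try body
        finally sc.clearCallSite()
      }
    }
    t.setDaemon(true)
    t.start()
    t
  }
}
{code}

Because the call site is kept in thread-local properties, setting it on the 
query's own thread is what makes the UI label informative; without that step, 
every query run from a background thread collapses to the same +run at 
<unknown>:0+ entry.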



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14884) Fix call site for continuous queries

2016-04-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14884:


Assignee: Apache Spark

> Fix call site for continuous queries
> 
>
> Key: SPARK-14884
> URL: https://issues.apache.org/jira/browse/SPARK-14884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Apache Spark
>Priority: Minor
>
> Since we've been processing continuous queries in separate threads, the call 
> sites are then +run at <unknown>:0+. It's not wrong, but it provides very 
> little information; in addition, we cannot distinguish two queries from their 
> call sites alone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14884) Fix call site for continuous queries

2016-04-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14884:


Assignee: (was: Apache Spark)

> Fix call site for continuous queries
> 
>
> Key: SPARK-14884
> URL: https://issues.apache.org/jira/browse/SPARK-14884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> Since we've been processing continuous queries in separate threads, the call 
> sites are then +run at <unknown>:0+. It's not wrong, but it provides very 
> little information; in addition, we cannot distinguish two queries from their 
> call sites alone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14884) Fix call site for continuous queries

2016-04-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255525#comment-15255525
 ] 

Apache Spark commented on SPARK-14884:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12650

> Fix call site for continuous queries
> 
>
> Key: SPARK-14884
> URL: https://issues.apache.org/jira/browse/SPARK-14884
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> Since we've been processing continuous queries in separate threads, the call 
> sites are then +run at <unknown>:0+. It's not wrong, but it provides very 
> little information; in addition, we cannot distinguish two queries from their 
> call sites alone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14884) Fix call site for continuous queries

2016-04-24 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-14884:
-

 Summary: Fix call site for continuous queries
 Key: SPARK-14884
 URL: https://issues.apache.org/jira/browse/SPARK-14884
 Project: Spark
  Issue Type: Bug
  Components: SQL, Web UI
Affects Versions: 2.0.0
Reporter: Liwei Lin
Priority: Minor


Since we've been processing continuous queries in separate threads, the call 
sites are then +run at <unknown>:0+. It's not wrong, but it provides very 
little information; in addition, we cannot distinguish two queries from their 
call sites alone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org