[jira] [Resolved] (SPARK-14767) Codegen "no constructor found" errors with Maps inside case classes in Datasets

2016-12-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-14767.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Codegen "no constructor found" errors with Maps inside case classes in 
> Datasets
> ---
>
> Key: SPARK-14767
> URL: https://issues.apache.org/jira/browse/SPARK-14767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>Priority: Critical
> Fix For: 2.2.0
>
>
> When I have a `Map` inside a case class and am trying to use Datasets,
> the simplest operation throws an exception, because the generated code is 
> looking for a constructor with `scala.collection.Map` whereas the constructor 
> takes `scala.collection.immutable.Map`.
> To reproduce:
> {code}
> case class Bug(bug: Map[String, String])
> val ds = Seq(Bug(Map("name" -> "dummy"))).toDS()
> ds.map { b =>
>   b.bug.getOrElse("name", null)
> }.count()
> {code}
> Stacktrace:
> {code}
> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 163, Column 150: No applicable constructor/method 
> found for actual parameters "scala.collection.Map"; candidates are: 
> Bug(scala.collection.immutable.Map)"
> {code}
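A possible user-side workaround (an assumption based on the error message above, not the change that resolved this ticket) is to declare the field with the interface type the deserializer actually passes, scala.collection.Map:

{code}
// Workaround sketch: declaring the field as scala.collection.Map lets the generated
// constructor call match the scala.collection.Map value produced during deserialization.
case class BugWorkaround(bug: scala.collection.Map[String, String])

// In spark-shell, with `import spark.implicits._` in scope:
// val ds = Seq(BugWorkaround(Map("name" -> "dummy"))).toDS()
// ds.map(b => b.bug.getOrElse("name", null)).count()
{code}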



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)

2016-12-12 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744481#comment-15744481
 ] 

Felix Cheung commented on SPARK-4591:
-

Is SVM part of this?

> Algorithm/model parity for spark.ml (Scala)
> ---
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve feature 
> parity for the next release.
> Subtasks cover major algorithm groups.  To pick up a review subtask, please:
> * Comment that you are working on it.
> * Compare the public APIs of spark.ml vs. spark.mllib.
> * Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> * Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * single-Row prediction: [SPARK-10413]
> Also, this does not include the following items (but will eventually):
> * User-facing:
> ** Streaming ML
> ** evaluation
> ** pmml
> ** stat
> ** linalg [SPARK-13944]
> * Developer-facing:
> ** optimization
> ** random, rdd
> ** util



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18717) Datasets - crash (compile exception) when mapping to immutable scala map

2016-12-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18717.
-
   Resolution: Fixed
 Assignee: Andrew Ray
Fix Version/s: 2.2.0

> Datasets - crash (compile exception) when mapping to immutable scala map
> 
>
> Key: SPARK-18717
> URL: https://issues.apache.org/jira/browse/SPARK-18717
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Damian Momot
>Assignee: Andrew Ray
> Fix For: 2.2.0
>
>
> {code}
> val spark: SparkSession = ???
> case class Test(id: String, map_test: Map[Long, String])
> spark.sql("CREATE TABLE xyz.map_test (id string, map_test map) 
> STORED AS PARQUET")
> spark.sql("SELECT * FROM xyz.map_test").as[Test].map(t => t).collect()
> {code}
> {code}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 307, Column 108: No applicable constructor/method found for actual parameters 
> "java.lang.String, scala.collection.Map"; candidates are: 
> "$line14.$read$$iw$$iw$Test(java.lang.String, scala.collection.immutable.Map)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?

2016-12-12 Thread Vicente Masip (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744465#comment-15744465
 ] 

Vicente Masip commented on SPARK-18823:
---

Well, maybe I haven't explained myself. I wrote the right-hand side both ways in my 
non-Spark examples. What I need is for the left-hand side of the assignment to be a 
variable, as it is in those non-Spark examples.

Imagine you have to operate over 30 columns. Do I have to hand-write the operation 30 
times? Can't the left-hand column name be a variable? This seems very important to me.

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
> Fix For: 2.0.2
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or whether it can be done with some function:
> Sometimes it is very important to assign to a column whose name has to be 
> accessed through a variable. Outside of SparkR, I have always done this with double 
> brackets, like this:
> # df could be the faithful data frame or a data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even by column number
> df[[2]] <- df$eruptions
> The error is not caused by the right-hand side of the "<-" assignment operator. 
> The problem is that I can't assign to a column name through a variable or a 
> column number as I do in these non-Spark examples. It doesn't matter whether I am 
> modifying or creating the column; the problem is the same.
> I have also tried the following, with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18829) Printing to logger

2016-12-12 Thread David Hodeffi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744464#comment-15744464
 ] 

David Hodeffi commented on SPARK-18829:
---

The issue is that it's hard to debug an application on YARN: the stdout file is **not** 
the same file as the logger file. 
Option 1: change the function to take an OutputStream argument and write the string to 
it, but that is overkill.
Option 2: simply go to log4j.properties and configure the logger to print that output.
Right now neither option 1 nor option 2 is possible, which is why I suggested printing 
to the logger as well. 
Redirecting standard output just because you want to see this one specific log is not a 
good idea, because it will spam the log file with everything that is printed to 
standard output, and you cannot really control that.
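In the meantime, an application-level workaround sketch (an assumption, not a change to Spark itself; it is written against the Spark 2.x Dataset API, relies on show() printing through Scala's Console, and assumes an slf4j logger on the classpath) is to capture the console output and forward it to the logger:

{code}
import java.io.ByteArrayOutputStream

import org.apache.spark.sql.Dataset
import org.slf4j.LoggerFactory

object LoggedShow {
  private val logger = LoggerFactory.getLogger(getClass)

  // Capture what df.show() prints to stdout and send the same text to the logger.
  def logShow[T](df: Dataset[T], numRows: Int = 20): Unit = {
    val buffer = new ByteArrayOutputStream()
    // Console.withOut redirects println output of the enclosed block into `buffer`.
    Console.withOut(buffer) { df.show(numRows) }
    logger.debug(buffer.toString("UTF-8"))
  }
}
{code}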

> Printing to logger
> --
>
> Key: SPARK-18829
> URL: https://issues.apache.org/jira/browse/SPARK-18829
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2
> Environment: ALL
>Reporter: David Hodeffi
>Priority: Trivial
>  Labels: easyfix, patch
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I would like to print dataframe.show or df.explain(true) into a log file.
> Right now the code prints to standard output without a way to redirect it,
> and this cannot be configured in log4j.properties either.
> My suggestion is to write to both the logger and standard output, i.e.:
> class DataFrame {
>   override def explain(extended: Boolean): Unit = {
>     val explain = ExplainCommand(queryExecution.logical, extended = extended)
>     sqlContext.executePlan(explain).executedPlan.executeCollect().foreach { r =>
>       // scalastyle:off println
>       println(r.getString(0))
>       // scalastyle:on println
>       logger.debug(r.getString(0))
>     }
>   }
>
>   def show(numRows: Int, truncate: Boolean): Unit = {
>     val str = showString(numRows, truncate)
>     println(str)
>     logger.debug(str)
>   }
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18838) High latency of event processing for large jobs

2016-12-12 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-18838:

Description: 
Currently we are observing very high event processing delay in the driver's 
`ListenerBus` for large jobs with many tasks. Many critical components of the 
scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend on 
`ListenerBus` events, and this delay is causing job failures. For example, a 
significant delay in receiving `SparkListenerTaskStart` might cause 
`ExecutorAllocationManager` to remove an executor which is not idle.  The event 
processor in `ListenerBus` is a single thread which loops through all the listeners 
for each event and processes each event synchronously: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94
 
The single-threaded processor often becomes the bottleneck for large jobs.  In 
addition, if one of the listeners is very slow, all the listeners will pay the 
price of the delay incurred by the slow listener. 

To solve the above problems, we plan to have a per-listener single-threaded 
executor service and a separate event queue. That way we are not bottlenecked by 
the single-threaded event processor, and critical listeners will not be penalized 
by slow listeners. The downside of this approach is that a separate event queue 
per listener will increase the driver's memory footprint. 




  was:
Currently we are observing the issue of very high event processing delay in 
driver's `ListenerBus` for large jobs with many tasks. Many critical component 
of the scheduler like `ExecutorAllocationManager`, `HeartbeatReceiver` depend 
on the `ListenerBus` events and these delay is causing job failure. For 
example, a significant delay in receiving the `SparkListenerTaskStart` might 
cause `ExecutorAllocationManager` manager to remove an executor which is not 
idle.  The event processor in `ListenerBus` is a single thread which loops 
through all the Listeners for each event and processes each event synchronously 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
 
The single threaded processor often becomes the bottleneck for large jobs.  In 
addition to that, if one of the Listener is very slow, all the listeners will 
pay the price of delay incurred by the slow listener. 

To solve the above problems, we plan to have a single threaded executor service 
and separate event queue per listener. That way we are not bottlenecked by the 
single threaded processor and also critical listeners will not be penalized by 
the slow listeners. The downside of this approach is separate event queue per 
listener will increase the driver memory footprint. 





> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delay in the 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, such as `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on `ListenerBus` events, and this delay is 
> causing job failures. For example, a significant delay in receiving 
> `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to 
> remove an executor which is not idle.  The event processor in `ListenerBus` 
> is a single thread which loops through all the listeners for each event and 
> processes each event synchronously: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94
>  
> The single-threaded processor often becomes the bottleneck for large jobs.  
> In addition, if one of the listeners is very slow, all the listeners 
> will pay the price of the delay incurred by the slow listener. 
> To solve the above problems, we plan to have a per-listener single-threaded 
> executor service and a separate event queue. That way we are not bottlenecked 
> by the single-threaded event processor, and critical listeners will not be 
> penalized by slow listeners. The downside of this approach is that a separate 
> event queue per listener will increase the driver's memory footprint. 
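For illustration, a minimal standalone sketch of the proposed shape (illustrative names only, not Spark's actual LiveListenerBus): each listener gets its own queue drained by its own single-threaded executor, so a slow listener only backs up its own queue.

{code}
import java.util.concurrent.{ExecutorService, Executors, LinkedBlockingQueue}

trait Listener { def onEvent(event: String): Unit }

class PerListenerBus(listeners: Seq[Listener]) {
  // One queue per listener; events are handed off instead of run on the poster's thread.
  private val queues: Seq[(Listener, LinkedBlockingQueue[String])] =
    listeners.map(l => (l, new LinkedBlockingQueue[String]()))

  // One single-threaded executor per listener drains that listener's queue.
  private val executors: Seq[ExecutorService] = queues.map { case (listener, queue) =>
    val exec = Executors.newSingleThreadExecutor()
    exec.submit(new Runnable {
      override def run(): Unit =
        try {
          while (true) listener.onEvent(queue.take()) // blocks until an event arrives
        } catch {
          case _: InterruptedException => () // shutdownNow() interrupts the blocked take()
        }
    })
    exec
  }

  // Posting only enqueues, so a slow listener delays its own queue, not the others.
  def post(event: String): Unit = queues.foreach { case (_, queue) => queue.put(event) }

  def stop(): Unit = executors.foreach(_.shutdownNow())
}
{code}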



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?

2016-12-12 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744415#comment-15744415
 ] 

Felix Cheung commented on SPARK-18823:
--

How important is it to support
df[[myname]] <- c(1:nrow(df))
or
df[[2]] <- df$eruptions

I think we should support
df$waiting <- c(1:nrow(df))

which I plan to work on.

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
> Fix For: 2.0.2
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or whether it can be done with some function:
> Sometimes it is very important to assign to a column whose name has to be 
> accessed through a variable. Outside of SparkR, I have always done this with double 
> brackets, like this:
> # df could be the faithful data frame or a data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even by column number
> df[[2]] <- df$eruptions
> The error is not caused by the right-hand side of the "<-" assignment operator. 
> The problem is that I can't assign to a column name through a variable or a 
> column number as I do in these non-Spark examples. It doesn't matter whether I am 
> modifying or creating the column; the problem is the same.
> I have also tried the following, with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18794) SparkR vignette update: gbt

2016-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744403#comment-15744403
 ] 

Apache Spark commented on SPARK-18794:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/16264

> SparkR vignette update: gbt
> ---
>
> Key: SPARK-18794
> URL: https://issues.apache.org/jira/browse/SPARK-18794
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update vignettes to cover gradient boosted trees



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18793) SparkR vignette update: random forest

2016-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744390#comment-15744390
 ] 

Apache Spark commented on SPARK-18793:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/16264

> SparkR vignette update: random forest
> -
>
> Key: SPARK-18793
> URL: https://issues.apache.org/jira/browse/SPARK-18793
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update vignettes to cover randomForest



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18792) SparkR vignette update: logit

2016-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744387#comment-15744387
 ] 

Apache Spark commented on SPARK-18792:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/16264

> SparkR vignette update: logit
> -
>
> Key: SPARK-18792
> URL: https://issues.apache.org/jira/browse/SPARK-18792
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update vignettes to cover logit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2016-12-12 Thread caolan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744380#comment-15744380
 ] 

caolan commented on SPARK-17147:


Sure, I will give your branch a try and will let you know the result.

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.
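To make the offset behaviour concrete, a small standalone sketch with the plain kafka-clients consumer (not the CachedKafkaConsumer code in question; broker, group, and topic names are placeholders) that derives the next expected offset from the record actually returned:

{code}
import java.util.{Collections, Properties}

import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.KafkaConsumer

object CompactedOffsets {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker
    props.put("group.id", "compaction-demo")         // placeholder group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("compacted-topic")) // placeholder topic

    var nextOffset = 0L
    for (record <- consumer.poll(1000L).asScala) {
      // On a compacted topic record.offset() can jump past nextOffset, so the next
      // expected offset is derived from the record that actually came back.
      nextOffset = record.offset() + 1
      println(s"got offset ${record.offset()}, next expected offset $nextOffset")
    }
    consumer.close()
  }
}
{code}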



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment

2016-12-12 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-18840:

Priority: Minor  (was: Major)

> HDFSCredentialProvider throws exception in non-HDFS security environment
> 
>
> Key: SPARK-18840
> URL: https://issues.apache.org/jira/browse/SPARK-18840
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently in {{HDFSCredentialProvider}}, the code logic assumes that an HDFS 
> delegation token exists. This is fine for an HDFS environment, but in some cloud 
> environments such as Azure, HDFS is not required, so it throws an exception:
> {code}
> java.util.NoSuchElementException: head of empty list
> at scala.collection.immutable.Nil$.head(List.scala:337)
> at scala.collection.immutable.Nil$.head(List.scala:334)
> at 
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
> {code}
> We should also consider this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment

2016-12-12 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744370#comment-15744370
 ] 

Saisai Shao commented on SPARK-18840:
-

This problem also exists in branch 1.6, but the fix there is a little more complicated 
than on master.

> HDFSCredentialProvider throws exception in non-HDFS security environment
> 
>
> Key: SPARK-18840
> URL: https://issues.apache.org/jira/browse/SPARK-18840
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Saisai Shao
>
> Currently in {{HDFSCredentialProvider}}, the code logic assumes that an HDFS 
> delegation token exists. This is fine for an HDFS environment, but in some cloud 
> environments such as Azure, HDFS is not required, so it throws an exception:
> {code}
> java.util.NoSuchElementException: head of empty list
> at scala.collection.immutable.Nil$.head(List.scala:337)
> at scala.collection.immutable.Nil$.head(List.scala:334)
> at 
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
> {code}
> We should also consider this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment

2016-12-12 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-18840:
---

 Summary: HDFSCredentialProvider throws exception in non-HDFS 
security environment
 Key: SPARK-18840
 URL: https://issues.apache.org/jira/browse/SPARK-18840
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.6.3, 2.1.0
Reporter: Saisai Shao


Currently in {{HDFSCredentialProvider}}, the code logic assumes that an HDFS 
delegation token exists. This is fine for an HDFS environment, but in some cloud 
environments such as Azure, HDFS is not required, so it throws an exception:

{code}
java.util.NoSuchElementException: head of empty list
at scala.collection.immutable.Nil$.head(List.scala:337)
at scala.collection.immutable.Nil$.head(List.scala:334)
at 
org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
{code}

We should also consider this situation.
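One defensive pattern for this case (a sketch with hypothetical names, not the actual Client.getTokenRenewalInterval code) is to use headOption so that the absence of an HDFS delegation token yields None instead of a "head of empty list" failure:

{code}
object TokenRenewal {
  // Hypothetical helper: `tokens` stands in for the list of HDFS delegation tokens,
  // which is empty when the cluster has no HDFS (e.g. on Azure).
  def renewalInterval[T](tokens: Seq[T])(intervalOf: T => Long): Option[Long] =
    tokens.headOption.map(intervalOf) // None instead of Nil.head's NoSuchElementException

  def main(args: Array[String]): Unit = {
    println(renewalInterval(Seq.empty[String])(_ => 86400000L)) // None, no exception
    println(renewalInterval(Seq("hdfs-token"))(_ => 86400000L)) // Some(86400000)
  }
}
{code}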



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18795) SparkR vignette update: ksTest

2016-12-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-18795:
--
Assignee: Miao Wang  (was: Xiangrui Meng)

> SparkR vignette update: ksTest
> --
>
> Key: SPARK-18795
> URL: https://issues.apache.org/jira/browse/SPARK-18795
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
>
> Update vignettes to cover ksTest



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18793) SparkR vignette update: random forest

2016-12-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-18793:
-

Assignee: Xiangrui Meng

> SparkR vignette update: random forest
> -
>
> Key: SPARK-18793
> URL: https://issues.apache.org/jira/browse/SPARK-18793
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update vignettes to cover randomForest



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18795) SparkR vignette update: ksTest

2016-12-12 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744359#comment-15744359
 ] 

Xiangrui Meng commented on SPARK-18795:
---

[~wangmiao1981] Any updates?

> SparkR vignette update: ksTest
> --
>
> Key: SPARK-18795
> URL: https://issues.apache.org/jira/browse/SPARK-18795
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
>
> Update vignettes to cover ksTest



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18794) SparkR vignette update: gbt

2016-12-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-18794:
-

Assignee: Xiangrui Meng

> SparkR vignette update: gbt
> ---
>
> Key: SPARK-18794
> URL: https://issues.apache.org/jira/browse/SPARK-18794
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update vignettes to cover gradient boosted trees



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18795) SparkR vignette update: ksTest

2016-12-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-18795:
-

Assignee: Xiangrui Meng

> SparkR vignette update: ksTest
> --
>
> Key: SPARK-18795
> URL: https://issues.apache.org/jira/browse/SPARK-18795
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>
> Update vignettes to cover ksTest



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18797) Update spark.logit in sparkr-vignettes

2016-12-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-18797.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16222
[https://github.com/apache/spark/pull/16222]

> Update spark.logit in sparkr-vignettes
> --
>
> Key: SPARK-18797
> URL: https://issues.apache.org/jira/browse/SPARK-18797
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Miao Wang
> Fix For: 2.1.1, 2.2.0
>
>
> spark.logit was added in 2.1. We need to update sparkr-vignettes to reflect the 
> changes. This is part of the SparkR QA work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18839) Executor is active on web, but actually is dead

2016-12-12 Thread meiyoula (JIRA)
meiyoula created SPARK-18839:


 Summary: Executor is active on web, but actually is dead
 Key: SPARK-18839
 URL: https://issues.apache.org/jira/browse/SPARK-18839
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula
Priority: Minor


When a container is preempted, the AM finds it completed and the driver removes its 
block manager. But the executor actually dies a few seconds later; during this 
period it updates blocks and re-registers the block manager, so the executors 
page shows the executor as active.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18796) StreamingQueryManager should not hold a lock when starting a query

2016-12-12 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-18796.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16220
[https://github.com/apache/spark/pull/16220]

> StreamingQueryManager should not hold a lock when starting a query
> --
>
> Key: SPARK-18796
> URL: https://issues.apache.org/jira/browse/SPARK-18796
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
> Fix For: 2.1.0
>
>
> Otherwise, the user cannot start any queries when a query is starting. If a 
> query takes a long time to start, the user experience will be pretty bad.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18804) Join doesn't work in Spark on Bigger tables

2016-12-12 Thread Gopal Nagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744246#comment-15744246
 ] 

Gopal Nagar edited comment on SPARK-18804 at 12/13/16 6:00 AM:
---

Apologies for marking this JIRA as a bug. This may not be a bug in Spark, but I 
wanted to get some input on how to make the join effective, because in my case the 
job fails despite having enough resources.

Any advice on how to track down and debug the issue would help. 


was (Author: gopalnaga...@gmail.com):
Apologies for marking this JIRA as bug. This may not be a bug in Spark. But i 
wanted to get some input on How to make effective join ? Bcoz in my case, job 
fails despite having enough resources.


> Join doesn't work in Spark on Bigger tables
> ---
>
> Key: SPARK-18804
> URL: https://issues.apache.org/jira/browse/SPARK-18804
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: Gopal Nagar
>
> Hi All,
> Spark 1.6.1 has been installed on a 3-node AWS EMR cluster with 32 GB RAM 
> and 80 GB storage per node. I am trying to join two tables (1.2 GB & 900 MB) 
> with 4607818 & 14273378 rows respectively. It is running in client mode on the 
> YARN cluster manager.
> If I put a limit of 100 in the select query it works fine. But if I try to join 
> the entire data set, the query runs for 3-4 hours and finally gets terminated. I 
> can always see 18 GB free on each node.
> I have tried increasing the number of executors/cores/partitions, but it still 
> doesn't work. This has been tried in PySpark and submitted using spark-submit, 
> but it doesn't run. Please advise.
> Join Query 
> --
> select * FROM table1 as t1 join table2 as t2 on t1.col = t2.col limit 100;
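As a starting point for the requested debugging, a sketch against the Spark 1.6 Scala API (table names, column names, and the partition count are placeholders, not a known fix): inspecting the physical join plan and raising the shuffle partition count are usually the first steps for a large shuffle join.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object JoinDebug {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-debug"))
    val sqlContext = new HiveContext(sc)

    // More shuffle partitions break the join into smaller tasks (default is 200).
    sqlContext.setConf("spark.sql.shuffle.partitions", "1000")

    val joined = sqlContext.sql(
      "SELECT * FROM table1 t1 JOIN table2 t2 ON t1.col = t2.col") // placeholder tables

    // explain(true) shows the chosen join strategy (e.g. SortMergeJoin); combine this
    // with the Spark UI to spot straggler tasks that point at skewed join keys.
    joined.explain(true)
  }
}
{code}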



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744271#comment-15744271
 ] 

Apache Spark commented on SPARK-18281:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/16263

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>
> I run the example straight out of the api docs for toLocalIterator and it 
> gives a time out exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18281:


Assignee: (was: Apache Spark)

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>
> I run the example straight out of the api docs for toLocalIterator and it 
> gives a time out exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18281:


Assignee: Apache Spark

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>Assignee: Apache Spark
>
> I run the example straight out of the api docs for toLocalIterator and it 
> gives a time out exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-18804) Join doesn't work in Spark on Bigger tables

2016-12-12 Thread Gopal Nagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal Nagar reopened SPARK-18804:
-

Apologies for marking this JIRA as a bug. This may not be a bug in Spark, but I 
wanted to get some input on how to make the join effective, because in my case the 
job fails despite having enough resources.


> Join doesn't work in Spark on Bigger tables
> ---
>
> Key: SPARK-18804
> URL: https://issues.apache.org/jira/browse/SPARK-18804
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: Gopal Nagar
>
> Hi All,
> Spark 1.6.1 has been installed on a 3-node AWS EMR cluster with 32 GB RAM 
> and 80 GB storage per node. I am trying to join two tables (1.2 GB & 900 MB) 
> with 4607818 & 14273378 rows respectively. It is running in client mode on the 
> YARN cluster manager.
> If I put a limit of 100 in the select query it works fine. But if I try to join 
> the entire data set, the query runs for 3-4 hours and finally gets terminated. I 
> can always see 18 GB free on each node.
> I have tried increasing the number of executors/cores/partitions, but it still 
> doesn't work. This has been tried in PySpark and submitted using spark-submit, 
> but it doesn't run. Please advise.
> Join Query 
> --
> select * FROM table1 as t1 join table2 as t2 on t1.col = t2.col limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18804) Join doesn't work in Spark on Bigger tables

2016-12-12 Thread Gopal Nagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal Nagar updated SPARK-18804:

Issue Type: Question  (was: Brainstorming)

> Join doesn't work in Spark on Bigger tables
> ---
>
> Key: SPARK-18804
> URL: https://issues.apache.org/jira/browse/SPARK-18804
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: Gopal Nagar
>
> Hi All,
> Spark 1.6.1 has been installed on a 3-node AWS EMR cluster with 32 GB RAM 
> and 80 GB storage per node. I am trying to join two tables (1.2 GB & 900 MB) 
> with 4607818 & 14273378 rows respectively. It is running in client mode on the 
> YARN cluster manager.
> If I put a limit of 100 in the select query it works fine. But if I try to join 
> the entire data set, the query runs for 3-4 hours and finally gets terminated. I 
> can always see 18 GB free on each node.
> I have tried increasing the number of executors/cores/partitions, but it still 
> doesn't work. This has been tried in PySpark and submitted using spark-submit, 
> but it doesn't run. Please advise.
> Join Query 
> --
> select * FROM table1 as t1 join table2 as t2 on t1.col = t2.col limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18804) Join doesn't work in Spark on Bigger tables

2016-12-12 Thread Gopal Nagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal Nagar updated SPARK-18804:

Issue Type: Brainstorming  (was: Bug)

> Join doesn't work in Spark on Bigger tables
> ---
>
> Key: SPARK-18804
> URL: https://issues.apache.org/jira/browse/SPARK-18804
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: Gopal Nagar
>
> Hi All,
> Spark 1.6.1 has been installed on a 3-node AWS EMR cluster with 32 GB RAM 
> and 80 GB storage per node. I am trying to join two tables (1.2 GB & 900 MB) 
> with 4607818 & 14273378 rows respectively. It is running in client mode on the 
> YARN cluster manager.
> If I put a limit of 100 in the select query it works fine. But if I try to join 
> the entire data set, the query runs for 3-4 hours and finally gets terminated. I 
> can always see 18 GB free on each node.
> I have tried increasing the number of executors/cores/partitions, but it still 
> doesn't work. This has been tried in PySpark and submitted using spark-submit, 
> but it doesn't run. Please advise.
> Join Query 
> --
> select * FROM table1 as t1 join table2 as t2 on t1.col = t2.col limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2016-12-12 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744086#comment-15744086
 ] 

Sital Kedia commented on SPARK-18838:
-

[~rxin], [~zsxwing] - Any thoughts on this? 

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delay in the 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, such as `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on `ListenerBus` events, and this delay is 
> causing job failures. For example, a significant delay in receiving 
> `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to 
> remove an executor which is not idle.  The event processor in `ListenerBus` 
> is a single thread which loops through all the listeners for each event and 
> processes each event synchronously: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94
>  
> The single-threaded processor often becomes the bottleneck for large jobs.  
> In addition, if one of the listeners is very slow, all the listeners 
> will pay the price of the delay incurred by the slow listener. 
> To solve the above problems, we plan to have a single-threaded executor 
> service and a separate event queue per listener. That way we are not 
> bottlenecked by the single-threaded processor, and critical listeners will not 
> be penalized by slow listeners. The downside of this approach is that a 
> separate event queue per listener will increase the driver's memory footprint. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18838) High latency of event processing for large jobs

2016-12-12 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-18838:

Description: 
Currently we are observing very high event processing delay in the driver's 
`ListenerBus` for large jobs with many tasks. Many critical components of the 
scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend on 
`ListenerBus` events, and this delay is causing job failures. For example, a 
significant delay in receiving `SparkListenerTaskStart` might cause 
`ExecutorAllocationManager` to remove an executor which is not idle.  The event 
processor in `ListenerBus` is a single thread which loops through all the listeners 
for each event and processes each event synchronously: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94
 
The single-threaded processor often becomes the bottleneck for large jobs.  In 
addition, if one of the listeners is very slow, all the listeners will pay the 
price of the delay incurred by the slow listener. 

To solve the above problems, we plan to have a single-threaded executor service 
and a separate event queue per listener. That way we are not bottlenecked by the 
single-threaded processor, and critical listeners will not be penalized by 
slow listeners. The downside of this approach is that a separate event queue per 
listener will increase the driver's memory footprint. 




  was:
Currently we are observing the issue of very high event processing delay in 
driver's `ListenerBus` for large jobs with many tasks. Many critical component 
of the scheduler like `ExecutorAllocationManager`, `HeartbeatReceiver` depend 
on the `ListenerBus` events and these delay is causing job failure. For 
example, a significant delay in receiving the `SparkListenerTaskStart` might 
cause `ExecutorAllocationManager` manager to remove an executor which is not 
idle.  The event processor in `ListenerBus` is a single thread which loops 
through all the Listeners for each event and processes each event 
synchronously. The single threaded processor often becomes the bottleneck for 
large jobs.  In addition to that, if one of the Listener is very slow, all the 
listeners will pay the price of delay incurred by the slow listener. 

To solve the above problems, we plan to have a single threaded executor service 
and separate event queue per listener. That way we are not bottlenecked by the 
single threaded processor and also critical listeners will not be penalized by 
the slow listeners.





> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delay in the 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, such as `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on `ListenerBus` events, and this delay is 
> causing job failures. For example, a significant delay in receiving 
> `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to 
> remove an executor which is not idle.  The event processor in `ListenerBus` 
> is a single thread which loops through all the listeners for each event and 
> processes each event synchronously: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94
>  
> The single-threaded processor often becomes the bottleneck for large jobs.  
> In addition, if one of the listeners is very slow, all the listeners 
> will pay the price of the delay incurred by the slow listener. 
> To solve the above problems, we plan to have a single-threaded executor 
> service and a separate event queue per listener. That way we are not 
> bottlenecked by the single-threaded processor, and critical listeners will not 
> be penalized by slow listeners. The downside of this approach is that a 
> separate event queue per listener will increase the driver's memory footprint. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18838) High latency of event processing for large jobs

2016-12-12 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-18838:
---

 Summary: High latency of event processing for large jobs
 Key: SPARK-18838
 URL: https://issues.apache.org/jira/browse/SPARK-18838
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Sital Kedia


Currently we are observing very high event processing delay in the driver's 
`ListenerBus` for large jobs with many tasks. Many critical components of the 
scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend on 
`ListenerBus` events, and this delay is causing job failures. For example, a 
significant delay in receiving `SparkListenerTaskStart` might cause 
`ExecutorAllocationManager` to remove an executor which is not idle.  The event 
processor in `ListenerBus` is a single thread which loops through all the listeners 
for each event and processes each event synchronously. The single-threaded 
processor often becomes the bottleneck for large jobs.  In addition, if one of the 
listeners is very slow, all the listeners will pay the price of the delay incurred 
by the slow listener. 

To solve the above problems, we plan to have a single-threaded executor service 
and a separate event queue per listener. That way we are not bottlenecked by the 
single-threaded processor, and critical listeners will not be penalized by 
slow listeners.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-12 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744047#comment-15744047
 ] 

Liang-Chi Hsieh commented on SPARK-18281:
-

[~mwdus...@us.ibm.com] I can reproduce your issue, and I already have a fix. If 
you are not working on this, I will submit a PR for it.

BTW, I can't exactly reproduce the issue reported by [~lminer]:

{code}
from pyspark import SparkContext
sc = SparkContext()
rdd = sc.parallelize(range(10))
[x for x in rdd.toLocalIterator()]
{code}

But the following one will fail:
{code}
from pyspark import SparkContext
sc = SparkContext()
rdd = sc.parallelize(range(10))
it = rdd.toLocalIterator()
next(it)
{code}

They are caused by the same bug, so I'll fix them together.


> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>
> I run the example straight out of the api docs for toLocalIterator and it 
> gives a time out exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadatafalse
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0

2016-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15744035#comment-15744035
 ] 

Apache Spark commented on SPARK-17932:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/16262

> Failed to run SQL "show table extended  like table_name"  in Spark2.0.0
> ---
>
> Key: SPARK-17932
> URL: https://issues.apache.org/jira/browse/SPARK-17932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: pin_zhang
>Assignee: Jiang Xingbo
> Fix For: 2.2.0
>
>
> The SQL statement "show table extended like table_name" works in Spark 1.5.2 
> but fails in Spark 2.0.0:
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> missing 'FUNCTIONS' at 'extended'(line 1, pos 11)
> == SQL ==
> show table extended  like test
> ---^^^ (state=,code=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18837) It will not be hidden if job or stage description is too long

2016-12-12 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-18837:

Description: 
*previous*:

!ui-2.0.0.gif!



*current*:

!ui-2.1.0.gif!

  was:!attached-image.gif!


> It will not be hidden if job or stage description is too long
> ---
>
> Key: SPARK-18837
> URL: https://issues.apache.org/jira/browse/SPARK-18837
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
> Attachments: ui-2.0.0.gif, ui-2.1.0.gif
>
>
> *previous*:
> !ui-2.0.0.gif!
> *current*:
> !ui-2.1.0.gif!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18837) It will not be hidden if job or stage description is too long

2016-12-12 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-18837:

Description: !attached-image.gif!

> It will not be hidden if job or stage description is too long
> ---
>
> Key: SPARK-18837
> URL: https://issues.apache.org/jira/browse/SPARK-18837
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
> Attachments: ui-2.0.0.gif, ui-2.1.0.gif
>
>
> !attached-image.gif!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18837) It will not be hidden if job or stage description is too long

2016-12-12 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-18837:

Attachment: ui-2.0.0.gif
ui-2.1.0.gif

> It will not be hidden if job or stage description is too long
> ---
>
> Key: SPARK-18837
> URL: https://issues.apache.org/jira/browse/SPARK-18837
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
> Attachments: ui-2.0.0.gif, ui-2.1.0.gif
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18837) It will not be hidden if job or stage description is too long

2016-12-12 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-18837:
---

 Summary: It will not be hidden if job or stage description is too long
 Key: SPARK-18837
 URL: https://issues.apache.org/jira/browse/SPARK-18837
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2016-12-12 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-18113:
---
Description: 
Executor sends *AskPermissionToCommitOutput* to driver failed, and retry 
another sending. Driver receives 2 AskPermissionToCommitOutput messages and 
handles them. But executor ignores the first response(true) and receives the 
second response(false). The TaskAttemptNumber for this partition in 
authorizedCommittersByStage is locked forever. Driver enters into infinite loop.
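
A minimal sketch of the locking behaviour only (made-up names, not Spark's 
actual OutputCommitCoordinator): once a partition has a recorded committer 
attempt, every other attempt is denied, so if the recorded attempt never 
actually commits, the partition can never be committed and the stage retries 
forever.

{code}
import scala.collection.mutable

// Illustration only: the first attempt that asks becomes the authorized
// committer for its partition; all other attempts are denied from then on.
object CommitCoordinatorSketch {
  private val authorizedAttempt = mutable.Map[Int, Int]()   // partition -> attempt

  def canCommit(partition: Int, attempt: Int): Boolean = synchronized {
    authorizedAttempt.get(partition) match {
      case None         => authorizedAttempt(partition) = attempt; true
      case Some(winner) => winner == attempt
    }
  }
}

// If the winning attempt never learns it won (its reply was lost), later
// attempts are denied forever:
CommitCoordinatorSketch.canCommit(partition = 24, attempt = 0)   // true, but reply lost
CommitCoordinatorSketch.canCommit(partition = 24, attempt = 1)   // false -> TaskCommitDenied
{code}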

h4. Driver Log:

{noformat}
16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
partition: 24, attemptNumber: 0
...
16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
stage: 2, partition: 24, attempt: 0
...
16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
partition: 24, attemptNumber: 1
16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
stage: 2, partition: 24, attempt: 1
...
16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 (TID 
28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
...
{noformat}

h4. Executor Log:

{noformat}
...
16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
...
16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
AskPermissionToCommitOutput(2,24,0)] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.rpc.askTimeout
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
at 
org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
at 
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
at 
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
... 13 more
...
16/10/25 05:39:16 INFO Executor: Running task 24.1 in stage 2.0 (TID 119)
...
16/10/25 05:39:24 INFO SparkHadoopMapRedUtil: 
attempt_201610250536_0002_m_24_119: Not committed because the driver did 
not authorize commit
...
{noformat}


  was:
The executor's *AskPermissionToCommitOutput* message to the driver fails, so the 
executor retries the send. The driver receives two AskPermissionToCommitOutput 
messages and handles both, but the executor ignores the first response (true) 
and only sees the second response (false). The TaskAttemptNumber for this 
partition in authorizedCommittersByStage is then locked forever, and the driver 
enters an infinite loop of retrying the task.

h4. Driver Log:
16/10/25 05:38:28 INFO TaskSetManager: Starting task 24

[jira] [Updated] (SPARK-17664) Failed to saveAsHadoop when speculate is enabled

2016-12-12 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-17664:
---
Description: 
From the following logs, task 22 failed 4 times because "the driver did not 
authorize commit". But the strange thing was that I couldn't find task 22.1. 
Why? Maybe some synchronization error?

{noformat}
16/09/26 02:14:18 INFO TaskSetManager: Lost task 22.0 in stage 1856.0 (TID 
953902) on executor 10.196.131.13: java.security.PrivilegedActionException 
(null) [duplicate 4]
16/09/26 02:14:18 INFO TaskSetManager: Marking task 22 in stage 1856.0 (on 
10.196.131.13) as speculatable because it ran more than 5601 ms
16/09/26 02:14:18 INFO TaskSetManager: Starting task 22.2 in stage 1856.0 (TID 
954074, 10.215.143.14, partition 22,PROCESS_LOCAL, 2163 bytes)
16/09/26 02:14:18 INFO TaskSetManager: Lost task 22.2 in stage 1856.0 (TID 
954074) on executor 10.215.143.14: java.security.PrivilegedActionException 
(null) [duplicate 5]
16/09/26 02:14:18 INFO TaskSetManager: Marking task 22 in stage 1856.0 (on 
10.196.131.13) as speculatable because it ran more than 5601 ms
16/09/26 02:14:18 INFO TaskSetManager: Starting task 22.3 in stage 1856.0 (TID 
954075, 10.196.131.28, partition 22,PROCESS_LOCAL, 2163 bytes)
16/09/26 02:14:19 INFO TaskSetManager: Lost task 22.3 in stage 1856.0 (TID 
954075) on executor 10.196.131.28: java.security.PrivilegedActionException 
(null) [duplicate 6]
16/09/26 02:14:19 INFO TaskSetManager: Marking task 22 in stage 1856.0 (on 
10.196.131.13) as speculatable because it ran more than 5601 ms
16/09/26 02:14:19 INFO TaskSetManager: Starting task 22.4 in stage 1856.0 (TID 
954076, 10.215.153.225, partition 22,PROCESS_LOCAL, 2163 bytes)
16/09/26 02:14:19 INFO TaskSetManager: Lost task 22.4 in stage 1856.0 (TID 
954076) on executor 10.215.153.225: java.security.PrivilegedActionException 
(null) [duplicate 7]
16/09/26 02:14:19 ERROR TaskSetManager: Task 22 in stage 1856.0 failed 4 times; 
aborting job
16/09/26 02:14:19 INFO YarnClusterScheduler: Cancelling stage 1856
16/09/26 02:14:19 INFO YarnClusterScheduler: Stage 1856 was cancelled
16/09/26 02:14:19 INFO DAGScheduler: ResultStage 1856 (saveAsHadoopFile at 
TDWProvider.scala:514) failed in 23.049 s
16/09/26 02:14:19 INFO DAGScheduler: Job 76 failed: saveAsHadoopFile at 
TDWProvider.scala:514, took 69.865181 s
16/09/26 02:14:19 ERROR ApplicationMaster: User class threw exception: 
java.security.PrivilegedActionException: org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 22 in stage 1856.0 failed 4 times, most 
recent failure: Lost task 22.4 in stage 1856.0 (TID 954076, 10.215.153.225): 
java.security.PrivilegedActionException: 
org.apache.spark.executor.CommitDeniedException: 
attempt_201609260213_1856_m_22_954076: Not committed because the driver did 
not authorize commit
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:356)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1723)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1284)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.spark.executor.CommitDeniedException: 
attempt_201609260213_1856_m_22_954076: Not committed because the driver did 
not authorize commit
at 
org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:135)
at 
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:142)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anon$4.run(PairRDDFunctions.scala:1311)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anon$4.run(PairRDDFunctions.scala:1284)
... 11 more
{noformat}

  was:
From the following logs, task 22 failed 4 times because "the driver did not 
authorize commit". But the strange thing was that I couldn't find task 22.1. 
Why? Maybe some synchronization error?


16/09/26 02:14:18 INFO TaskSetManager: Lost task 22.0 in stage 1856.0 (TID 
953902) on executor 10.196.131.13: java.security.PrivilegedActionException 
(null) [duplicate 4]
16/09/26 02:14:18 INFO TaskSetMana

[jira] [Updated] (SPARK-18834) Expose event time time stats through StreamingQueryProgress

2016-12-12 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18834:
--
Summary: Expose event time time stats through StreamingQueryProgress  (was: 
Expose event time and processing time stats through StreamingQueryProgress)

> Expose event time time stats through StreamingQueryProgress
> ---
>
> Key: SPARK-18834
> URL: https://issues.apache.org/jira/browse/SPARK-18834
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18816:
-
Assignee: Alex Bozarth

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Assignee: Alex Bozarth
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18023) Adam optimizer

2016-12-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18023:
--
Shepherd:   (was: Joseph K. Bradley)

> Adam optimizer
> --
>
> Key: SPARK-18023
> URL: https://issues.apache.org/jira/browse/SPARK-18023
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Vincent
>Priority: Minor
>
> SGD methods can converge very slowly, or diverge, if their learning rate alpha 
> is set inappropriately. Many alternative methods have been proposed to produce 
> desirable convergence with less dependence on hyperparameter settings and to 
> help escape local optima, e.g. Momentum, NAG (Nesterov's Accelerated 
> Gradient), Adagrad, RMSProp, etc.
> Among these, Adam is one of the popular algorithms for first-order 
> gradient-based optimization of stochastic objective functions. It has been 
> shown to be well suited for problems with large data and/or many parameters, 
> and for problems with noisy and/or sparse gradients, and it is computationally 
> efficient. Refer to the Adam paper for details.
> In fact, TensorFlow has implemented most of the adaptive optimization methods 
> mentioned, and we have seen Adam outperform most SGD methods in certain cases, 
> such as a very sparse dataset in an FM model.
> It would be nice for Spark to have these adaptive optimization methods.
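
For context, the Adam update itself is small; a hedged sketch of the 
per-coordinate rule described in the Adam paper (illustration only, not MLlib 
code, and every name below is local to the example):

{code}
// Adam step for one gradient evaluation; t must start at 1 so the
// bias-correction terms below are well defined.
def adamStep(w: Array[Double], g: Array[Double],
             m: Array[Double], v: Array[Double], t: Int,
             alpha: Double = 0.001, beta1: Double = 0.9,
             beta2: Double = 0.999, eps: Double = 1e-8): Unit = {
  for (i <- w.indices) {
    m(i) = beta1 * m(i) + (1 - beta1) * g(i)           // biased first moment
    v(i) = beta2 * v(i) + (1 - beta2) * g(i) * g(i)    // biased second moment
    val mHat = m(i) / (1 - math.pow(beta1, t))         // bias-corrected moments
    val vHat = v(i) / (1 - math.pow(beta2, t))
    w(i) -= alpha * mHat / (math.sqrt(vHat) + eps)     // per-coordinate step size
  }
}
{code}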



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?

2016-12-12 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743648#comment-15743648
 ] 

Shivaram Venkataraman commented on SPARK-18823:
---

We don't support assigning to columns using `[` and `[[` -- the code is just 
not there, so this is more of a missing feature than a bug. We do support 
creating new columns with the `$` sign -- for example, df$eruptions_new <- 
df$eruptions + 10 -- but there is a limitation that the right-hand side has to 
be a Column, so `c(1:nrow(df))` will not work there either.

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
> Fix For: 2.0.2
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or if it can be done with some function:
> Sometimes it is very important to assign something to a column whose name has 
> to be accessed through a variable. Normally, I have always used double 
> brackets like this, outside of SparkR, without problems:
> # df could be the faithful dataset as a normal data frame or data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even by column number
> df[[2]] <- df$eruptions
> The error is not caused by the right side of the "<-" assignment operator. The 
> problem is that I can't assign to a column by name using a variable, or by 
> column number, as I do in these examples outside of Spark. It doesn't matter 
> whether I am modifying or creating a column; same problem.
> I have also tried this, with no results:
> val df2 = withColumn(df, "tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18773) Make translation of Spark configs to commons-crypto configs consistent

2016-12-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18773:
-
Assignee: Marcelo Vanzin

> Make translation of Spark configs to commons-crypto configs consistent
> --
>
> Key: SPARK-18773
> URL: https://issues.apache.org/jira/browse/SPARK-18773
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.2.0
>
>
> Recent changes to introduce AES encryption to the network layer added some 
> duplication to the code that translates between Spark configuration and 
> commons-crypto configuration.
> Moreover, the duplication is not consistent: the code in the network-common 
> module does not translate all configs.
> We should centralize that code and make all the code paths that use AES 
> encryption support the same options.
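
A hedged sketch of the kind of centralization being proposed; the config key 
prefixes below are assumptions made for illustration, not necessarily the exact 
keys Spark uses:

{code}
// Illustrative only: translate Spark-namespaced crypto options into a single
// java.util.Properties object that commons-crypto understands, in one place,
// so every code path that enables AES encryption sees the same options.
def toCommonsCryptoConf(sparkConf: Map[String, String],
                        sparkPrefix: String = "spark.network.crypto.config.",
                        cryptoPrefix: String = "commons.crypto."): java.util.Properties = {
  val props = new java.util.Properties()
  sparkConf.foreach { case (k, v) =>
    if (k.startsWith(sparkPrefix)) {
      props.setProperty(cryptoPrefix + k.stripPrefix(sparkPrefix), v)
    }
  }
  props
}

// Example: every module would call the same helper instead of re-implementing
// its own (possibly partial) translation.
val props = toCommonsCryptoConf(Map(
  "spark.network.crypto.config.cipher.transformation" -> "AES/CTR/NoPadding"))
{code}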



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18773) Make translation of Spark configs to commons-crypto configs consistent

2016-12-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18773.
--
Resolution: Fixed

> Make translation of Spark configs to commons-crypto configs consistent
> --
>
> Key: SPARK-18773
> URL: https://issues.apache.org/jira/browse/SPARK-18773
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.2.0
>
>
> Recent changes to introduce AES encryption to the network layer added some 
> duplication to the code that translates between Spark configuration and 
> commons-crypto configuration.
> Moreover, the duplication is not consistent: the code in the network-common 
> module does not translate all configs.
> We should centralize that code and make all the code paths that use AES 
> encryption support the same options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18773) Make translation of Spark configs to commons-crypto configs consistent

2016-12-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18773:
-
Fix Version/s: 2.2.0

> Make translation of Spark configs to commons-crypto configs consistent
> --
>
> Key: SPARK-18773
> URL: https://issues.apache.org/jira/browse/SPARK-18773
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.2.0
>
>
> Recent changes to introduce AES encryption to the network layer added some 
> duplication to the code that translates between Spark configuration and 
> commons-crypto configuration.
> Moreover, the duplication is not consistent: the code in the network-common 
> module does not translate all configs.
> We should centralize that code and make all the code paths that use AES 
> encryption support the same options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18836) Serialize Task Metrics once per stage

2016-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18836:


Assignee: Apache Spark

> Serialize Task Metrics once per stage
> -
>
> Key: SPARK-18836
> URL: https://issues.apache.org/jira/browse/SPARK-18836
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> Right now we serialize the empty task metrics once per task -- Since this is 
> shared across all tasks we could use the same serialized task metrics across 
> all tasks of a stage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18836) Serialize Task Metrics once per stage

2016-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743623#comment-15743623
 ] 

Apache Spark commented on SPARK-18836:
--

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/16261

> Serialize Task Metrics once per stage
> -
>
> Key: SPARK-18836
> URL: https://issues.apache.org/jira/browse/SPARK-18836
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Shivaram Venkataraman
>
> Right now we serialize the empty task metrics once per task -- Since this is 
> shared across all tasks we could use the same serialized task metrics across 
> all tasks of a stage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18836) Serialize Task Metrics once per stage

2016-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18836:


Assignee: (was: Apache Spark)

> Serialize Task Metrics once per stage
> -
>
> Key: SPARK-18836
> URL: https://issues.apache.org/jira/browse/SPARK-18836
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Shivaram Venkataraman
>
> Right now we serialize the empty task metrics once per task -- Since this is 
> shared across all tasks we could use the same serialized task metrics across 
> all tasks of a stage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18836) Serialize Task Metrics once per stage

2016-12-12 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-18836:
-

 Summary: Serialize Task Metrics once per stage
 Key: SPARK-18836
 URL: https://issues.apache.org/jira/browse/SPARK-18836
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Shivaram Venkataraman


Right now we serialize the empty task metrics once per task. Since this is 
shared across all tasks, we could use the same serialized task metrics across 
all tasks of a stage.
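
A minimal sketch of the idea, with made-up names and no reference to Spark's 
actual scheduler classes: serialize the shared metrics object once when the 
stage is built and hand the same byte array to every task.

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Stand-in for the empty TaskMetrics object shared by all tasks of a stage.
case class TaskMetricsStub(bytesRead: Long = 0L, bytesWritten: Long = 0L)

def serializeOnce(metrics: TaskMetricsStub): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(metrics)   // pay the serialization cost exactly once
  oos.close()
  bos.toByteArray
}

// Once per stage, not once per task:
val sharedMetricsBytes = serializeOnce(TaskMetricsStub())
// Every task description reuses the same bytes:
val taskDescriptions = (0 until 1000).map(taskId => (taskId, sharedMetricsBytes))
{code}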



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18835) Do not expose shaded types in JavaTypeInference API

2016-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18835:


Assignee: (was: Apache Spark)

> Do not expose shaded types in JavaTypeInference API
> ---
>
> Key: SPARK-18835
> URL: https://issues.apache.org/jira/browse/SPARK-18835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Currently, {{inferDataType(TypeToken)}} is called from a different maven 
> module, and because we shade Guava, that sometimes leads to errors (e.g. when 
> running tests using maven):
> {noformat}
> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.084 sec  
> <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
> at 
> test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
> Results :
> Tests in error: 
>   JavaUDFSuite.udf3Test:107 » NoSuchMethod 
> org.apache.spark.sql.catalyst.JavaTyp...
> {noformat}
> Instead, we shouldn't expose Guava types in these APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18835) Do not expose shaded types in JavaTypeInference API

2016-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18835:


Assignee: Apache Spark

> Do not expose shaded types in JavaTypeInference API
> ---
>
> Key: SPARK-18835
> URL: https://issues.apache.org/jira/browse/SPARK-18835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, {{inferDataType(TypeToken)}} is called from a different maven 
> module, and because we shade Guava, that sometimes leads to errors (e.g. when 
> running tests using maven):
> {noformat}
> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.084 sec  
> <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
> at 
> test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
> Results :
> Tests in error: 
>   JavaUDFSuite.udf3Test:107 » NoSuchMethod 
> org.apache.spark.sql.catalyst.JavaTyp...
> {noformat}
> Instead, we shouldn't expose Guava types in these APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18835) Do not expose shaded types in JavaTypeInference API

2016-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743565#comment-15743565
 ] 

Apache Spark commented on SPARK-18835:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16260

> Do not expose shaded types in JavaTypeInference API
> ---
>
> Key: SPARK-18835
> URL: https://issues.apache.org/jira/browse/SPARK-18835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Currently, {{inferDataType(TypeToken)}} is called from a different maven 
> module, and because we shade Guava, that sometimes leads to errors (e.g. when 
> running tests using maven):
> {noformat}
> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.084 sec  
> <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
> at 
> test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
> Results :
> Tests in error: 
>   JavaUDFSuite.udf3Test:107 » NoSuchMethod 
> org.apache.spark.sql.catalyst.JavaTyp...
> {noformat}
> Instead, we shouldn't expose Guava types in these APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18835) Do not expose shaded types in JavaTypeInference API

2016-12-12 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-18835:
--

 Summary: Do not expose shaded types in JavaTypeInference API
 Key: SPARK-18835
 URL: https://issues.apache.org/jira/browse/SPARK-18835
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Marcelo Vanzin
Priority: Minor


Currently, {{inferDataType(TypeToken)}} is called from a different maven 
module, and because we shade Guava, that sometimes leads to errors (e.g. when 
running tests using maven):

{noformat}
udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.084 sec  <<< 
ERROR!
java.lang.NoSuchMethodError: 
org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
at 
test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)


Results :

Tests in error: 
  JavaUDFSuite.udf3Test:107 » NoSuchMethod 
org.apache.spark.sql.catalyst.JavaTyp...
{noformat}

Instead, we shouldn't expose Guava types in these APIs.
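
A hedged sketch of one way to avoid the problem, assuming Guava is on the 
classpath; the names below are illustrative and are not the actual Spark API. 
The idea is to keep TypeToken strictly internal and expose only plain 
java.lang.Class (or java.lang.reflect.Type) in the public signature, so callers 
in other modules never reference the (possibly shaded) Guava type.

{code}
object TypeInferenceExample {
  import com.google.common.reflect.TypeToken   // internal use only, never in signatures

  private def inferFrom(token: TypeToken[_]): String =
    token.getRawType.getSimpleName               // placeholder for real inference logic

  // Public entry point: no Guava types leak across module boundaries.
  def inferDataType(cls: Class[_]): String = inferFrom(TypeToken.of(cls))
}

// Callers in other modules only ever pass a plain Class:
val name = TypeInferenceExample.inferDataType(classOf[java.util.HashMap[_, _]])
{code}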



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18834) Expose event time and processing time stats through StreamingQueryProgress

2016-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18834:


Assignee: Apache Spark  (was: Tathagata Das)

> Expose event time and processing time stats through StreamingQueryProgress
> --
>
> Key: SPARK-18834
> URL: https://issues.apache.org/jira/browse/SPARK-18834
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18834) Expose event time and processing time stats through StreamingQueryProgress

2016-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18834:


Assignee: Tathagata Das  (was: Apache Spark)

> Expose event time and processing time stats through StreamingQueryProgress
> --
>
> Key: SPARK-18834
> URL: https://issues.apache.org/jira/browse/SPARK-18834
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18834) Expose event time and processing time stats through StreamingQueryProgress

2016-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743528#comment-15743528
 ] 

Apache Spark commented on SPARK-18834:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16258

> Expose event time and processing time stats through StreamingQueryProgress
> --
>
> Key: SPARK-18834
> URL: https://issues.apache.org/jira/browse/SPARK-18834
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-12 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743524#comment-15743524
 ] 

Mike Dusenberry commented on SPARK-18281:
-

I'm also seeing the same error with both Python 2.7 and Python 3.5 on Spark 
2.0.2 and the Git master when using {{rdd.toLocalIterator()}} or 
{{df.toLocalIterator()}} for a PySpark RDD and DataFrame, respectively.

On Spark 1.6.x, {{rdd.toLocalIterator()}} worked correctly.

Here's another example using DataFrames:
{code}
df = spark.createDataFrame([[1],[2],[3]])
it = df.toLocalIterator()   # should timeout here with an 
"java.net.SocketTimeoutException: Accept timed out" error
row = next(it)   # throws an "Exception: could not open socket" error
{code}

Result:
{code}
ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at 
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at 
org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:697)
{code}

[~davies] I see that [SPARK-14334 | 
https://issues.apache.org/jira/browse/SPARK-14334] re-engineered and expanded 
the {{toLocalIterator}} functionality to DataSets/DataFrames for both 
Scala/Java & Python.  Do you have any thoughts on the issue that is arising now?

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>
> I run the example straight out of the api docs for toLocalIterator and it 
> gives a time out exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadatafalse
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380

[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743513#comment-15743513
 ] 

Joseph K. Bradley commented on SPARK-18813:
---

Those are definitely some top items in my mind too.  Personally, I plan to 
focus on feature parity + Python parity, as well as ML persistence 
improvements, but please do add items which you're able to shepherd.

> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | 1 | next release | Blocker | *must* | *must* | *must* |
> | 2 | next release | Critical | *must* | yes, unless small | *best effort* |
> | 3 | next release | Major | *must* | optional | *best effort* |
> | 4 | next release | Minor | optional | no | maybe |
> | 5 | next release | Trivial | optional | no | maybe |
> | 6 | (empty) | (any) | yes | no | maybe |
> | 7 | (empty) | (any) | no | no | maybe |
> The *Category* in the table above has the following meaning:
> 1. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.
> 2-3. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.  The issue may slip to the 
> next release if development is slower than expected.
> 4-5. A committer has promised interest in this issue.  Contributions *will* 
> receive attention.  The issue may slip to another release.
> 6. A committer has promised interest in this issue and should respond, but no 
> promises are made about priorities or releases.
> 7. This issue is open for discussion, but it needs a committer to promise 
> interest to proceed.
> h1. Instructions
> h2. For contributors
> Getting started
> * Please read http://spark.apache.org/contributing.html carefully. Code 
> style, documentation, and unit tests are important.
> * If you are a first-time contributor, please always start with a small 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a larger feature.
> Coordinating on JIRA
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start work. This is to avoid duplicate work. For small patches, you do 
> not need to get the JIRA assigned to you to begin work.
> * For medium/large features or features with dependencies, please get 
> assigned first before coding and keep the ETA updated on the JIRA. If there 
> is no activity on the JIRA page for a certain amount of time, the JIRA should 
> be released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Do not set these fields: Target Version, Fix Version, or Shepherd.  Only 
> Committers should set those.
> Writing and reviewing PRs
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * *Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.*
> h2. For Committers
> Adding to this roadmap
> * You can update the roadmap by (a) adding issues to this list and (b) 
> setting Target Versions.  Only Committers may make these changes.
> * *If you add an issue to this roadmap or set 

[jira] [Created] (SPARK-18834) Expose event time and processing time stats through StreamingQueryProgress

2016-12-12 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-18834:
-

 Summary: Expose event time and processing time stats through 
StreamingQueryProgress
 Key: SPARK-18834
 URL: https://issues.apache.org/jira/browse/SPARK-18834
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18834) Expose event time and processing time stats through StreamingQueryProgress

2016-12-12 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18834:
--
Issue Type: Improvement  (was: Bug)

> Expose event time and processing time stats through StreamingQueryProgress
> --
>
> Key: SPARK-18834
> URL: https://issues.apache.org/jira/browse/SPARK-18834
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user

2016-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743486#comment-15743486
 ] 

Apache Spark commented on SPARK-18752:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16257

> "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
> --
>
> Key: SPARK-18752
> URL: https://issues.apache.org/jira/browse/SPARK-18752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.2.0
>
>
> We ran into an issue with the HiveShim code that calls "loadTable" and 
> "loadPartition" while testing with some recent changes in upstream Hive.
> The semantics in Hive changed slightly, and if you provide the wrong value 
> for "isSrcLocal" you now can end up with an invalid table: the Hive code will 
> move the temp directory to the final destination instead of moving its 
> children.
> The problem in Spark is that HiveShim.scala tries to figure out the value of 
> "isSrcLocal" based on where the source and target directories are; that's not 
> correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA 
> LOCAL" would set it to "true"). So we need to propagate that information from 
> the user query down to HiveShim.
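
A minimal sketch of that direction, with hypothetical names (this is not the 
actual Spark patch): decide the flag from the parsed statement and carry it on 
the command object all the way down, instead of inferring it from paths later.

{code}
// Illustrative command object; the real plumbing lives in Spark's SQL layer.
case class LoadDataCommandSketch(sql: String, isSrcLocal: Boolean)

def fromSql(sql: String): LoadDataCommandSketch = {
  // "LOAD DATA LOCAL INPATH ..." means the source is on the client machine.
  val isLocal = sql.trim.toUpperCase.startsWith("LOAD DATA LOCAL INPATH")
  LoadDataCommandSketch(sql, isSrcLocal = isLocal)
}

val cmd = fromSql("LOAD DATA LOCAL INPATH '/tmp/data' INTO TABLE t")
// cmd.isSrcLocal == true; the shim would then be called with this user-derived
// value, e.g. loadTable(..., isSrcLocal = cmd.isSrcLocal)  (illustrative call)
{code}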



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18816:


Assignee: Apache Spark

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18833) Changing partition location using the 'ALTER TABLE .. SET LOCATION' command via beeline doesn't get reflected in Spark

2016-12-12 Thread Salil Surendran (JIRA)
Salil Surendran created SPARK-18833:
---

 Summary: Changing partition location using the 'ALTER TABLE .. SET 
LOCATION' command via beeline doesn't get reflected in Spark
 Key: SPARK-18833
 URL: https://issues.apache.org/jira/browse/SPARK-18833
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.2
Reporter: Salil Surendran


Change the partition location of a table via beeline using the 'ALTER TABLE .. 
SET LOCATION' command. spark-shell then doesn't find any of the table's data, 
even though the data can be read via beeline. To reproduce, do the following:

=== At Hive side: ===
hive> CREATE EXTERNAL TABLE testA (id STRING, name STRING) PARTITIONED BY (idP 
STRING) STORED AS PARQUET LOCATION '/user/root/A/' ;
hive> CREATE EXTERNAL TABLE testB (id STRING, name STRING) PARTITIONED BY (idP 
STRING) STORED AS PARQUET LOCATION '/user/root/B/' ;
hive> CREATE EXTERNAL TABLE testC (id STRING, name STRING) PARTITIONED BY (idP 
STRING) STORED AS PARQUET LOCATION '/user/root/C/' ;

hive> insert into table testA PARTITION (idP='1') values 
('1',"test"),('2',"test2");

hive> ALTER TABLE testB ADD IF NOT EXISTS PARTITION(idP='1');
hive> ALTER TABLE testB PARTITION (idP='1') SET LOCATION '/user/root/A/idp=1/';

hive> select * from testA;
OK
1 test 1
2 test2 1


hive> select * from testB;
OK
1 test 1
2 test2 1

Conclusion: on the Hive side, changing the partition location to the directory 
containing the Parquet files works.


=== At Spark side: ===
scala> import org.apache.spark.sql.hive.HiveContext
scala> val hiveContext = new HiveContext(sc)

scala> hiveContext.refreshTable("testB")

scala> hiveContext.sql("select * from testB").count
res2: Long = 0

scala> hiveContext.sql("ALTER TABLE testC ADD IF NOT EXISTS PARTITION(idP='1')")
res3: org.apache.spark.sql.DataFrame = [result: string]

scala> hiveContext.sql("ALTER TABLE testC PARTITION (idP='1') SET LOCATION 
'/user/root/A/idp=1/' ")
res4: org.apache.spark.sql.DataFrame = [result: string]

scala> hiveContext.sql("select * from testC").count
res6: Long = 0

scala> hiveContext.refreshTable("testC")

scala> hiveContext.sql("select * from testC").count
res8: Long = 0 
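
For completeness, a minimal sketch of the same check against the Spark 2.x SparkSession API instead of the deprecated HiveContext (table and path names are taken from the repro above; assumes a Hive-enabled spark-shell session named {{spark}}):

{code}
// Point a partition of testC at the directory holding testA's data, then
// refresh and re-read. Per the report, the count stays 0 instead of 2.
spark.sql("ALTER TABLE testC ADD IF NOT EXISTS PARTITION (idP='1')")
spark.sql("ALTER TABLE testC PARTITION (idP='1') SET LOCATION '/user/root/A/idp=1/'")
spark.catalog.refreshTable("testC")
spark.sql("SELECT * FROM testC").count()
{code}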



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743453#comment-15743453
 ] 

Apache Spark commented on SPARK-18816:
--

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/16256

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18816:


Assignee: (was: Apache Spark)

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-12 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743419#comment-15743419
 ] 

Alex Bozarth commented on SPARK-18816:
--

I have a fix; just running a few tests, then I'll open a PR.

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18810) SparkR install.spark does not work for RCs, snapshots

2016-12-12 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-18810.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16248
[https://github.com/apache/spark/pull/16248]

> SparkR install.spark does not work for RCs, snapshots
> -
>
> Key: SPARK-18810
> URL: https://issues.apache.org/jira/browse/SPARK-18810
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> We publish source archives of the SparkR package now in RCs and in nightly 
> snapshot builds. One of the problems that still remains is that 
> `install.spark` does not work for these as it looks for the final Spark 
> version to be present in the apache download mirrors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18681) Throw Filtering is supported only on partition keys of type string exception

2016-12-12 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18681.
---
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 2.1.1

> Throw Filtering is supported only on partition keys of type string exception
> 
>
> Key: SPARK-18681
> URL: https://issues.apache.org/jira/browse/SPARK-18681
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
> Fix For: 2.1.1
>
>
> Cloudera puts 
> {{/var/run/cloudera-scm-agent/process/15000-hive-HIVEMETASTORE/hive-site.xml}}
>  as the configuration file for the Hive Metastore Server, where 
> {{hive.metastore.try.direct.sql=false}}. But Spark reads the gateway 
> configuration file and gets the default value 
> {{hive.metastore.try.direct.sql=true}}. We should use the {{getMetaConf}} or 
> {{getMSC.getConfigValue}} method to obtain the original configuration from 
> the Hive Metastore Server.
> {noformat}
> spark-sql> CREATE TABLE test (value INT) PARTITIONED BY (part INT);
> Time taken: 0.221 seconds
> spark-sql> select * from test where part=1 limit 10;
> 16/12/02 08:33:45 ERROR thriftserver.SparkSQLDriver: Failed in [select * from 
> test where part=1 limit 10]
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:610)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:549)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:282)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:547)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:954)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:91)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:938)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:156)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2435)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:295)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:134)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$hiveResultString$4.apply(QueryExecution.scala:133)
>   at 
> org.apache.spark.sql.execution.SQLExecution

[jira] [Commented] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x

2016-12-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743362#comment-15743362
 ] 

Reynold Xin commented on SPARK-18676:
-

Can we just increase the size by 5X if it is a Parquet or ORC file?


> Spark 2.x query plan data size estimation can crash join queries versus 1.x
> ---
>
> Key: SPARK-18676
> URL: https://issues.apache.org/jira/browse/SPARK-18676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Michael Allman
>
> Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly 
> modified the way Spark SQL estimates the output data size of query plans. 
> I've found that, with the new table query partition pruning support in 
> 2.1, this has led in some cases to underestimation of join plan child size 
> statistics to a degree that makes executing such queries impossible without 
> disabling automatic broadcast conversion.
> In one case we debugged, the query planner had estimated the size of a join 
> child to be 3,854 bytes. In the execution of this child query, Spark reads 20 
> million rows in 1 GB of data from parquet files and shuffles 722.9 MB of 
> data, outputting 17 million rows. In planning the original join query, Spark 
> converts the child to a {{BroadcastExchange}}. This query execution fails 
> unless automatic broadcast conversion is disabled.
> This particular query is complex and very specific to our data and schema. I 
> have not yet developed a reproducible test case that can be shared. I realize 
> this ticket does not give the Spark team a lot to work with to reproduce and 
> test this issue, but I'm available to help. At the moment I can suggest 
> running a join where one side is an aggregation selecting a few fields over a 
> large table with a wide schema including many string columns.
> This issue exists in Spark 2.0, but we never encountered it because in that 
> version it only manifests itself for partitioned relations read from the 
> filesystem, and we rarely use this feature. We've encountered this issue in 
> 2.1 because 2.1 does partition pruning for metastore tables now.
> As a back stop, we've patched our branch of Spark 2.1 to revert the 
> reductions in default data type size for string, binary and user-defined 
> types. We also removed the override of the statistics method in {{UnaryNode}} 
> which reduces the output size of a plan based on the ratio of that plan's 
> output schema size versus its children's. We have not had this problem since.
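
For context, a minimal sketch of the mitigation mentioned above: disabling automatic broadcast conversion through the standard {{spark.sql.autoBroadcastJoinThreshold}} setting (a value of -1 turns broadcast joins off, trading performance for stability):

{code}
// Workaround sketch: force the underestimated child to be shuffled instead of broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
{code}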



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18746) Add implicit encoders for BigDecimal, timestamp and date

2016-12-12 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang updated SPARK-18746:
-
Description: 
Run the code below in spark-shell, there will be an error:
{code}
scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
:24: error: Unable to find encoder for type stored in a Dataset.  
Primitive types (Int, String, etc) and Product types (case classes) are 
supported by importing spark.implicits._  Support for serializing other types 
will be added in future releases.
   spark.createDataset(Seq(new java.math.BigDecimal(10)))
  ^

scala>
{code} 

In this PR, implicit encoders for java.math.BigDecimal, timestamp and date will 
be added.

  was:
Run the code below in spark-shell, there will be an error:
{code}
scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
:24: error: Unable to find encoder for type stored in a Dataset.  
Primitive types (Int, String, etc) and Product types (case classes) are 
supported by importing spark.implicits._  Support for serializing other types 
will be added in future releases.
   spark.createDataset(Seq(new java.math.BigDecimal(10)))
  ^

scala>
{code} 

To fix the error above, an implicit encoder for java.math.BigDecimal will be 
added in the PR. Also, 


> Add implicit encoders for BigDecimal, timestamp and date
> 
>
> Key: SPARK-18746
> URL: https://issues.apache.org/jira/browse/SPARK-18746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Weiqing Yang
>
> Run the code below in spark-shell, there will be an error:
> {code}
> scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
> :24: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing spark.implicits._  Support for serializing other types 
> will be added in future releases.
>spark.createDataset(Seq(new java.math.BigDecimal(10)))
>   ^
> scala>
> {code} 
> In this PR, implicit encoders for java.math.BigDecimal, timestamp and date 
> will be added.
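
Until such implicits exist, a workaround sketch is to pass an explicit encoder; {{Encoders.DECIMAL}}, {{Encoders.TIMESTAMP}} and {{Encoders.DATE}} are existing factories in {{org.apache.spark.sql.Encoders}}:

{code}
import org.apache.spark.sql.Encoders

// Supply the encoder explicitly instead of relying on spark.implicits._
val ds = spark.createDataset(Seq(new java.math.BigDecimal(10)))(Encoders.DECIMAL)
ds.show()
{code}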



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18746) Add implicit encoders for BigDecimal, timestamp and date

2016-12-12 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang updated SPARK-18746:
-
Description: 
Run the code below in spark-shell, there will be an error:
{code}
scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
:24: error: Unable to find encoder for type stored in a Dataset.  
Primitive types (Int, String, etc) and Product types (case classes) are 
supported by importing spark.implicits._  Support for serializing other types 
will be added in future releases.
   spark.createDataset(Seq(new java.math.BigDecimal(10)))
  ^

scala>
{code} 

In this PR, implicit encoders for BigDecimal, timestamp and date will be added.

  was:
Run the code below in spark-shell, there will be an error:
{code}
scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
:24: error: Unable to find encoder for type stored in a Dataset.  
Primitive types (Int, String, etc) and Product types (case classes) are 
supported by importing spark.implicits._  Support for serializing other types 
will be added in future releases.
   spark.createDataset(Seq(new java.math.BigDecimal(10)))
  ^

scala>
{code} 

In this pR, implicit encoders for java.math.BigDecimal will be added in the PR. 
Also, timestamp and date 


> Add implicit encoders for BigDecimal, timestamp and date
> 
>
> Key: SPARK-18746
> URL: https://issues.apache.org/jira/browse/SPARK-18746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Weiqing Yang
>
> Run the code below in spark-shell, there will be an error:
> {code}
> scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
> :24: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing spark.implicits._  Support for serializing other types 
> will be added in future releases.
>spark.createDataset(Seq(new java.math.BigDecimal(10)))
>   ^
> scala>
> {code} 
> In this PR, implicit encoders for BigDecimal, timestamp and date will be 
> added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18746) Add implicit encoders for BigDecimal, timestamp and date

2016-12-12 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang updated SPARK-18746:
-
Description: 
Run the code below in spark-shell, there will be an error:
{code}
scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
:24: error: Unable to find encoder for type stored in a Dataset.  
Primitive types (Int, String, etc) and Product types (case classes) are 
supported by importing spark.implicits._  Support for serializing other types 
will be added in future releases.
   spark.createDataset(Seq(new java.math.BigDecimal(10)))
  ^

scala>
{code} 

To fix the error above, an implicit encoder for java.math.BigDecimal will be 
added in the PR. Also, 

  was:
Run the code below in spark-shell, there will be an error:
{code}
scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
:24: error: Unable to find encoder for type stored in a Dataset.  
Primitive types (Int, String, etc) and Product types (case classes) are 
supported by importing spark.implicits._  Support for serializing other types 
will be added in future releases.
   spark.createDataset(Seq(new java.math.BigDecimal(10)))
  ^

scala>
{code} 

To fix the error above, {{newBigDecimalEncoder}} will be added in the PR.


> Add implicit encoders for BigDecimal, timestamp and date
> 
>
> Key: SPARK-18746
> URL: https://issues.apache.org/jira/browse/SPARK-18746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Weiqing Yang
>
> Run the code below in spark-shell, there will be an error:
> {code}
> scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
> :24: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing spark.implicits._  Support for serializing other types 
> will be added in future releases.
>spark.createDataset(Seq(new java.math.BigDecimal(10)))
>   ^
> scala>
> {code} 
> To fix the error above, an implicit encoder for java.math.BigDecimal will be 
> added in the PR. Also, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18746) Add implicit encoders for BigDecimal, timestamp and date

2016-12-12 Thread Weiqing Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiqing Yang updated SPARK-18746:
-
Summary: Add implicit encoders for BigDecimal, timestamp and date  (was: 
Add newBigDecimalEncoder)

> Add implicit encoders for BigDecimal, timestamp and date
> 
>
> Key: SPARK-18746
> URL: https://issues.apache.org/jira/browse/SPARK-18746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Weiqing Yang
>
> Run the code below in spark-shell, there will be an error:
> {code}
> scala> spark.createDataset(Seq(new java.math.BigDecimal(10)))
> :24: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing spark.implicits._  Support for serializing other types 
> will be added in future releases.
>spark.createDataset(Seq(new java.math.BigDecimal(10)))
>   ^
> scala>
> {code} 
> To fix the error above, {{newBigDecimalEncoder}} will be added in the PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user

2016-12-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18752.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16179
[https://github.com/apache/spark/pull/16179]

> "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
> --
>
> Key: SPARK-18752
> URL: https://issues.apache.org/jira/browse/SPARK-18752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.2.0
>
>
> We ran into an issue with the HiveShim code that calls "loadTable" and 
> "loadPartition" while testing with some recent changes in upstream Hive.
> The semantics in Hive changed slightly, and if you provide the wrong value 
> for "isSrcLocal" you now can end up with an invalid table: the Hive code will 
> move the temp directory to the final destination instead of moving its 
> children.
> The problem in Spark is that HiveShim.scala tries to figure out the value of 
> "isSrcLocal" based on where the source and target directories are; that's not 
> correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA 
> LOCAL" would set it to "true"). So we need to propagate that information from 
> the user query down to HiveShim.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18832) Spark SQL: Incorrect error message on calling registered UDF.

2016-12-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743324#comment-15743324
 ] 

Dongjoon Hyun commented on SPARK-18832:
---

Hi, [~roadster11].
Could you confirm that your Hive UDTF works well in Hive?

> Spark SQL: Incorrect error message on calling registered UDF.
> -
>
> Key: SPARK-18832
> URL: https://issues.apache.org/jira/browse/SPARK-18832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Lokesh Yadav
>
> On calling a registered UDF in metastore from spark-sql CLI, it gives a 
> generic error:
> Error in query: Undefined function: 'Sample_UDF'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'default'.
> The function is registered and it shows up in the list output by 'show 
> functions'.
> I am using a Hive UDTF, registering it using the statement: create function 
> Sample_UDF as 'com.udf.Sample_UDF' using JAR 
> '/local/jar/path/containing/the/class';
> and I am calling the function from spark-sql CLI as: SELECT 
> Sample_UDF("input_1", "input_2" )



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-12 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743313#comment-15743313
 ] 

Alex Bozarth commented on SPARK-18816:
--

And I still don't think this is a blocker, but I respect that you as a 
committer know more about Spark than I do.

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-12 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743307#comment-15743307
 ] 

Alex Bozarth commented on SPARK-18816:
--

I'm actually looking into it a bit right now and I think it's an issue with the 
jQuery code I used when I made the column conditional. If I find a solution 
I'll either open a quick PR or post the info here for you to fix.

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18765) Make values for spark.yarn.{am|driver|executor}.memoryOverhead have configurable units

2016-12-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18765.

Resolution: Won't Fix

I missed that this is already fixed in 2.0; since it's a new feature, I'd 
rather not add it to 1.6 (especially since it's unclear we'll have many new 
releases in that line).

> Make values for spark.yarn.{am|driver|executor}.memoryOverhead have 
> configurable units
> --
>
> Key: SPARK-18765
> URL: https://issues.apache.org/jira/browse/SPARK-18765
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3
>Reporter: Daisuke Kobayashi
>Priority: Trivial
>
> {{spark.yarn.\{driver|executor|am\}.memoryOverhead}} values are limited to 
> megabytes today. Users provide a value without a unit and Spark assumes it is 
> in MB. Since the overhead is often a few gigabytes, we should change the 
> memory overhead to work the same way as the executor or driver memory configs.
> Given that 2.0 has already covered this, it's worth having the 1.x code line 
> cover this capability as well. My PR lets users pass the values in multiple 
> ways (backward compatibility is not broken), like:
> {code}
> spark.yarn.executor.memoryOverhead=300m --> converted to 300
> spark.yarn.executor.memoryOverhead=500 --> converted to 500
> spark.yarn.executor.memoryOverhead=1g --> converted to 1024
> spark.yarn.executor.memoryOverhead=1024m --> converted to 1024
> {code}
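
A sketch of how that parsing can be expressed, assuming Spark's existing {{JavaUtils}} byte-string helper (the same helper the 2.x byte-size configs are built on); the expected values mirror the table above:

{code}
import org.apache.spark.network.util.JavaUtils

JavaUtils.byteStringAsMb("300m")   // 300
JavaUtils.byteStringAsMb("500")    // 500 (no unit is interpreted as MB)
JavaUtils.byteStringAsMb("1g")     // 1024
JavaUtils.byteStringAsMb("1024m")  // 1024
{code}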



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-12 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743298#comment-15743298
 ] 

Yin Huai commented on SPARK-18816:
--

Yea, log pages are still there. But, without those links on the executor page, 
it is very hard to find those pages. 

btw, is there any place that we should look at to find the cause of this 
problem?

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18816:
-
Priority: Blocker  (was: Major)

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-12 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743288#comment-15743288
 ] 

Alex Bozarth commented on SPARK-18816:
--

Thanks for following up. I was able to recreate the issue, but I personally 
won't have time to fix it before my holiday vacation. It's still not a blocker, 
because the log pages are still there; only the links to them are missing. You 
can still access the logs for each worker via the Worker UI links found on the 
Master UI.

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18816) executor page fails to show log links if executors are added after an app is launched

2016-12-12 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743260#comment-15743260
 ] 

Yin Huai commented on SPARK-18816:
--

[~ajbozarth] Yea, please take a look. Thanks! 

The reasons that I set it as a blocker are (1) those log links are super 
important for debugging; and (2) it is a regression from 2.0.

> executor page fails to show log links if executors are added after an app is 
> launched
> -
>
> Key: SPARK-18816
> URL: https://issues.apache.org/jira/browse/SPARK-18816
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
> Attachments: screenshot-1.png
>
>
> How to reproduce with standalone mode:
> 1. Launch a spark master
> 2. Launch a spark shell. At this point, there is no executor associated with 
> this application. 
> 3. Launch a slave. Now, there is an executor assigned to the spark shell. 
> However, there is no link to stdout/stderr on the executor page (please see 
> https://issues.apache.org/jira/secure/attachment/12842649/screenshot-1.png).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14519) Cross-publish Kafka for Scala 2.12

2016-12-12 Thread Jakob Odersky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Odersky updated SPARK-14519:
--
Summary: Cross-publish Kafka for Scala 2.12  (was: Cross-publish Kafka for 
Scala 2.12.0-M4)

> Cross-publish Kafka for Scala 2.12
> --
>
> Key: SPARK-14519
> URL: https://issues.apache.org/jira/browse/SPARK-14519
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>
> In order to build the streaming Kafka connector, we need to publish Kafka for 
> Scala 2.12.0-M4. Someone should file an issue against the Kafka project and 
> work with their developers to figure out what will block their upgrade / 
> release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16297) Mapping Boolean and string to BIT and NVARCHAR(MAX) for SQL Server jdbc dialect

2016-12-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16297.

   Resolution: Fixed
 Assignee: Oussama Mekni
Fix Version/s: 2.2.0

> Mapping Boolean and string  to BIT and NVARCHAR(MAX) for SQL Server jdbc 
> dialect
> 
>
> Key: SPARK-16297
> URL: https://issues.apache.org/jira/browse/SPARK-16297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Oussama Mekni
>Assignee: Oussama Mekni
>Priority: Minor
> Fix For: 2.2.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Tested with SQLServer 2012 and SQLServer Express:
> - Fix mapping of StringType to NVARCHAR(MAX)
> - Fix mapping of BooleanType to BIT
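
For illustration, a sketch of how such a mapping can be plugged in through Spark's {{JdbcDialect}} extension point (the actual fix lands in the built-in SQL Server dialect; the object below is only illustrative):

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{BooleanType, DataType, StringType}

object SqlServerBitNVarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")
  // Map Spark types to SQL Server column definitions as described above.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("NVARCHAR(MAX)", java.sql.Types.NVARCHAR))
    case BooleanType => Some(JdbcType("BIT", java.sql.Types.BIT))
    case _           => None
  }
}

JdbcDialects.registerDialect(SqlServerBitNVarcharDialect)
{code}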



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x

2016-12-12 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743076#comment-15743076
 ] 

Michael Allman commented on SPARK-18676:


I'm sorry I have not had time to provide more information on my findings. I can 
summarize by saying that none of the source patch variations I tried provided 
an accurate estimate. Making this estimate more accurate could be a good 
project for 2.2.

In the absence of an accurate size estimate, [~davies]'s idea for switching to 
ShuffleJoin for oversized broadcasts sounds like a good idea. [~davies], is 
that something you'd like to work on?

> Spark 2.x query plan data size estimation can crash join queries versus 1.x
> ---
>
> Key: SPARK-18676
> URL: https://issues.apache.org/jira/browse/SPARK-18676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Michael Allman
>
> Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly 
> modified the way Spark SQL estimates the output data size of query plans. 
> I've found that, with the new table query partition pruning support in 
> 2.1, this has led in some cases to underestimation of join plan child size 
> statistics to a degree that makes executing such queries impossible without 
> disabling automatic broadcast conversion.
> In one case we debugged, the query planner had estimated the size of a join 
> child to be 3,854 bytes. In the execution of this child query, Spark reads 20 
> million rows in 1 GB of data from parquet files and shuffles 722.9 MB of 
> data, outputting 17 million rows. In planning the original join query, Spark 
> converts the child to a {{BroadcastExchange}}. This query execution fails 
> unless automatic broadcast conversion is disabled.
> This particular query is complex and very specific to our data and schema. I 
> have not yet developed a reproducible test case that can be shared. I realize 
> this ticket does not give the Spark team a lot to work with to reproduce and 
> test this issue, but I'm available to help. At the moment I can suggest 
> running a join where one side is an aggregation selecting a few fields over a 
> large table with a wide schema including many string columns.
> This issue exists in Spark 2.0, but we never encountered it because in that 
> version it only manifests itself for partitioned relations read from the 
> filesystem, and we rarely use this feature. We've encountered this issue in 
> 2.1 because 2.1 does partition pruning for metastore tables now.
> As a back stop, we've patched our branch of Spark 2.1 to revert the 
> reductions in default data type size for string, binary and user-defined 
> types. We also removed the override of the statistics method in {{UnaryNode}} 
> which reduces the output size of a plan based on the ratio of that plan's 
> output schema size versus its children's. We have not had this problem since.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15844) HistoryServer doesn't come up if spark.authenticate = true

2016-12-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-15844.

   Resolution: Fixed
 Assignee: Steve Loughran
Fix Version/s: 2.2.0

> HistoryServer doesn't come up if spark.authenticate = true
> --
>
> Key: SPARK-15844
> URL: https://issues.apache.org/jira/browse/SPARK-15844
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: cluster with spark.authenticate  = true
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 2.2.0
>
>
> If the configuration used to start the history server has 
> {{spark.authenticate}} set, then the server doesn't come up: there's no 
> secret for the {{SecurityManager}}, even though that secret is used for the 
> secure shuffle, which the history server doesn't go anywhere near.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18817) Ensure nothing is written outside R's tempdir() by default

2016-12-12 Thread Brendan Dwyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brendan Dwyer updated SPARK-18817:
--
Description: 
Per CRAN policies
https://cran.r-project.org/web/packages/policies.html
{quote}
- Packages should not write in the users’ home filespace, nor anywhere else on 
the file system apart from the R session’s temporary directory (or during 
installation in the location pointed to by TMPDIR: and such usage should be 
cleaned up). Installing into the system’s R installation (e.g., scripts to its 
bin directory) is not allowed.
Limited exceptions may be allowed in interactive sessions if the package 
obtains confirmation from the user.

- Packages should not modify the global environment (user’s workspace).
{quote}

Currently "spark-warehouse" gets created in the working directory when 
sparkR.session() is called.

  was:
Per CRAN policies
https://cran.r-project.org/web/packages/policies.html
"Packages should not write in the users’ home filespace, nor anywhere else on 
the file system apart from the R session’s temporary directory (or during 
installation in the location pointed to by TMPDIR: and such usage should be 
cleaned up). Installing into the system’s R installation (e.g., scripts to its 
bin directory) is not allowed.
Limited exceptions may be allowed in interactive sessions if the package 
obtains confirmation from the user.

- Packages should not modify the global environment (user’s workspace)."

Currently "spark-warehouse" gets created in the working directory when 
sparkR.session() is called.


> Ensure nothing is written outside R's tempdir() by default
> --
>
> Key: SPARK-18817
> URL: https://issues.apache.org/jira/browse/SPARK-18817
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Brendan Dwyer
>
> Per CRAN policies
> https://cran.r-project.org/web/packages/policies.html
> {quote}
> - Packages should not write in the users’ home filespace, nor anywhere else 
> on the file system apart from the R session’s temporary directory (or during 
> installation in the location pointed to by TMPDIR: and such usage should be 
> cleaned up). Installing into the system’s R installation (e.g., scripts to 
> its bin directory) is not allowed.
> Limited exceptions may be allowed in interactive sessions if the package 
> obtains confirmation from the user.
> - Packages should not modify the global environment (user’s workspace).
> {quote}
> Currently "spark-warehouse" gets created in the working directory when 
> sparkR.session() is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14631) "drop database cascade" needs to unregister functions for HiveExternalCatalog

2016-12-12 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang closed SPARK-14631.
---
Resolution: Not A Problem

> "drop database cascade" needs to unregister functions for HiveExternalCatalog
> -
>
> Key: SPARK-14631
> URL: https://issues.apache.org/jira/browse/SPARK-14631
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>
> As in HIVE-12304, Hive's drop database cascade did not drop functions as well. 
> We need to fix this when calling `dropDatabase` in HiveExternalCatalog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2016-12-12 Thread Sean McKibben (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15743024#comment-15743024
 ] 

Sean McKibben commented on SPARK-17147:
---

I apologize for the extended radio silence on this. I've been trying to track 
down problems I'm seeing after dynamically scaling down Spark Streaming jobs in 
Mesos. Haven't been able to attribute them to Spark Streaming Kafka (or a 
version thereof) yet but it's been difficult to pin down. I am hoping to return 
to testing the compacted consumer modifications soon.

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.
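
Not the actual KafkaRDD/CachedKafkaConsumer code, just a self-contained sketch of the pattern the report describes ({{Rec}} is a placeholder standing in for Kafka's ConsumerRecord): advance from the offset Kafka actually returned rather than assuming offsets are consecutive.

{code}
// Placeholder record type standing in for Kafka's ConsumerRecord.
final case class Rec(offset: Long, value: String)

def advance(records: Seq[Rec], startOffset: Long): Long = {
  var nextOffset = startOffset
  records.foreach { record =>
    // Log compaction can leave gaps, so track record.offset + 1 instead of
    // incrementing the previous offset by exactly 1.
    nextOffset = record.offset + 1
  }
  nextOffset
}
{code}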



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18832) Spark SQL: Incorrect error message on calling registered UDF.

2016-12-12 Thread Lokesh Yadav (JIRA)
Lokesh Yadav created SPARK-18832:


 Summary: Spark SQL: Incorrect error message on calling registered 
UDF.
 Key: SPARK-18832
 URL: https://issues.apache.org/jira/browse/SPARK-18832
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2, 2.0.1, 2.0.0
Reporter: Lokesh Yadav


On calling a registered UDF in metastore from spark-sql CLI, it gives a generic 
error:
Error in query: Undefined function: 'Sample_UDF'. This function is neither a 
registered temporary function nor a permanent function registered in the 
database 'default'.

The function is registered and it shows up in the list output by 'show 
functions'.

I am using a Hive UDTF, registering it using the statement: create function 
Sample_UDF as 'com.udf.Sample_UDF' using JAR 
'/local/jar/path/containing/the/class';
and I am calling the function from spark-sql CLI as: SELECT 
Sample_UDF("input_1", "input_2" )
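
The same repro expressed through the Scala API, in case it helps narrow the problem down (class name and JAR path are taken verbatim from the report; assumes a Hive-enabled SparkSession named {{spark}}):

{code}
spark.sql("CREATE FUNCTION Sample_UDF AS 'com.udf.Sample_UDF' " +
  "USING JAR '/local/jar/path/containing/the/class'")
spark.sql("""SELECT Sample_UDF("input_1", "input_2")""").show()
{code}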





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-12 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742879#comment-15742879
 ] 

Nattavut Sutyanyong commented on SPARK-18814:
-

q92 has the same pattern as q32 and my simplified version. If possible, could 
you try applying my PR to verify that the problem is resolved?

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Herman van Hovell
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> +- SubqueryAlias date_dim
>+- 
> Rel

[jira] [Commented] (SPARK-18814) CheckAnalysis rejects TPCDS query 32

2016-12-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742828#comment-15742828
 ] 

Kazuaki Ishizaki commented on SPARK-18814:
--

I found the same error {{org.apache.spark.sql.AnalysisException: a GROUP BY 
clause in a scalar correlated subquery cannot contain non-correlated columns: 
ws_item_sk#1081;;}} when I ran q92 using master branch.

> CheckAnalysis rejects TPCDS query 32
> 
>
> Key: SPARK-18814
> URL: https://issues.apache.org/jira/browse/SPARK-18814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Herman van Hovell
>Priority: Blocker
>
> It seems the CheckAnalysis rule introduced by SPARK-18504 is incorrectly 
> rejecting this TPCDS query, which ran fine in Spark 2.0. There doesn't seem 
> to be any obvious error in the query or the check rule though: in the plan 
> below, the scalar subquery's condition field is "scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] ", which should reference cs_item_sk#39. 
> Nonetheless CheckAnalysis complains that cs_item_sk#39 is not referenced by 
> the scalar subquery predicates.
> analysis error:
> {code}
> == Query: q32-v1.4 ==
>  Can't be analyzed: org.apache.spark.sql.AnalysisException: a GROUP BY clause 
> in a scalar correlated subquery cannot contain non-correlated columns: 
> cs_item_sk#39;;
> GlobalLimit 100
> +- LocalLimit 100
>+- Aggregate [sum(cs_ext_discount_amt#46) AS excess discount amount#23]
>   +- Filter i_manufact_id#72 = 977) && (i_item_sk#59 = 
> cs_item_sk#39)) && ((d_date#83 >= 2000-01-27) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string && ((d_date_sk#81 = cs_sold_date_sk#58) && 
> (cast(cs_ext_discount_amt#46 as decimal(14,7)) > cast(scalar-subquery#24 
> [(cs_item_sk#39#111 = i_item_sk#59)] as decimal(14,7)
>  :  +- Project [(CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39 AS 
> cs_item_sk#39#111]
>  : +- Aggregate [cs_item_sk#39], 
> [CheckOverflow((promote_precision(cast(1.3 as decimal(11,6))) * 
> promote_precision(cast(avg(cs_ext_discount_amt#46) as decimal(11,6, 
> DecimalType(14,7)) AS (CAST(1.3 AS DECIMAL(11,6)) * 
> CAST(avg(cs_ext_discount_amt) AS DECIMAL(11,6)))#110, cs_item_sk#39]
>  :+- Filter (((d_date#83 >= 2000-01-27]) && (d_date#83 <= 
> cast(cast(cast(cast(2000-01-27 as date) as timestamp) + interval 12 weeks 6 
> days as date) as string))) && (d_date_sk#81 = cs_sold_date_sk#58))
>  :   +- Join Inner
>  :  :- SubqueryAlias catalog_sales
>  :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
>  :  +- SubqueryAlias date_dim
>  : +- 
> Relation[d_date_sk#81,d_date_id#82,d_date#83,d_month_seq#84,d_week_seq#85,d_quarter_seq#86,d_year#87,d_dow#88,d_moy#89,d_dom#90,d_qoy#91,d_fy_year#92,d_fy_quarter_seq#93,d_fy_week_seq#94,d_day_name#95,d_quarter_name#96,d_holiday#97,d_weekend#98,d_following_holiday#99,d_first_dom#100,d_last_dom#101,d_same_day_ly#102,d_same_day_lq#103,d_current_day#104,...
>  4 more fields] parquet
>  +- Join Inner
> :- Join Inner
> :  :- SubqueryAlias catalog_sales
> :  :  +- 
> Relation[cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,cs_ext_wholesale_cost#48,...
>  10 more fields] parquet
> :  +- SubqueryAlias item
> : +- 
> Relation[i_item_sk#59,i_item_id#60,i_rec_start_date#61,i_rec_end_date#62,i_item_desc#63,i_current_price#64,i_wholesale_cost#65,i_brand_id#66,i_brand#67,i_class_id#68,i_class#69,i_category_id#70,i_category#71,i_manufact_id#72,i_manufact#73,i_size#74,i_formulation#75,i_color#76,i_units#77,i_container#78,i_manager_id#79,i_product_name#80]
>  parquet
> 
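
For reference, here is a simplified sketch of the query shape being rejected, reconstructed from the analyzed plan above (TPCDS q32, with the 90-day date window collapsed to literal bounds). This is a hypothetical repro rather than text from the issue: it assumes the TPCDS tables are already registered, and on an affected build the {{spark.sql}} call itself should fail analysis with the error quoted above.

{code}
// Hypothetical repro sketch -- assumes the TPCDS tables (catalog_sales, item,
// date_dim) are registered as tables or temp views. The scalar subquery has no
// GROUP BY in the SQL; the Aggregate on cs_item_sk in the plan above comes from
// the analyzer's decorrelation rewrite.
spark.sql("""
  SELECT SUM(cs_ext_discount_amt) AS `excess discount amount`
  FROM   catalog_sales, item, date_dim
  WHERE  i_manufact_id = 977
    AND  i_item_sk = cs_item_sk
    AND  d_date BETWEEN '2000-01-27' AND '2000-04-26'
    AND  d_date_sk = cs_sold_date_sk
    AND  cs_ext_discount_amt > (
           SELECT 1.3 * AVG(cs_ext_discount_amt)
           FROM   catalog_sales, date_dim
           WHERE  cs_item_sk = i_item_sk            -- correlation to the outer item row
             AND  d_date BETWEEN '2000-01-27' AND '2000-04-26'
             AND  d_date_sk = cs_sold_date_sk)
  LIMIT  100
""")
{code}

The grouping on cs_item_sk and the re-alias cs_item_sk#39#111 in the plan above both come from the analyzer's subquery rewrite, not from the query text, which is presumably why the check trips on a column that is, in fact, the correlation column.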

[jira] [Commented] (SPARK-18829) Printing to logger

2016-12-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742806#comment-15742806
 ] 

Sean Owen commented on SPARK-18829:
---

If the request is "return the explain output as a string instead", OK by me, 
but then this should be edited to reflect that. I don't think it's necessary to 
make an API that specially sends that particular string somewhere.

> Printing to logger
> --
>
> Key: SPARK-18829
> URL: https://issues.apache.org/jira/browse/SPARK-18829
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2
> Environment: ALL
>Reporter: David Hodeffi
>Priority: Trivial
>  Labels: easyfix, patch
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I would like to print the output of dataframe.show or df.explain(true) to a 
> log file. Right now the code prints to standard output with no way to 
> redirect it, and this cannot be configured through log4j.properties.
> My suggestion is to write to both the logger and standard output, i.e.:
> class DataFrame {
>   // ...
>   override def explain(extended: Boolean): Unit = {
>     val explain = ExplainCommand(queryExecution.logical, extended = extended)
>     sqlContext.executePlan(explain).executedPlan.executeCollect().foreach { r =>
>       // scalastyle:off println
>       println(r.getString(0))
>       // scalastyle:on println
>       logger.debug(r.getString(0))
>     }
>   }
>   def show(numRows: Int, truncate: Boolean): Unit = {
>     val str = showString(numRows, truncate)
>     println(str)
>     logger.debug(str)
>   }
> }
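
A minimal workaround sketch that needs no Spark API change, assuming Spark 2.x: {{queryExecution.toString}} carries roughly the text that {{explain(true)}} prints, and the console output of {{show()}} can be captured on the calling thread, so both can be routed to any log4j logger. The {{logExplain}}/{{logShow}} helpers and the "query.plans" logger name below are hypothetical, not part of Spark.

{code}
import java.io.{ByteArrayOutputStream, PrintStream}
import org.apache.log4j.Logger
import org.apache.spark.sql.DataFrame

// Arbitrary logger name; configure its level/appender in log4j.properties.
val planLogger = Logger.getLogger("query.plans")

// Log the extended plan text without touching stdout.
def logExplain(df: DataFrame): Unit =
  planLogger.debug(df.queryExecution.toString)   // parsed/analyzed/optimized/physical plans

// Capture show()'s console output on this thread and send it to the logger too.
def logShow(df: DataFrame, numRows: Int = 20): Unit = {
  val buf = new ByteArrayOutputStream()
  Console.withOut(new PrintStream(buf, true, "UTF-8")) {
    df.show(numRows)
  }
  planLogger.debug(buf.toString("UTF-8"))
}
{code}

With this in place, adding something like {{log4j.logger.query.plans=DEBUG}} (plus a file appender) to log4j.properties would route the plan text to a file, which is roughly what the request above is after.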



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18795) SparkR vignette update: ksTest

2016-12-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742790#comment-15742790
 ] 

Joseph K. Bradley commented on SPARK-18795:
---

OK thanks!  Let's keep these vignettes short to make sure we get them in soon.  
We can always expand them later.

> SparkR vignette update: ksTest
> --
>
> Key: SPARK-18795
> URL: https://issues.apache.org/jira/browse/SPARK-18795
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> Update vignettes to cover ksTest



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18325) SparkR 2.1 QA: Check for new R APIs requiring example code

2016-12-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18325:
--
Fix Version/s: 2.2.0

> SparkR 2.1 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-18325
> URL: https://issues.apache.org/jira/browse/SPARK-18325
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
> Fix For: 2.1.1, 2.2.0
>
>
> Audit the list of new features added to MLlib's R API, and see which major 
> items are missing example code (in the examples folder). We do not need 
> examples for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17409) Query in CTAS is Optimized Twice

2016-12-12 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742691#comment-15742691
 ] 

Xiao Li commented on SPARK-17409:
-

Sure, will follow it. 

> Query in CTAS is Optimized Twice
> 
>
> Key: SPARK-17409
> URL: https://issues.apache.org/jira/browse/SPARK-17409
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The query in CTAS is optimized twice, as reported in the PR: 
> https://github.com/apache/spark/pull/14797
> {quote}
> Some analyzer rules have assumptions on logical plans, and the optimizer may 
> break these assumptions. We should not pass an optimized query plan into 
> QueryExecution (it will be analyzed again), otherwise we may hit some weird bugs.
> {quote}
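
A small illustration of the distinction the quote draws, using only the public {{queryExecution}} handle on Spark 2.x. This is illustrative only, not the actual CTAS code path; the local names {{analyzedPlan}}/{{optimizedPlan}} are just for the sketch.

{code}
// Illustrative sketch only (public APIs; not the CTAS code path): the two plan
// stages the quote distinguishes, viewed through queryExecution.
val df = spark.range(100).filter("id > 5").selectExpr("id + 1 AS x")

val analyzedPlan  = df.queryExecution.analyzed       // resolved, but not yet optimized
val optimizedPlan = df.queryExecution.optimizedPlan  // already optimized by Catalyst

// The bug described above: the CTAS path effectively handed something like
// `optimizedPlan` (rather than the analyzed/logical plan) to a new QueryExecution,
// so the analyzer and optimizer ran over an already-optimized plan a second time.
{code}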



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18825) Eliminate duplicate links in SparkR API doc index

2016-12-12 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742601#comment-15742601
 ] 

Shivaram Venkataraman commented on SPARK-18825:
---

Yeah, this is tricky because of CRAN requirements / S4 method naming schemes. I 
am not sure there is an easy way out. Longer term, we could try to make a better 
index.html page that modifies the default one generated by roxygen2 -- or open 
an issue with them?

> Eliminate duplicate links in SparkR API doc index
> -
>
> Key: SPARK-18825
> URL: https://issues.apache.org/jira/browse/SPARK-18825
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> The SparkR API docs contain many duplicate links with suffixes {{-method}} or 
> {{-class}} in the index. E.g., {{atan}} and {{atan-method}} link to the same 
> doc.
> Copying from [~felixcheung] in [SPARK-18332]:
> {quote}
> They are because of the
> {{@ aliases}}
> tags. I think we are adding them because CRAN checks require them to match 
> the specific format - [~shivaram] would you know?
> I am pretty sure they are double-listed because in addition to aliases we 
> also have
> {{@ rdname}}
> which automatically generate the links as well.
> I suspect if we change all the rdname to match the string in aliases then 
> there will be one link. I can take a shot at this to test this out, but 
> changes will be very extensive - is this something we could get into 2.1 
> still?
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17892) Query in CTAS is Optimized Twice (branch-2.0)

2016-12-12 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15742575#comment-15742575
 ] 

Josh Rosen commented on SPARK-17892:


In the future, please re-use the existing JIRA when backporting _OR_ link the 
JIRA. If you go to SPARK-17409 then it's hard to spot that it's been backported 
into branch-2.x.

> Query in CTAS is Optimized Twice (branch-2.0)
> -
>
> Key: SPARK-17892
> URL: https://issues.apache.org/jira/browse/SPARK-17892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Yin Huai
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.0.2
>
>
> This tracks the work that fixes the problem shown in  
> https://issues.apache.org/jira/browse/SPARK-17409 to branch 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


