[jira] [Created] (SPARK-18847) PageRank gives incorrect results for graphs with sinks

2016-12-13 Thread Andrew Ray (JIRA)
Andrew Ray created SPARK-18847:
--

 Summary: PageRank gives incorrect results for graphs with sinks
 Key: SPARK-18847
 URL: https://issues.apache.org/jira/browse/SPARK-18847
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.0.2, 1.6.3, 1.5.2, 1.4.1, 1.3.1, 1.2.2, 1.1.1, 1.0.2
Reporter: Andrew Ray


Sink vertices (those with no outgoing edges) should distribute their rank 
evenly across the entire graph, but in the current implementation that rank is 
simply lost.
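
A minimal, hypothetical sketch of the missing redistribution step, on plain 
Scala maps rather than GraphX (names and structure are illustrative only, not 
the actual implementation):

{code}
// Hedged sketch: rank held by sink vertices is spread evenly over all
// vertices instead of being dropped. Plain Scala collections are used purely
// for illustration; a GraphX fix would work on the rank/degree RDDs instead.
def redistributeSinkRank(ranks: Map[Long, Double],
                         outDegree: Map[Long, Int]): Map[Long, Double] = {
  val n = ranks.size
  // Total rank sitting on vertices with no outgoing edges.
  val sinkMass =
    ranks.collect { case (v, r) if outDegree.getOrElse(v, 0) == 0 => r }.sum
  // Every vertex receives an equal share of that mass.
  ranks.map { case (v, r) => v -> (r + sinkMass / n) }
}
{code}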






[jira] [Updated] (SPARK-18471) In treeAggregate, generate (big) zeros instead of sending them.

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18471:
--
Assignee: Anthony Truchet

> In treeAggregate, generate (big) zeros instead of sending them.
> ---
>
> Key: SPARK-18471
> URL: https://issues.apache.org/jira/browse/SPARK-18471
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Reporter: Anthony Truchet
>Assignee: Anthony Truchet
>Priority: Minor
> Fix For: 2.2.0
>
>
> When using an optimization routine like LBFGS, treeAggregate currently sends 
> the zero vector as part of the closure. This zero can be huge (e.g. ML vectors 
> with millions of zeros) but can be easily generated.
> Several options are possible (patches for some of them are coming soon).
> One is to provide a treeAggregateWithZeroGenerator method (either in core or 
> in MLlib) which wraps treeAggregate in an Option and generates the zero if None.
> Another one is to rewrite treeAggregate to wrap an underlying implementation 
> which uses a zero generator directly.
> There might be other, better alternatives we have not spotted...
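
A minimal sketch of the first option above, with illustrative names (not the 
actual patch): wrap the aggregation in an Option so that only None travels 
with the closure and the real zero is generated lazily on the executors.

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hedged sketch: treeAggregate over Option[U] so the big zero vector never
// has to be serialized into the closure.
def treeAggregateWithZeroGenerator[T: ClassTag, U: ClassTag](rdd: RDD[T])(
    zeroGen: () => U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U = {
  val seqOpt: (Option[U], T) => Option[U] =
    (acc, t) => Some(seqOp(acc.getOrElse(zeroGen()), t))
  val combOpt: (Option[U], Option[U]) => Option[U] = {
    case (Some(a), Some(b)) => Some(combOp(a, b))
    case (a, b)             => a.orElse(b)
  }
  // Only None is shipped; each partition builds its own zero when needed.
  rdd.treeAggregate(Option.empty[U])(seqOpt, combOpt, depth).getOrElse(zeroGen())
}
{code}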






[jira] [Resolved] (SPARK-18471) In treeAggregate, generate (big) zeros instead of sending them.

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18471.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16037
[https://github.com/apache/spark/pull/16037]

> In treeAggregate, generate (big) zeros instead of sending them.
> ---
>
> Key: SPARK-18471
> URL: https://issues.apache.org/jira/browse/SPARK-18471
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Reporter: Anthony Truchet
>Priority: Minor
> Fix For: 2.2.0
>
>
> When using an optimization routine like LBFGS, treeAggregate currently sends 
> the zero vector as part of the closure. This zero can be huge (e.g. ML vectors 
> with millions of zeros) but can be easily generated.
> Several options are possible (patches for some of them are coming soon).
> One is to provide a treeAggregateWithZeroGenerator method (either in core or 
> in MLlib) which wraps treeAggregate in an Option and generates the zero if None.
> Another one is to rewrite treeAggregate to wrap an underlying implementation 
> which uses a zero generator directly.
> There might be other, better alternatives we have not spotted...






[jira] [Updated] (SPARK-18715) Fix wrong AIC calculation in Binomial GLM

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18715:
--
Assignee: Wayne Zhang
Priority: Major  (was: Critical)

> Fix wrong AIC calculation in Binomial GLM
> -
>
> Key: SPARK-18715
> URL: https://issues.apache.org/jira/browse/SPARK-18715
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: patch
> Fix For: 2.2.0
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The AIC calculation in Binomial GLM seems to be wrong when there are weights. 
> The result is different from that in R.
> The current implementation is:
> {code}
>   -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
> weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
>   }.sum()
> {code} 
> Suggest changing this to 
> {code}
>   -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
> val wt = math.round(weight).toInt
> if (wt == 0){
>   0.0
> } else {
>   dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
> }
>   }.sum()
> {code} 
> 
> 
> The following is an example to illustrate the problem.
> {code}
> val dataset = Seq(
>   LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(12, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(16, 1.0))
> ).toDF().withColumn("weight", col("label") + 1.0)
> val glr = new GeneralizedLinearRegression()
> .setFamily("binomial")
> .setWeightCol("weight")
> .setRegParam(0)
> val model = glr.fit(dataset)
> model.summary.aic
> {code}
> This calculation shows the AIC is 14.189026847171382. To verify whether this 
> is correct, I ran the same analysis in R and got AIC = 11.66092, -2 * LogLik 
> = 5.660918. 
> {code}
> da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
> 0,18,1,1
> 0.5,12,0,1.5
> 1,15,0,2
> 0,13,2,1
> 0,15,1,1
> 0.5,16,1,1.5
> da <- as.data.frame(da)
> f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w)
> AIC(f)
> -2 * logLik(f)
> {code}
> Now, I check whether the proposed change is correct. The following calculates 
> -2 * LogLik manually and gets 5.6609177228379055, the same as in R.
> {code}
> val predictions = model.transform(dataset)
> -2.0 * predictions.select("label", "prediction", "weight").rdd.map {case 
> Row(y: Double, mu: Double, weight: Double) =>
>   val wt = math.round(weight).toInt
>   if (wt == 0){
> 0.0
>   } else {
> dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>   }
>   }.sum()
> {code}






[jira] [Resolved] (SPARK-18715) Fix wrong AIC calculation in Binomial GLM

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18715.
---
Resolution: Fixed

Issue resolved by pull request 16149
[https://github.com/apache/spark/pull/16149]

> Fix wrong AIC calculation in Binomial GLM
> -
>
> Key: SPARK-18715
> URL: https://issues.apache.org/jira/browse/SPARK-18715
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Priority: Critical
>  Labels: patch
> Fix For: 2.2.0
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The AIC calculation in Binomial GLM seems to be wrong when there are weights. 
> The result is different from that in R.
> The current implementation is:
> {code}
>   -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
> weight * dist.Binomial(1, mu).logProbabilityOf(math.round(y).toInt)
>   }.sum()
> {code} 
> Suggest changing this to 
> {code}
>   -2.0 * predictions.map { case (y: Double, mu: Double, weight: Double) =>
> val wt = math.round(weight).toInt
> if (wt == 0){
>   0.0
> } else {
>   dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
> }
>   }.sum()
> {code} 
> 
> 
> The following is an example to illustrate the problem.
> {code}
> val dataset = Seq(
>   LabeledPoint(0.0, Vectors.dense(18, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(12, 0.0)),
>   LabeledPoint(1.0, Vectors.dense(15, 0.0)),
>   LabeledPoint(0.0, Vectors.dense(13, 2.0)),
>   LabeledPoint(0.0, Vectors.dense(15, 1.0)),
>   LabeledPoint(0.5, Vectors.dense(16, 1.0))
> ).toDF().withColumn("weight", col("label") + 1.0)
> val glr = new GeneralizedLinearRegression()
> .setFamily("binomial")
> .setWeightCol("weight")
> .setRegParam(0)
> val model = glr.fit(dataset)
> model.summary.aic
> {code}
> This calculation shows the AIC is 14.189026847171382. To verify whether this 
> is correct, I ran the same analysis in R and got AIC = 11.66092, -2 * LogLik 
> = 5.660918. 
> {code}
> da <- scan(, what=list(y = 0, x1 = 0, x2 = 0, w = 0), sep = ",")
> 0,18,1,1
> 0.5,12,0,1.5
> 1,15,0,2
> 0,13,2,1
> 0,15,1,1
> 0.5,16,1,1.5
> da <- as.data.frame(da)
> f <- glm(y ~ x1 + x2 , data = da, family = binomial(), weight = w)
> AIC(f)
> -2 * logLik(f)
> {code}
> Now, I check whether the proposed change is correct. The following calculates 
> -2 * LogLik manually and gets 5.6609177228379055, the same as in R.
> {code}
> val predictions = model.transform(dataset)
> -2.0 * predictions.select("label", "prediction", "weight").rdd.map {case 
> Row(y: Double, mu: Double, weight: Double) =>
>   val wt = math.round(weight).toInt
>   if (wt == 0){
> 0.0
>   } else {
> dist.Binomial(wt, mu).logProbabilityOf(math.round(y * weight).toInt)
>   }
>   }.sum()
> {code}






[jira] [Commented] (SPARK-18845) PageRank has incorrect initialization value that leads to slow convergence

2016-12-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746316#comment-15746316
 ] 

Sean Owen commented on SPARK-18845:
---

See https://issues.apache.org/jira/browse/SPARK-7005 ?

> PageRank has incorrect initialization value that leads to slow convergence
> --
>
> Key: SPARK-18845
> URL: https://issues.apache.org/jira/browse/SPARK-18845
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2
>Reporter: Andrew Ray
>
> All variants of PageRank in GraphX have an incorrect initialization value 
> that leads to slow convergence. In the current implementations, ranks are 
> seeded with the reset probability when they should be seeded with 1. This 
> appears to have been introduced a long time ago in 
> https://github.com/apache/spark/commit/15a564598fe63003652b1e24527c432080b5976c#diff-b2bf3f97dcd2f19d61c921836159cda9L90
> This also hides the fact that source vertices (vertices with no incoming 
> edges) are not updated. This is because source vertices generally* have 
> pagerank equal to the reset probability. Therefore both need to be fixed at 
> once.
> PR will be added shortly
> *when there are no sinks -- but that's a separate bug






[jira] [Created] (SPARK-18846) Fix flakiness in SchedulerIntegrationSuite

2016-12-13 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-18846:


 Summary: Fix flakiness in SchedulerIntegrationSuite
 Key: SPARK-18846
 URL: https://issues.apache.org/jira/browse/SPARK-18846
 Project: Spark
  Issue Type: Test
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Imran Rashid
Assignee: Imran Rashid
Priority: Minor


There is a small race in {{SchedulerIntegrationSuite}} that has failed at least 
one set of tests.  The test assumes that the task scheduler thread processing 
the last task will finish before the DAGScheduler processes the task event and 
notifies the job waiter, but that is not 100% guaranteed.






[jira] [Created] (SPARK-18845) PageRank has incorrect initialization value that leads to slow convergence

2016-12-13 Thread Andrew Ray (JIRA)
Andrew Ray created SPARK-18845:
--

 Summary: PageRank has incorrect initialization value that leads to 
slow convergence
 Key: SPARK-18845
 URL: https://issues.apache.org/jira/browse/SPARK-18845
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.0.2, 1.6.3, 1.5.2, 1.4.1, 1.3.1, 1.2.2
Reporter: Andrew Ray


All variants of PageRank in GraphX have an incorrect initialization value that 
leads to slow convergence. In the current implementations, ranks are seeded 
with the reset probability when they should be seeded with 1. This appears to 
have been introduced a long time ago in 
https://github.com/apache/spark/commit/15a564598fe63003652b1e24527c432080b5976c#diff-b2bf3f97dcd2f19d61c921836159cda9L90

This also hides the fact that source vertices (vertices with no incoming edges) 
are not updated. This is because source vertices generally* have pagerank equal 
to the reset probability. Therefore both need to be fixed at once.

PR will be added shortly

*when there are no sinks -- but that's a separate bug
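
A hedged, self-contained illustration of the seeding point (plain Scala, not 
the GraphX source): on a two-vertex cycle the fixed-point rank is 1.0 per 
vertex, so seeding at 1.0 is already converged, while seeding at the reset 
probability needs many iterations to crawl up to the same value.

{code}
val resetProb = 0.15
// One vertex's rank update on a symmetric 2-cycle:
// r' = resetProb + (1 - resetProb) * r
def iterate(rank: Double, steps: Int): Double =
  (1 to steps).foldLeft(rank)((r, _) => resetProb + (1 - resetProb) * r)

iterate(1.0, 1)        // 1.0    -- already at the fixed point
iterate(resetProb, 1)  // 0.2775 -- seeded with resetProb, far from 1.0
iterate(resetProb, 20) // ~0.97  -- still approaching 1.0 after 20 iterations
{code}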






[jira] [Commented] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2016-12-13 Thread Zak Patterson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746293#comment-15746293
 ] 

Zak Patterson commented on SPARK-18844:
---

I'm not very familiar with the Python API, but it seems to me that the two 
methods available in Scala (precision and recall) are not available in Python? 
https://github.com/apache/spark/blob/v2.1.0-rc2/python/pyspark/mllib/evaluation.py#L29

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
> Fix For: 2.0.2
>
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.






[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2016-12-13 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746217#comment-15746217
 ] 

Shixiong Zhu commented on SPARK-18838:
--

[~sitalke...@gmail.com] Instead of your proposal, I prefer to make 
ExecutorAllocationManager not depend on the listener bus, which may drop 
critical events.

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delays in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, like `ExecutorAllocationManager` and `HeartbeatReceiver`, depend on 
> `ListenerBus` events, and this delay causes job failures. For example, a 
> significant delay in receiving `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` to remove an executor that is not idle.  The event 
> processor in `ListenerBus` is a single thread that loops through all the 
> listeners for each event and processes each event synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  
> The single-threaded processor often becomes the bottleneck for large jobs.  
> In addition, if one of the listeners is very slow, all listeners pay the 
> price of the delay incurred by the slow listener. 
> To solve the above problems, we plan to have a per-listener single-threaded 
> executor service and a separate event queue per listener. That way we are not 
> bottlenecked by the single-threaded event processor, and critical listeners 
> are not penalized by slow listeners. The downside of this approach is that a 
> separate event queue per listener will increase the driver memory footprint. 
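
A hedged sketch of the per-listener approach described above, using 
illustrative stand-ins rather than Spark's actual LiveListenerBus types: each 
listener gets its own single-threaded executor (and therefore its own work 
queue), so a slow listener only delays itself.

{code}
import java.util.concurrent.{ExecutorService, Executors}

// Simplified stand-ins for SparkListenerEvent / SparkListenerInterface.
trait Event
trait Listener { def onEvent(e: Event): Unit }

class PerListenerBus(listeners: Seq[Listener]) {
  // One single-threaded executor (with its own internal queue) per listener.
  private val executors: Map[Listener, ExecutorService] =
    listeners.map(l => l -> Executors.newSingleThreadExecutor()).toMap

  def post(e: Event): Unit =
    // Hand the event to every listener's own queue; no listener blocks another.
    executors.foreach { case (l, ex) =>
      ex.submit(new Runnable { def run(): Unit = l.onEvent(e) })
    }

  def stop(): Unit = executors.values.foreach(_.shutdown())
}
{code}

The memory trade-off mentioned above shows up directly here: each executor's 
internal queue holds its own backlog of pending events.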






[jira] [Commented] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2016-12-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746212#comment-15746212
 ] 

Sean Owen commented on SPARK-18844:
---

Yeah I think we discussed something like this before, and the drawback was just 
filling up the API with variations that are mostly not used. False positive and 
specificity might see some use. This would have to go in MulticlassMetrics too, 
and the Python API of both, for completeness.

Still that doesn't mean it's not doable. I tend to agree that it makes sense if 
anything to expose the 'computer' API, but then it's not clear how to translate 
that to multiclass and Python.

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
> Fix For: 2.0.2
>
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.






[jira] [Comment Edited] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-13 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746023#comment-15746023
 ] 

Mike Dusenberry edited comment on SPARK-18281 at 12/13/16 7:56 PM:
---

Here's another interesting finding.  The first (original) example fails with 
the timeout.  However, if you create the DataFrame, do something with it, and 
then create the iterator, it will work.

{code}
df = spark.createDataFrame([[1],[2],[3]])
it = df.toLocalIterator()  # FAILS HERE
row = next(it)
{code}

{code}
df = spark.createDataFrame([[1],[2],[3]])
df.count()
it = df.toLocalIterator()  # No longer fails
row = next(it)
{code}

This leads me to believe there may be something wrong with the creation of the 
DataFrame.


was (Author: mwdus...@us.ibm.com):
Here's another interesting finding.  The first (original) example fails with 
the timeout.  However, if you create the DataFrame, do something with it, and 
then create the iterator, it will work.

{code}
df = spark.createDataFrame([[1],[2],[3]])
it = df.toLocalIterator()  # FAILS HERE
row = next(it)
{code}

{code}
df = spark.createDataFrame([[1],[2],[3]])
df.count()
it = df.toLocalIterator()  # No longer fails
row = next(it)
{code}

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>
> I run the example straight out of the api docs for toLocalIterator and it 
> gives a time out exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}





[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-13 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746023#comment-15746023
 ] 

Mike Dusenberry commented on SPARK-18281:
-

Here's another interesting finding.  The first (original) example fails with 
the timeout.  However, if you create the DataFrame, do something with it, and 
then create the iterator, it will work.

{code}
df = spark.createDataFrame([[1],[2],[3]])
it = df.toLocalIterator()  # FAILS HERE
row = next(it)
{code}

{code}
df = spark.createDataFrame([[1],[2],[3]])
df.count()
it = df.toLocalIterator()  # No longer fails
row = next(it)
{code}

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>
> I run the example straight out of the api docs for toLocalIterator and it 
> gives a time out exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}






[jira] [Commented] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-13 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746002#comment-15746002
 ] 

Mike Dusenberry commented on SPARK-18281:
-

[~viirya] Thanks for taking on this bug!  I tried out the PR, and I'm still 
running into a socket timeout error for the example I gave above:

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mwdusenb/spark/python/pyspark/sql/dataframe.py", line 416, in 
toLocalIterator
peek = next(iter)
  File "/home/mwdusenb/spark/python/pyspark/rdd.py", line 140, in 
_load_from_socket
for item in serializer.load_stream(rf):
  File "/home/mwdusenb/spark/python/pyspark/serializers.py", line 144, in 
load_stream
yield self._read_with_length(stream)
  File "/home/mwdusenb/spark/python/pyspark/serializers.py", line 161, in 
_read_with_length
length = read_int(stream)
  File "/home/mwdusenb/spark/python/pyspark/serializers.py", line 555, in 
read_int
length = stream.read(4)
  File "/opt/anaconda3/lib/python3.5/socket.py", line 575, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
{code}

Interestingly, it looks like the {{it = df.toLocalIterator()}} line launches a 
very large number of {{toLocalIterator}} jobs, and then the Python socket times 
out while those jobs are running.

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
>
> I run the example straight out of the api docs for toLocalIterator and it 
> gives a time out exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}




[jira] [Updated] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2016-12-13 Thread Zak Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zak Patterson updated SPARK-18844:
--
Description: 
BinaryClassificationMetrics only implements Precision (positive predictive 
value) and recall (true positive rate). It should implement more comprehensive 
metrics.

Moreover, the instance variables storing computed counts are marked private, 
and there are no accessors for them. So if one desired to add this 
functionality, one would have to duplicate this calculation, which is not 
trivial:

https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144

Currently Implemented Metrics
---
* Precision (PPV): `precisionByThreshold`
* Recall (Sensitivity, true positive rate): `recallByThreshold`

Desired additional metrics
---
* False omission rate: `forByThreshold`
* False discovery rate: `fdrByThreshold`
* Negative predictive value: `npvByThreshold`
* False negative rate: `fnrByThreshold`
* True negative rate (Specificity): `specificityByThreshold`
* False positive rate: `fprByThreshold`



Alternatives
---

The `createCurve` method is marked private. If it were marked public, and the 
trait BinaryClassificationMetricComputer were also marked public, then it would 
be easy to define new computers to get whatever the user wanted.

  was:
BinaryClassificationMetrics only implements Precision (positive predictive 
value) and recall (true positive rate). It should implement more comprehensive 
metrics.

Moreover, the instance variables storing computed counts are marked private, 
and there are no accessors for them. So if one desired to add this 
functionality, one would have to duplicate this calculation, which is not 
trivial:

https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144

Currently Implemented Metrics
---
* Precision (PPV): `precisionByThreshold`
* Recall (Sensitivity, true positive rate): `recallByThreshold`

Desired additional metrics
---
* False omission rate: `fprByThreshold`
* False discovery rate: `fdrByThreshold`
* Negative predictive value: `npvByThreshold`
* False negative rate: `fnrByThreshold`
* True negative rate (Specificity): `specificityByThreshold`
* False positive rate: `fprByThreshold`



Alternatives
---

The `createCurve` method is marked private. If it were marked public, and the 
trait BinaryClassificationMetricComputer were also marked public, then it would 
be easy to define new computers to get whatever the user wanted.


> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
> Fix For: 2.0.2
>
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `forByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.






[jira] [Updated] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2016-12-13 Thread Zak Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zak Patterson updated SPARK-18844:
--
Remaining Estimate: 5h  (was: 5m)
 Original Estimate: 5h  (was: 5m)

> Add more binary classification metrics to BinaryClassificationMetrics
> -
>
> Key: SPARK-18844
> URL: https://issues.apache.org/jira/browse/SPARK-18844
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Zak Patterson
>Priority: Minor
>  Labels: evaluation
> Fix For: 2.0.2
>
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> BinaryClassificationMetrics only implements Precision (positive predictive 
> value) and recall (true positive rate). It should implement more 
> comprehensive metrics.
> Moreover, the instance variables storing computed counts are marked private, 
> and there are no accessors for them. So if one desired to add this 
> functionality, one would have to duplicate this calculation, which is not 
> trivial:
> https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144
> Currently Implemented Metrics
> ---
> * Precision (PPV): `precisionByThreshold`
> * Recall (Sensitivity, true positive rate): `recallByThreshold`
> Desired additional metrics
> ---
> * False omission rate: `fprByThreshold`
> * False discovery rate: `fdrByThreshold`
> * Negative predictive value: `npvByThreshold`
> * False negative rate: `fnrByThreshold`
> * True negative rate (Specificity): `specificityByThreshold`
> * False positive rate: `fprByThreshold`
> Alternatives
> ---
> The `createCurve` method is marked private. If it were marked public, and the 
> trait BinaryClassificationMetricComputer were also marked public, then it 
> would be easy to define new computers to get whatever the user wanted.






[jira] [Created] (SPARK-18844) Add more binary classification metrics to BinaryClassificationMetrics

2016-12-13 Thread Zak Patterson (JIRA)
Zak Patterson created SPARK-18844:
-

 Summary: Add more binary classification metrics to 
BinaryClassificationMetrics
 Key: SPARK-18844
 URL: https://issues.apache.org/jira/browse/SPARK-18844
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.0.2
Reporter: Zak Patterson
Priority: Minor
 Fix For: 2.0.2


BinaryClassificationMetrics only implements Precision (positive predictive 
value) and recall (true positive rate). It should implement more comprehensive 
metrics.

Moreover, the instance variables storing computed counts are marked private, 
and there are no accessors for them. So if one desired to add this 
functionality, one would have to duplicate this calculation, which is not 
trivial:

https://github.com/apache/spark/blob/v2.0.2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L144

Currently Implemented Metrics
---
* Precision (PPV): `precisionByThreshold`
* Recall (Sensitivity, true positive rate): `recallByThreshold`

Desired additional metrics
---
* False omission rate: `fprByThreshold`
* False discovery rate: `fdrByThreshold`
* Negative predictive value: `npvByThreshold`
* False negative rate: `fnrByThreshold`
* True negative rate (Specificity): `specificityByThreshold`
* False positive rate: `fprByThreshold`



Alternatives
---

The `createCurve` method is marked private. If it were marked public, and the 
trait BinaryClassificationMetricComputer were also marked public, then it would 
be easy to define new computers to get whatever the user wanted.
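
Since the counts are private, the rates above can still be derived from the raw 
(score, label) pairs. A hedged sketch at a single threshold, with illustrative 
names (this is not the MLlib API, and it computes point estimates rather than 
by-threshold curves):

{code}
import org.apache.spark.rdd.RDD

// Hedged sketch: confusion counts at one threshold, then the requested rates.
def ratesAtThreshold(scoreAndLabels: RDD[(Double, Double)],
                     threshold: Double): Map[String, Double] = {
  val counts = scoreAndLabels.map { case (score, label) =>
    ((score >= threshold, label > 0.5), 1L)
  }.reduceByKey(_ + _).collectAsMap().toMap.withDefaultValue(0L)
  val tp = counts((true, true)).toDouble
  val fp = counts((true, false)).toDouble
  val fn = counts((false, true)).toDouble
  val tn = counts((false, false)).toDouble
  Map(
    "specificity" -> tn / (tn + fp),  // true negative rate
    "fpr"         -> fp / (fp + tn),  // false positive rate
    "fnr"         -> fn / (fn + tp),  // false negative rate
    "npv"         -> tn / (tn + fn),  // negative predictive value
    "fdr"         -> fp / (fp + tp),  // false discovery rate
    "for"         -> fn / (fn + tn)   // false omission rate
  )
}
{code}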






[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-12-13 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745944#comment-15745944
 ] 

Matt Cheah commented on SPARK-18278:


[~rxin] - thanks for thinking about this!

The concerns around testing and support burden are certainly valid. The 
alternatives also come with their own sets of concerns though.

If we publish the scheduler as a library:

The current code in the schedulers is not marked as public API. We would need 
to refactor the scheduler code to make an API the Apache project would support 
for third party use (like the K8s integration). There are (at least) two places 
this needs to be done:

* CoarseGrainedSchedulerBackend would need to become extendable, since all of 
the schedulers (standalone, Mesos, yarn-client, and yarn-cluster) currently 
extend this fairly complex class. The CoarseGrainedSchedulerBackend code 
invokes its pluggable methods (doRequestTotalExecutors and doKillExecutors) 
with particular expectations, and hence these expectations would also have to 
remain stable as long as it is a public API.
* SparkSubmit would need to support third-party cluster managers. Currently 
SparkSubmit's code includes special-case handling for the bundled cluster 
managers; for example, in YARN mode spark-submit accepts --queue to specify the 
queue to run the job with. Thus we would need to make the spark-submit argument 
handling pluggable as well for other cluster managers' parameters.

Off the top of my head, I could think of numerous ways we could expose both of 
these as plugins, but it's not immediately obvious what the best option is.

If we fork the project:

Maintaining a fork places a burden on the fork maintainers to keep the fork up 
to date with the mainline releases. It also leaves unclear how this feature and 
its associated fork relate to the direction of the Spark project as a whole, and 
what the timeline is for eventual re-integration of the fork.  Is there a prior 
example of this approach working in practice in the Spark community?


In either case (library or forking), there's also the question of how we 
encourage alpha testing and early usage of this feature. If the code is not on 
the mainline branch, there needs to be alternative channels outside of the 
Spark releases themselves to announce that this feature is available and that 
we would like feedback on it. It would also be ideal for the code reviews to be 
visible early on, so that everyone that watches the Spark repository can catch 
the updates and progress of this feature.

Having said all of this, I think these issues can be navigated. If I had to 
choose between maintaining a fork versus cleaning up the scheduler to make a 
public API, I would choose the latter in the interest of clarifying the 
relationship between the K8s effort and the mainline project, as well as for 
making the scheduler code cleaner in general. However it's not immediately 
clear if the effort required to make these refactors is worthwhile when we 
could include the K8s scheduler in the Apache releases as an experimental 
feature, ignore its bugs and test failures for the next few releases (that is, 
problems in the K8s-related code should never block releases), and ship this as 
we currently do with YARN and Mesos.

I'd like to hear everyone's thoughts regarding the tradeoffs we are making 
between these different approaches of pushing this feature forward.


> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Commented] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment

2016-12-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745879#comment-15745879
 ] 

Marcelo Vanzin commented on SPARK-18840:


[~jerryshao] are you planning to fix this for 2.0/1.6 also? Or should I close 
the bug?

> HDFSCredentialProvider throws exception in non-HDFS security environment
> 
>
> Key: SPARK-18840
> URL: https://issues.apache.org/jira/browse/SPARK-18840
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> Currently, the logic in {{HDFSCredentialProvider}} assumes an HDFS delegation 
> token exists. This is fine for an HDFS environment, but in some cloud 
> environments such as Azure, HDFS is not required, so it throws an exception:
> {code}
> java.util.NoSuchElementException: head of empty list
> at scala.collection.immutable.Nil$.head(List.scala:337)
> at scala.collection.immutable.Nil$.head(List.scala:334)
> at 
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
> {code}
> We should also consider this situation.
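
A minimal sketch of the kind of guard implied above, with illustrative names 
(not the actual patch): treat "no HDFS delegation tokens" as "no renewal 
interval" instead of calling head on an empty list.

{code}
// Hedged sketch: None when the list of renewal times is empty, e.g. when the
// cluster has no HDFS and therefore no HDFS delegation tokens were obtained.
def tokenRenewalInterval(renewalTimes: Seq[Long]): Option[Long] =
  renewalTimes.reduceOption(_ min _)
{code}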






[jira] [Updated] (SPARK-18835) Do not expose shaded types in JavaTypeInference API

2016-12-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-18835:
---
Fix Version/s: 2.2.0

> Do not expose shaded types in JavaTypeInference API
> ---
>
> Key: SPARK-18835
> URL: https://issues.apache.org/jira/browse/SPARK-18835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> Currently, {{inferDataType(TypeToken)}} is called from a different maven 
> module, and because we shade Guava, that sometimes leads to errors (e.g. when 
> running tests using maven):
> {noformat}
> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.084 sec  
> <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
> at 
> test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
> Results :
> Tests in error: 
>   JavaUDFSuite.udf3Test:107 » NoSuchMethod 
> org.apache.spark.sql.catalyst.JavaTyp...
> {noformat}
> Instead, we shouldn't expose Guava types in these APIs.






[jira] [Updated] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment

2016-12-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-18840:
---
Fix Version/s: 2.2.0
   2.1.1

> HDFSCredentialProvider throws exception in non-HDFS security environment
> 
>
> Key: SPARK-18840
> URL: https://issues.apache.org/jira/browse/SPARK-18840
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> Currently, the logic in {{HDFSCredentialProvider}} assumes an HDFS delegation 
> token exists. This is fine for an HDFS environment, but in some cloud 
> environments such as Azure, HDFS is not required, so it throws an exception:
> {code}
> java.util.NoSuchElementException: head of empty list
> at scala.collection.immutable.Nil$.head(List.scala:337)
> at scala.collection.immutable.Nil$.head(List.scala:334)
> at 
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
> {code}
> We should also consider this situation.






[jira] [Updated] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment

2016-12-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-18840:
---
Assignee: Saisai Shao

> HDFSCredentialProvider throws exception in non-HDFS security environment
> 
>
> Key: SPARK-18840
> URL: https://issues.apache.org/jira/browse/SPARK-18840
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> Currently, the logic in {{HDFSCredentialProvider}} assumes an HDFS delegation 
> token exists. This is fine for an HDFS environment, but in some cloud 
> environments such as Azure, HDFS is not required, so it throws an exception:
> {code}
> java.util.NoSuchElementException: head of empty list
> at scala.collection.immutable.Nil$.head(List.scala:337)
> at scala.collection.immutable.Nil$.head(List.scala:334)
> at 
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
> {code}
> We should also consider this situation.






[jira] [Assigned] (SPARK-18843) Fix timeout in awaitResultInForkJoinSafely

2016-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18843:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix timeout in awaitResultInForkJoinSafely
> --
>
> Key: SPARK-18843
> URL: https://issues.apache.org/jira/browse/SPARK-18843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Master has the fix in https://github.com/apache/spark/pull/16230. However, 
> since we are not merging that PR into 2.0 and 2.1 because it's too risky, we 
> should at least fix the timeout value for those branches.






[jira] [Commented] (SPARK-18843) Fix timeout in awaitResultInForkJoinSafely

2016-12-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745871#comment-15745871
 ] 

Apache Spark commented on SPARK-18843:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16268

> Fix timeout in awaitResultInForkJoinSafely
> --
>
> Key: SPARK-18843
> URL: https://issues.apache.org/jira/browse/SPARK-18843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Master has the fix in https://github.com/apache/spark/pull/16230. However, 
> since we are not merging that PR into 2.0 and 2.1 because it's too risky, we 
> should at least fix the timeout value for those branches.






[jira] [Assigned] (SPARK-18843) Fix timeout in awaitResultInForkJoinSafely

2016-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18843:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix timeout in awaitResultInForkJoinSafely
> --
>
> Key: SPARK-18843
> URL: https://issues.apache.org/jira/browse/SPARK-18843
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Master has the fix in https://github.com/apache/spark/pull/16230. However, 
> since we are not merging that PR into 2.0 and 2.1 because it's too risky, we 
> should at least fix the timeout value for those branches.






[jira] [Commented] (SPARK-18786) pySpark SQLContext.getOrCreate(sc) take stopped sparkContext

2016-12-13 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745867#comment-15745867
 ] 

Bryan Cutler commented on SPARK-18786:
--

The problem is that {{SQLContext.getOrCreate(sc)}} will not reset even though a 
different {{SparkContext}} is used.  With Spark 2.0, we are moving towards 
{{SparkSession}} so I don't think this is worth fixing.  Using {{SparkSession}} 
like below doesn't seem to have this problem.

{noformat}
import sys
sys.path.insert(1, 'spark/python/')
sys.path.insert(1, 'spark/python/lib/py4j-0.9-src.zip')
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.json(spark.sparkContext.parallelize(['{{ "name": "Adam" }}']))
spark.stop()
spark = SparkSession.builder.getOrCreate()
spark.read.json(spark.sparkContext.parallelize(['{"name": "Adam"}']))
{noformat}

> pySpark SQLContext.getOrCreate(sc) take stopped sparkContext
> 
>
> Key: SPARK-18786
> URL: https://issues.apache.org/jira/browse/SPARK-18786
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Alex Liu
>
> The following steps reproduce the issue:
> {code}
> import sys
> sys.path.insert(1, 'spark/python/')
> sys.path.insert(1, 'spark/python/lib/py4j-0.9-src.zip')
> from pyspark import SparkContext, SQLContext
> sc = SparkContext.getOrCreate()
> sqlContext = SQLContext.getOrCreate(sc)
> sqlContext.read.json(sc.parallelize(['{"name": "Adam"}'])).show()
> sc.stop()
> sc = SparkContext.getOrCreate()
> sqlContext = SQLContext.getOrCreate(sc)
> sqlContext.read.json(sc.parallelize(['{"name": "Adam"}'])).show()
> {code}
> It produces the following error after the last command:
> {code}
> >>> sqlContext.read.json(sc.parallelize(['{"name": "Adam"}'])).show()
> Traceback (most recent call last):
>   
>   File "", line 1, in 
>   File "spark/python/pyspark/sql/dataframe.py", line 257, in show
> print(self._jdf.showString(n, truncate))
>   File "spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in 
> __call__
>   File "spark/python/pyspark/sql/utils.py", line 45, in deco
> return f(*a, **kw)
>   File "spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in 
> get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o435.showString.
> : java.lang.IllegalStateException: Cannot call methods on a stopped 
> SparkContext.
> This stopped SparkContext was created at:
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
> py4j.Gateway.invoke(Gateway.java:214)
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
> py4j.GatewayConnection.run(GatewayConnection.java:209)
> java.lang.Thread.run(Thread.java:745)
> The currently active SparkContext was created at:
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
> py4j.Gateway.invoke(Gateway.java:214)
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
> py4j.GatewayConnection.run(GatewayConnection.java:209)
> java.lang.Thread.run(Thread.java:745)
>  
>   at 
> org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:106)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1325)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:126)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 

[jira] [Created] (SPARK-18843) Fix timeout in awaitResultInForkJoinSafely

2016-12-13 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18843:


 Summary: Fix timeout in awaitResultInForkJoinSafely
 Key: SPARK-18843
 URL: https://issues.apache.org/jira/browse/SPARK-18843
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.2, 2.1.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Master has the fix in https://github.com/apache/spark/pull/16230. However, 
since that PR is too risky to merge into branch-2.0 and branch-2.1, we should 
at least fix the timeout value for 2.0 and 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18835) Do not expose shaded types in JavaTypeInference API

2016-12-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18835.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.1.1

> Do not expose shaded types in JavaTypeInference API
> ---
>
> Key: SPARK-18835
> URL: https://issues.apache.org/jira/browse/SPARK-18835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.1
>
>
> Currently, {{inferDataType(TypeToken)}} is called from a different maven 
> module, and because we shade Guava, that sometimes leads to errors (e.g. when 
> running tests using maven):
> {noformat}
> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.084 sec  
> <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
> at 
> test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
> Results :
> Tests in error: 
>   JavaUDFSuite.udf3Test:107 » NoSuchMethod 
> org.apache.spark.sql.catalyst.JavaTyp...
> {noformat}
> Instead, we shouldn't expose Guava types in these APIs.
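
To illustrate the general pattern (a sketch only, not the actual Spark change; 
the object and method names below are made up for illustration): keep the 
Guava {{TypeToken}} on a private method and expose only JDK types across the 
module boundary, so shading {{com.google.common}} cannot break callers that 
were compiled against the unshaded class name.

{code}
import com.google.common.reflect.TypeToken

object TypeInferenceFacade {
  // Public entry point: only JDK types (java.lang.Class) appear in the signature.
  def describe(beanClass: Class[_]): String =
    describe(TypeToken.of(beanClass))

  // The (possibly shaded) Guava type stays private to this module.
  private def describe(tt: TypeToken[_]): String =
    if (tt.isArray) s"array of ${tt.getComponentType}" else tt.getRawType.getName
}
{code}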



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-12-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-13747.
--
   Resolution: Fixed
Fix Version/s: (was: 2.0.2)
   (was: 2.1.0)
   2.2.0

Issue resolved by pull request 16230
[https://github.com/apache/spark/pull/16230]

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as 
> it calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, the ForkJoinPool may run another 
> task on the same thread, and that task then sees the polluted thread-local 
> properties.
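
As a user-side workaround sketch (illustrative only, and assuming a 
{{SparkSession}} named {{spark}}; this is not part of the fix in the PR 
above), driving the concurrent jobs from a plain fixed-size thread pool 
instead of a ForkJoinPool avoids the thread reuse that pollutes the 
thread-local properties:

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// A fixed thread pool never runs a second task on a thread that is blocked in
// Await, so each job keeps its own thread-local execution id.
implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))
import spark.implicits._

val jobs = (1 to 100).map { _ =>
  Future {
    spark.sparkContext.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b").count()
  }
}
jobs.foreach(f => Await.result(f, Duration.Inf))
{code}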



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18675) CTAS for hive serde table should work for all hive versions

2016-12-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-18675.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16104
[https://github.com/apache/spark/pull/16104]

> CTAS for hive serde table should work for all hive versions
> ---
>
> Key: SPARK-18675
> URL: https://issues.apache.org/jira/browse/SPARK-18675
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?

2016-12-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745633#comment-15745633
 ] 

Shivaram Venkataraman commented on SPARK-18823:
---

Thanks [~masip85] for verifying this. I think, as [~felixcheung] pointed out, 
there are two separate issues we can file as feature requests:

1. Supporting assignment of DataFrame columns in `[` and `[[` -- this should be 
pretty straightforward, I'd guess.

2. Supporting assignment of a local R column using `$` and / or `[[` -- this 
one I'm less sure about because it will involve determining types, serializing 
data from local R, splitting it into the existing DataFrame, etc. Also, at a 
higher level, if the DataFrame has 100M rows then it might not be efficient to 
ship that much data.

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
> Fix For: 2.0.2
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or whether it can be done with some function:
> Sometimes it is very important to assign something to a column whose name has 
> to be accessed through a variable. Outside of SparkR, I have always done this 
> with double brackets, like this:
> # df could be the faithful dataset as a normal data frame or data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even by column number
> df[[2]] <- df$eruptions
> The error is not caused by the right side of the "<-" assignment operator. 
> The problem is that I can't assign to a column by name through a variable, or 
> by column number, as I do in these examples outside of Spark. It doesn't 
> matter whether I am modifying or creating the column; the problem is the same.
> I have also tried this, with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows

2016-12-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-18842:
-
Priority: Major  (was: Minor)

> De-duplicate paths in classpaths in processes for local-cluster mode to work 
> around the length limitation on Windows
> 
>
> Key: SPARK-18842
> URL: https://issues.apache.org/jira/browse/SPARK-18842
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Hyukjin Kwon
>
> Currently, some tests are failing and hanging on Windows due to this 
> problem. For the reason described in SPARK-18718, some tests using 
> {{local-cluster}} mode were disabled on Windows because of the length limit 
> on the paths given as classpaths.
> The limit seems to be roughly 32K characters (see 
> https://blogs.msdn.microsoft.com/oldnewthing/20031210-00/?p=41553/ and 
> https://support.thoughtworks.com/hc/en-us/articles/213248526-Getting-around-maximum-command-line-length-is-32767-characters-on-Windows),
> but executors were being launched (only in tests) with a command such as 
> https://gist.github.com/HyukjinKwon/5bc81061c250d4af5a180869b59d42ea.
> That command is roughly 40K characters long, mostly due to the classpath, and 
> it seems more than half of the entries are duplicates. So, if we de-duplicate 
> these paths, the length is reduced to roughly 20K.
> We may need to revisit this as more paths are added in the future, but this 
> seems better than disabling all the tests, and it keeps the changes minimal.
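
A minimal sketch of the de-duplication idea (illustrative only, not the actual 
patch): split the classpath on the platform separator, keep the first 
occurrence of each entry, and rebuild the string.

{code}
import java.io.File

def dedupClasspath(cp: String): String =
  cp.split(File.pathSeparator).filter(_.nonEmpty).distinct.mkString(File.pathSeparator)

// For example, on Windows: dedupClasspath("a.jar;b.jar;a.jar") == "a.jar;b.jar"
{code}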



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3012) Standardized Distance Functions between two Vectors for MLlib

2016-12-13 Thread Dhaval Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745388#comment-15745388
 ] 

Dhaval Modi commented on SPARK-3012:


I have implemented Mahalanobis distance in Spark 1.6.2+ using the Breeze 0.12 
library. The Scala code is available at:
https://github.com/dhmodi/MahalanobisDistance
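
For reference, a minimal sketch of the Mahalanobis distance itself using 
Breeze (not taken from the linked repository; the function and argument names 
are illustrative):

{code}
import breeze.linalg.{inv, DenseMatrix, DenseVector}

// Distance of x from a distribution with mean mu and covariance sigma:
// sqrt((x - mu)^T * inv(sigma) * (x - mu))
def mahalanobis(x: DenseVector[Double],
                mu: DenseVector[Double],
                sigma: DenseMatrix[Double]): Double = {
  val diff = x - mu
  math.sqrt(diff.t * inv(sigma) * diff)
}
{code}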

> Standardized Distance Functions between two Vectors for MLlib
> -
>
> Key: SPARK-3012
> URL: https://issues.apache.org/jira/browse/SPARK-3012
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yu Ishikawa
>Priority: Minor
>
> Most of the clustering algorithms need distance functions between two Vectors.
> We should include a standardized distance function library in MLlib.
> I think that standardized distance functions would help us to implement more 
> machine learning algorithms efficiently.
> h3. For example
> - Chebyshev Distance
> - Cosine Distance
> - Euclidean Distance
> - Mahalanobis Distance
> - Manhattan Distance
> - Minkowski Distance
> - SquaredEuclidean Distance
> - Tanimoto Distance
> - Weighted Distance
> - WeightedEuclidean Distance
> - WeightedManhattan Distance



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows

2016-12-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745350#comment-15745350
 ] 

Hyukjin Kwon commented on SPARK-18842:
--

This completes the parent task and lets me make much more progress.

> De-duplicate paths in classpaths in processes for local-cluster mode to work 
> around the length limitation on Windows
> 
>
> Key: SPARK-18842
> URL: https://issues.apache.org/jira/browse/SPARK-18842
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, some tests are failing and hanging on Windows due to this 
> problem. For the reason described in SPARK-18718, some tests using 
> {{local-cluster}} mode were disabled on Windows because of the length limit 
> on the paths given as classpaths.
> The limit seems to be roughly 32K characters (see 
> https://blogs.msdn.microsoft.com/oldnewthing/20031210-00/?p=41553/ and 
> https://support.thoughtworks.com/hc/en-us/articles/213248526-Getting-around-maximum-command-line-length-is-32767-characters-on-Windows),
> but executors were being launched (only in tests) with a command such as 
> https://gist.github.com/HyukjinKwon/5bc81061c250d4af5a180869b59d42ea.
> That command is roughly 40K characters long, mostly due to the classpath, and 
> it seems more than half of the entries are duplicates. So, if we de-duplicate 
> these paths, the length is reduced to roughly 20K.
> We may need to revisit this as more paths are added in the future, but this 
> seems better than disabling all the tests, and it keeps the changes minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18841) PushProjectionThroughUnion exception when there are same column

2016-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18841:


Assignee: (was: Apache Spark)

> PushProjectionThroughUnion exception when there are same column
> ---
>
> Key: SPARK-18841
> URL: https://issues.apache.org/jira/browse/SPARK-18841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Song Jun
>
> {noformat}
> DROP TABLE IF EXISTS p1 ;
> DROP TABLE IF EXISTS p2 ;
> DROP TABLE IF EXISTS p3 ;
> CREATE TABLE p1 (col STRING) ;
> CREATE TABLE p2 (col STRING) ;
> CREATE TABLE p3 (col STRING) ;
> set spark.sql.crossJoin.enabled = true;
> SELECT
>   1 as cste,
>   col
> FROM (
>   SELECT
> col as col
>   FROM (
> SELECT
>   p1.col as col
> FROM p1
> LEFT JOIN p2 
> UNION ALL
> SELECT
>   col
> FROM p3
>   ) T1
> ) T2
> ;
> {noformat}
> it will throw an exception:
> {noformat}
> key not found: col#16
> java.util.NoSuchElementException: key not found: col#16
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at 
> org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at 
> org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:376)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18841) PushProjectionThroughUnion exception when there are same column

2016-12-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745339#comment-15745339
 ] 

Apache Spark commented on SPARK-18841:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/16267

> PushProjectionThroughUnion exception when there are same column
> ---
>
> Key: SPARK-18841
> URL: https://issues.apache.org/jira/browse/SPARK-18841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Song Jun
>
> {noformat}
> DROP TABLE IF EXISTS p1 ;
> DROP TABLE IF EXISTS p2 ;
> DROP TABLE IF EXISTS p3 ;
> CREATE TABLE p1 (col STRING) ;
> CREATE TABLE p2 (col STRING) ;
> CREATE TABLE p3 (col STRING) ;
> set spark.sql.crossJoin.enabled = true;
> SELECT
>   1 as cste,
>   col
> FROM (
>   SELECT
> col as col
>   FROM (
> SELECT
>   p1.col as col
> FROM p1
> LEFT JOIN p2 
> UNION ALL
> SELECT
>   col
> FROM p3
>   ) T1
> ) T2
> ;
> {noformat}
> it will throw an exception:
> {noformat}
> key not found: col#16
> java.util.NoSuchElementException: key not found: col#16
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at 
> org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at 
> org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:376)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18841) PushProjectionThroughUnion exception when there are same column

2016-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18841:


Assignee: Apache Spark

> PushProjectionThroughUnion exception when there are same column
> ---
>
> Key: SPARK-18841
> URL: https://issues.apache.org/jira/browse/SPARK-18841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Song Jun
>Assignee: Apache Spark
>
> {noformat}
> DROP TABLE IF EXISTS p1 ;
> DROP TABLE IF EXISTS p2 ;
> DROP TABLE IF EXISTS p3 ;
> CREATE TABLE p1 (col STRING) ;
> CREATE TABLE p2 (col STRING) ;
> CREATE TABLE p3 (col STRING) ;
> set spark.sql.crossJoin.enabled = true;
> SELECT
>   1 as cste,
>   col
> FROM (
>   SELECT
> col as col
>   FROM (
> SELECT
>   p1.col as col
> FROM p1
> LEFT JOIN p2 
> UNION ALL
> SELECT
>   col
> FROM p3
>   ) T1
> ) T2
> ;
> {noformat}
> it will throw an exception:
> {noformat}
> key not found: col#16
> java.util.NoSuchElementException: key not found: col#16
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at 
> org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at 
> org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378)
> at 
> org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:376)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18841) PushProjectionThroughUnion exception when there are same column

2016-12-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18841:
--
Description: 
{noformat}
DROP TABLE IF EXISTS p1 ;
DROP TABLE IF EXISTS p2 ;
DROP TABLE IF EXISTS p3 ;
CREATE TABLE p1 (col STRING) ;
CREATE TABLE p2 (col STRING) ;
CREATE TABLE p3 (col STRING) ;

set spark.sql.crossJoin.enabled = true;

SELECT
  1 as cste,
  col
FROM (
  SELECT
col as col
  FROM (
SELECT
  p1.col as col
FROM p1
LEFT JOIN p2 
UNION ALL
SELECT
  col
FROM p3
  ) T1
) T2
;
{noformat}

it will throw an exception:
{noformat}
key not found: col#16
java.util.NoSuchElementException: key not found: col#16
at scala.collection.MapLike$class.default(MapLike.scala:228)
at 
org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at 
org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:376)
{noformat}

  was:
DROP TABLE IF EXISTS p1 ;
DROP TABLE IF EXISTS p2 ;
DROP TABLE IF EXISTS p3 ;
CREATE TABLE p1 (col STRING) ;
CREATE TABLE p2 (col STRING) ;
CREATE TABLE p3 (col STRING) ;

set spark.sql.crossJoin.enabled = true;

SELECT
  1 as cste,
  col
FROM (
  SELECT
col as col
  FROM (
SELECT
  p1.col as col
FROM p1
LEFT JOIN p2 
UNION ALL
SELECT
  col
FROM p3
  ) T1
) T2
;

it will throw exception:


key not found: col#16
java.util.NoSuchElementException: key not found: col#16
at scala.collection.MapLike$class.default(MapLike.scala:228)
at 
org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at 
org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
 

[jira] [Assigned] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows

2016-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18842:


Assignee: Apache Spark

> De-duplicate paths in classpaths in processes for local-cluster mode to work 
> around the length limitation on Windows
> 
>
> Key: SPARK-18842
> URL: https://issues.apache.org/jira/browse/SPARK-18842
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, some tests are failing and hanging on Windows due to this 
> problem. For the reason described in SPARK-18718, some tests using 
> {{local-cluster}} mode were disabled on Windows because of the length limit 
> on the paths given as classpaths.
> The limit seems to be roughly 32K characters (see 
> https://blogs.msdn.microsoft.com/oldnewthing/20031210-00/?p=41553/ and 
> https://support.thoughtworks.com/hc/en-us/articles/213248526-Getting-around-maximum-command-line-length-is-32767-characters-on-Windows),
> but executors were being launched (only in tests) with a command such as 
> https://gist.github.com/HyukjinKwon/5bc81061c250d4af5a180869b59d42ea.
> That command is roughly 40K characters long, mostly due to the classpath, and 
> it seems more than half of the entries are duplicates. So, if we de-duplicate 
> these paths, the length is reduced to roughly 20K.
> We may need to revisit this as more paths are added in the future, but this 
> seems better than disabling all the tests, and it keeps the changes minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows

2016-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18842:


Assignee: (was: Apache Spark)

> De-duplicate paths in classpaths in processes for local-cluster mode to work 
> around the length limitation on Windows
> 
>
> Key: SPARK-18842
> URL: https://issues.apache.org/jira/browse/SPARK-18842
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, some tests are failing and hanging on Windows due to this 
> problem. For the reason described in SPARK-18718, some tests using 
> {{local-cluster}} mode were disabled on Windows because of the length limit 
> on the paths given as classpaths.
> The limit seems to be roughly 32K characters (see 
> https://blogs.msdn.microsoft.com/oldnewthing/20031210-00/?p=41553/ and 
> https://support.thoughtworks.com/hc/en-us/articles/213248526-Getting-around-maximum-command-line-length-is-32767-characters-on-Windows),
> but executors were being launched (only in tests) with a command such as 
> https://gist.github.com/HyukjinKwon/5bc81061c250d4af5a180869b59d42ea.
> That command is roughly 40K characters long, mostly due to the classpath, and 
> it seems more than half of the entries are duplicates. So, if we de-duplicate 
> these paths, the length is reduced to roughly 20K.
> We may need to revisit this as more paths are added in the future, but this 
> seems better than disabling all the tests, and it keeps the changes minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows

2016-12-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745319#comment-15745319
 ] 

Apache Spark commented on SPARK-18842:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16266

> De-duplicate paths in classpaths in processes for local-cluster mode to work 
> around the length limitation on Windows
> 
>
> Key: SPARK-18842
> URL: https://issues.apache.org/jira/browse/SPARK-18842
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, some tests are failing and hanging on Windows due to this 
> problem. For the reason described in SPARK-18718, some tests using 
> {{local-cluster}} mode were disabled on Windows because of the length limit 
> on the paths given as classpaths.
> The limit seems to be roughly 32K characters (see 
> https://blogs.msdn.microsoft.com/oldnewthing/20031210-00/?p=41553/ and 
> https://support.thoughtworks.com/hc/en-us/articles/213248526-Getting-around-maximum-command-line-length-is-32767-characters-on-Windows),
> but executors were being launched (only in tests) with a command such as 
> https://gist.github.com/HyukjinKwon/5bc81061c250d4af5a180869b59d42ea.
> That command is roughly 40K characters long, mostly due to the classpath, and 
> it seems more than half of the entries are duplicates. So, if we de-duplicate 
> these paths, the length is reduced to roughly 20K.
> We may need to revisit this as more paths are added in the future, but this 
> seems better than disabling all the tests, and it keeps the changes minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18609) [SQL] column mixup with CROSS JOIN

2016-12-13 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15745309#comment-15745309
 ] 

Song Jun commented on SPARK-18609:
--

Opened another JIRA for this: SPARK-18841.

> [SQL] column mixup with CROSS JOIN
> --
>
> Key: SPARK-18609
> URL: https://issues.apache.org/jira/browse/SPARK-18609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Furcy Pin
>
> Reproduced on spark-sql v2.0.2 and on branch master.
> {code}
> DROP TABLE IF EXISTS p1 ;
> DROP TABLE IF EXISTS p2 ;
> CREATE TABLE p1 (col TIMESTAMP) ;
> CREATE TABLE p2 (col TIMESTAMP) ;
> set spark.sql.crossJoin.enabled = true;
> -- EXPLAIN
> WITH CTE AS (
>   SELECT
> s2.col as col
>   FROM p1
>   CROSS JOIN (
> SELECT
>   e.col as col
> FROM p2 E
>   ) s2
> )
> SELECT
>   T1.col as c1,
>   T2.col as c2
> FROM CTE T1
> CROSS JOIN CTE T2
> ;
> {code}
> This returns the following stacktrace :
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: col#21
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> 

[jira] [Created] (SPARK-18842) De-duplicate paths in classpaths in processes for local-cluster mode to work around the length limitation on Windows

2016-12-13 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18842:


 Summary: De-duplicate paths in classpaths in processes for 
local-cluster mode to work around the length limitation on Windows
 Key: SPARK-18842
 URL: https://issues.apache.org/jira/browse/SPARK-18842
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Tests
Reporter: Hyukjin Kwon
Priority: Minor


Currently, some tests are failing and hanging on Windows due to this problem. 
For the reason described in SPARK-18718, some tests using {{local-cluster}} 
mode were disabled on Windows because of the length limit on the paths given 
as classpaths.

The limit seems to be roughly 32K characters (see 
https://blogs.msdn.microsoft.com/oldnewthing/20031210-00/?p=41553/ and 
https://support.thoughtworks.com/hc/en-us/articles/213248526-Getting-around-maximum-command-line-length-is-32767-characters-on-Windows),
but executors were being launched (only in tests) with a command such as 
https://gist.github.com/HyukjinKwon/5bc81061c250d4af5a180869b59d42ea.

That command is roughly 40K characters long, mostly due to the classpath, and 
it seems more than half of the entries are duplicates. So, if we de-duplicate 
these paths, the length is reduced to roughly 20K.

We may need to revisit this as more paths are added in the future, but this 
seems better than disabling all the tests, and it keeps the changes minimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18841) PushProjectionThroughUnion exception when there are same column

2016-12-13 Thread Song Jun (JIRA)
Song Jun created SPARK-18841:


 Summary: PushProjectionThroughUnion exception when there are same 
column
 Key: SPARK-18841
 URL: https://issues.apache.org/jira/browse/SPARK-18841
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2, 2.1.0
Reporter: Song Jun


DROP TABLE IF EXISTS p1 ;
DROP TABLE IF EXISTS p2 ;
DROP TABLE IF EXISTS p3 ;
CREATE TABLE p1 (col STRING) ;
CREATE TABLE p2 (col STRING) ;
CREATE TABLE p3 (col STRING) ;

set spark.sql.crossJoin.enabled = true;

SELECT
  1 as cste,
  col
FROM (
  SELECT
col as col
  FROM (
SELECT
  p1.col as col
FROM p1
LEFT JOIN p2 
UNION ALL
SELECT
  col
FROM p3
  ) T1
) T2
;

it will throw an exception:


key not found: col#16
java.util.NoSuchElementException: key not found: col#16
at scala.collection.MapLike$class.default(MapLike.scala:228)
at 
org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:31)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at 
org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:31)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:346)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$2.applyOrElse(Optimizer.scala:345)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:74)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:281)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$.org$apache$spark$sql$catalyst$optimizer$PushProjectionThroughUnion$$pushToRight(Optimizer.scala:345)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8$$anonfun$apply$31.apply(Optimizer.scala:378)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:378)
at 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion$$anonfun$apply$4$$anonfun$8.apply(Optimizer.scala:376)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17890) scala.ScalaReflectionException

2016-12-13 Thread Arkadiusz Komarzewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744964#comment-15744964
 ] 

Arkadiusz Komarzewski commented on SPARK-17890:
---

I hit this error recently, checked 2.1 (build 2016_12_12_03_37-63693c1-bin) - 
it's still there.
I can try fixing this - can you give me some tips on what to look for?

> scala.ScalaReflectionException
> --
>
> Key: SPARK-17890
> URL: https://issues.apache.org/jira/browse/SPARK-17890
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: x86_64 GNU/Linux
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>Reporter: Khalid Reid
>Priority: Minor
>  Labels: newbie
>
> Hello,
> I am seeing an error message in spark-shell when I map a DataFrame to a 
> Seq\[Foo\].  However, things work fine when I use flatMap.  
> {noformat}
> scala> case class Foo(value:String)
> defined class Foo
> scala> val df = sc.parallelize(List(1,2,3)).toDF
> df: org.apache.spark.sql.DataFrame = [value: int]
> scala> df.map{x => Seq.empty[Foo]}
> scala.ScalaReflectionException: object $line14.$read not found.
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
>   at $typecreator1$1.apply(<console>:29)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
>   ... 48 elided
> scala> df.flatMap{_ => Seq.empty[Foo]} //flatMap works
> res2: org.apache.spark.sql.Dataset[Foo] = [value: string]
> {noformat}
> I am seeing the same error reported 
> [here|https://issues.apache.org/jira/browse/SPARK-8465?jql=text%20~%20%22scala.ScalaReflectionException%22]
>  when I use spark-submit.
> I am new to Spark but I don't expect this to throw an exception.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18796) StreamingQueryManager should not hold a lock when starting a query

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18796:
--
Assignee: Shixiong Zhu

> StreamingQueryManager should not hold a lock when starting a query
> --
>
> Key: SPARK-18796
> URL: https://issues.apache.org/jira/browse/SPARK-18796
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.0
>
>
> Otherwise, the user cannot start any queries when a query is starting. If a 
> query takes a long time to start, the user experience will be pretty bad.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18797) Update spark.logit in sparkr-vignettes

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18797:
--
Assignee: Miao Wang

> Update spark.logit in sparkr-vignettes
> --
>
> Key: SPARK-18797
> URL: https://issues.apache.org/jira/browse/SPARK-18797
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.1.1, 2.2.0
>
>
> spark.logit was added in 2.1. We need to update sparkr-vignettes to reflect 
> the changes. This is part of the SparkR QA work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18642) Spark SQL: Catalyst is scanning undesired columns

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18642:
--
Assignee: Dongjoon Hyun

> Spark SQL: Catalyst is scanning undesired columns
> -
>
> Key: SPARK-18642
> URL: https://issues.apache.org/jira/browse/SPARK-18642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
> Environment: Ubuntu 14.04
> Spark: Local Mode
>Reporter: Mohit
>Assignee: Dongjoon Hyun
>  Labels: performance
> Fix For: 2.0.0
>
>
> When doing a left-join between two tables, say A and B,  Catalyst has 
> information about the projection required for table B. Only the required 
> columns should be scanned.
> Code snippet below explains the scenario:
> scala> val dfA = sqlContext.read.parquet("/home/mohit/ruleA")
> dfA: org.apache.spark.sql.DataFrame = [aid: int, aVal: string]
> scala> val dfB = sqlContext.read.parquet("/home/mohit/ruleB")
> dfB: org.apache.spark.sql.DataFrame = [bid: int, bVal: string]
> scala> dfA.registerTempTable("A")
> scala> dfB.registerTempTable("B")
> scala> sqlContext.sql("select A.aid, B.bid from A left join B on A.aid=B.bid 
> where B.bid<2").explain
> == Physical Plan ==
> Project [aid#15,bid#17]
> +- Filter (bid#17 < 2)
>+- BroadcastHashOuterJoin [aid#15], [bid#17], LeftOuter, None
>   :- Scan ParquetRelation[aid#15,aVal#16] InputPaths: 
> file:/home/mohit/ruleA
>   +- Scan ParquetRelation[bid#17,bVal#18] InputPaths: 
> file:/home/mohit/ruleB
> This is a watered-down example from a production issue which has a huge 
> performance impact.
> External reference: 
> http://stackoverflow.com/questions/40783675/spark-sql-catalyst-is-scanning-undesired-columns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18752) "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18752:
--
Assignee: Marcelo Vanzin

> "isSrcLocal" parameter to Hive loadTable / loadPartition should come from user
> --
>
> Key: SPARK-18752
> URL: https://issues.apache.org/jira/browse/SPARK-18752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.2.0
>
>
> We ran into an issue with the HiveShim code that calls "loadTable" and 
> "loadPartition" while testing with some recent changes in upstream Hive.
> The semantics in Hive changed slightly, and if you provide the wrong value 
> for "isSrcLocal" you now can end up with an invalid table: the Hive code will 
> move the temp directory to the final destination instead of moving its 
> children.
> The problem in Spark is that HiveShim.scala tries to figure out the value of 
> "isSrcLocal" based on where the source and target directories are; that's not 
> correct. "isSrcLocal" should be set based on the user query (e.g. "LOAD DATA 
> LOCAL" would set it to "true"). So we need to propagate that information from 
> the user query down to HiveShim.
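
A tiny sketch of the intended flow (hypothetical names, not the real Spark or 
Hive shim signatures): the LOCAL keyword parsed from the user's LOAD DATA 
statement is what decides "isSrcLocal", and that flag is passed straight down 
to the shim instead of being inferred from the paths.

{code}
// Hypothetical command and shim, for illustration only.
case class LoadDataCommand(path: String, isLocal: Boolean, isOverwrite: Boolean)

def runLoadData(cmd: LoadDataCommand,
                shimLoadTable: (String, Boolean, Boolean) => Unit): Unit = {
  // Propagate the user's intent; do not guess from where the path lives.
  shimLoadTable(cmd.path, cmd.isOverwrite, cmd.isLocal)
}
{code}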



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?

2016-12-13 Thread Vicente Masip (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744740#comment-15744740
 ] 

Vicente Masip commented on SPARK-18823:
---

Yes, I've been able to do it with your suggestion.

In my case, I use it inside a loop over the column names (i is the loop index):

df <- withColumn(df, myColNames[i], cast(df[[i]], "boolean"))

Happy to have found the solution. It would be friendlier if [[ were available 
on the left side. Alternatively, putting this example in the documentation 
would be a solution too: it should be clear that [[ works on the right side 
but not on the left, and that withColumn is the solution if you need it.

I also think that if x$y <- t$q assignment is available, it should be 
available with all its consequences. Am I wrong?

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
> Fix For: 2.0.2
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or whether it can be done with some function:
> Sometimes it is very important to assign something to a column whose name has 
> to be accessed through a variable. Outside of SparkR, I have always done this 
> with double brackets, like this:
> # df could be the faithful dataset as a normal data frame or data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even by column number
> df[[2]] <- df$eruptions
> The error is not caused by the right side of the "<-" assignment operator. 
> The problem is that I can't assign to a column by name through a variable, or 
> by column number, as I do in these examples outside of Spark. It doesn't 
> matter whether I am modifying or creating the column; the problem is the same.
> I have also tried this, with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18650) race condition in FileScanRDD.scala

2016-12-13 Thread Soumabrata Chakraborty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744695#comment-15744695
 ] 

Soumabrata Chakraborty edited comment on SPARK-18650 at 12/13/16 9:43 AM:
--

I am facing the same issue while trying to load a CSV. I am using 
org.apache.spark:spark-core_2.10:2.0.2 and 
org.apache.spark:spark-sql_2.10:2.0.2, with the following to read the CSV:

Dataset dataset = sparkSession.read().schema(schema).options(options).csv(path);

The options contain "header": "false" and "sep": "|".

The issue is also reproduced on versions 2.0.0 and 2.0.1 in addition to 2.0.2


was (Author: soumabrata):
I am facing the same issue while trying to load a csv. I am using 
org.apache.spark:spark-core_2.10:2.0.2 and using the following to read the csv
Dataset dataset = 
sparkSession.read().schema(schema).options(options).csv(path)  ;
The options contain "header":"false" and "sep":"|"

> race condition in FileScanRDD.scala
> ---
>
> Key: SPARK-18650
> URL: https://issues.apache.org/jira/browse/SPARK-18650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: scala 2.11
> macos 10.11.6
>Reporter: Jay Goldman
>
> I am attempting to create a DataSet from a single CSV file :
>  val ss: SparkSession = 
>  val ddr = ss.read.option("path", path)
> ... (choose between xml vs csv parsing)
>  var df = ddr.option("sep", ",")
>   .option("quote", "\"")
>   .option("escape", "\"") // want to retain backslashes (\) ...
>   .option("delimiter", ",")
>   .option("comment", "#")
>   .option("header", "true")
>   .option("format", "csv")
>ddr.csv(path)
> df.count() returns 2 times the number of lines in the CSV file, i.e., each 
> line of the input file shows up as 2 rows in df. 
> Moreover, df.distinct.count returns the correct number of rows.
> There appears to be a problem in FileScanRDD.compute. I am using spark 
> version 2.0.1 with scala 2.11. I am not going to include the entire contents 
> of FileScanRDD.scala here.
> In FileScanRDD.compute there is the following:
>  private[this] val files = split.asInstanceOf[FilePartition].files.toIterator
> If I put a breakpoint in either FileScanRDD.compute or 
> FileScanRDD.nextIterator, the resulting dataset has the correct number of rows.
> Moreover, the code in FileScanRDD.scala is:
> private def nextIterator(): Boolean = {
> updateBytesReadWithFileSize()
> if (files.hasNext) { // breakpoint here => works
>   currentFile = files.next() // breakpoint here => fails
>   
> }
> else {  }
> 
> }
> If I put a breakpoint on the files.hasNext line, all is well; however, if I 
> put a breakpoint on the files.next() line, the code fails when I continue 
> because the files iterator has become empty (see stack trace below). 
> Disabling the breakpoint winds up creating a Dataset with each line of the 
> CSV file duplicated.
> So it appears that multiple threads are using the files iterator or the 
> underlying split value (an RDDPartition), and timing-wise on my system 2 
> workers wind up processing the same file, with the resulting DataSet having 2 
> copies of each of the input lines.
> This code is not active when parsing an XML file. 
> here is stack trace:
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:111)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at 

[jira] [Commented] (SPARK-18650) race condition in FileScanRDD.scala

2016-12-13 Thread Soumabrata Chakraborty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744695#comment-15744695
 ] 

Soumabrata Chakraborty commented on SPARK-18650:


I am facing the same issue while trying to load a CSV. I am using 
org.apache.spark:spark-core_2.10:2.0.2, with the following to read the CSV:

Dataset dataset = sparkSession.read().schema(schema).options(options).csv(path);

The options contain "header": "false" and "sep": "|".

> race condition in FileScanRDD.scala
> ---
>
> Key: SPARK-18650
> URL: https://issues.apache.org/jira/browse/SPARK-18650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: scala 2.11
> macos 10.11.6
>Reporter: Jay Goldman
>
> I am attempting to create a DataSet from a single CSV file :
>  val ss: SparkSession = 
>  val ddr = ss.read.option("path", path)
> ... (choose between xml vs csv parsing)
>  var df = ddr.option("sep", ",")
>   .option("quote", "\"")
>   .option("escape", "\"") // want to retain backslashes (\) ...
>   .option("delimiter", ",")
>   .option("comment", "#")
>   .option("header", "true")
>   .option("format", "csv")
>ddr.csv(path)
> df.count() returns 2 times the number of lines in the CSV file, i.e., each 
> line of the input file shows up as 2 rows in df. 
> Moreover, df.distinct.count returns the correct number of rows.
> There appears to be a problem in FileScanRDD.compute. I am using spark 
> version 2.0.1 with scala 2.11. I am not going to include the entire contents 
> of FileScanRDD.scala here.
> In FileScanRDD.compute there is the following:
>  private[this] val files = split.asInstanceOf[FilePartition].files.toIterator
> If I put a breakpoint in either FileScanRDD.compute or 
> FileScanRDD.nextIterator, the resulting dataset has the correct number of rows.
> Moreover, the code in FileScanRDD.scala is:
> private def nextIterator(): Boolean = {
> updateBytesReadWithFileSize()
> if (files.hasNext) { // breakpoint here => works
>   currentFile = files.next() // breakpoint here => fails
>   
> }
> else {  }
> 
> }
> If I put a breakpoint on the files.hasNext line, all is well; however, if I 
> put a breakpoint on the files.next() line, the code fails when I continue 
> because the files iterator has become empty (see stack trace below). 
> Disabling the breakpoint winds up creating a Dataset with each line of the 
> CSV file duplicated.
> So it appears that multiple threads are using the files iterator or the 
> underlying split value (an RDDPartition), and timing-wise on my system 2 
> workers wind up processing the same file, with the resulting DataSet having 2 
> copies of each of the input lines (see the standalone sketch after the stack trace below).
> This code is not active when parsing an XML file. 
> here is stack trace:
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:111)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/30 09:31:07 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 
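
A standalone sketch of the suspected failure mode, assuming nothing from Spark itself (the thread pool and single-element iterator are illustrative): two threads sharing one unsynchronized iterator can both pass hasNext before either calls next, so an element can be handed out twice or next() can hit an already-exhausted iterator.

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object SharedIteratorSketch {
  def main(args: Array[String]): Unit = {
    // One "file" to hand out, but two workers pulling from the same iterator.
    val files = Iterator("part-00000.csv")
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

    val pulls = (1 to 2).map { _ =>
      Future {
        // Unsynchronized check-then-act: both threads can observe hasNext == true
        // before either calls next(), so the element may be handed out twice, or
        // next() may be called on an iterator that has already been drained.
        if (files.hasNext) Some(files.next()) else None
      }
    }
    println(Await.result(Future.sequence(pulls), 10.seconds))
    sys.exit(0) // the fixed pool's threads are non-daemon; exit explicitly
  }
}
{code}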

[jira] [Comment Edited] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment

2016-12-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744370#comment-15744370
 ] 

Saisai Shao edited comment on SPARK-18840 at 12/13/16 9:30 AM:
---

This problem also existed in branch 1.6, but the fix is a little different 
compared to master.


was (Author: jerryshao):
This problem also existed in branch 1.6, but the fix is a little complicated 
compared to master.

> HDFSCredentialProvider throws exception in non-HDFS security environment
> 
>
> Key: SPARK-18840
> URL: https://issues.apache.org/jira/browse/SPARK-18840
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently in {{HDFSCredentialProvider}}, the code logic assumes that an HDFS delegation 
> token exists. This is fine for an HDFS environment, but in some cloud 
> environments like Azure, HDFS is not required, so it throws an exception:
> {code}
> java.util.NoSuchElementException: head of empty list
> at scala.collection.immutable.Nil$.head(List.scala:337)
> at scala.collection.immutable.Nil$.head(List.scala:334)
> at 
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
> {code}
> We should also consider this situation.
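
A minimal, dependency-free sketch of the kind of guard the description asks for; DelegationToken is a hypothetical stand-in, and the real change would live in HDFSCredentialProvider / Client.getTokenRenewalInterval. The point is to look the HDFS token up optionally instead of calling .head on a possibly empty list.

{code}
object TokenRenewalSketch {
  // Hypothetical stand-in for a Hadoop delegation token.
  final case class DelegationToken(kind: String, renewalInterval: Long)

  // tokens.head throws NoSuchElementException on Nil, which is exactly the
  // failure above when no HDFS token exists (e.g. Azure-backed storage).
  def renewalInterval(tokens: Seq[DelegationToken]): Option[Long] =
    tokens.find(_.kind == "HDFS_DELEGATION_TOKEN").map(_.renewalInterval)

  def main(args: Array[String]): Unit = {
    println(renewalInterval(Nil))                                                   // None, no exception
    println(renewalInterval(Seq(DelegationToken("HDFS_DELEGATION_TOKEN", 86400000L)))) // Some(86400000)
  }
}
{code}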



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment

2016-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18840:


Assignee: (was: Apache Spark)

> HDFSCredentialProvider throws exception in non-HDFS security environment
> 
>
> Key: SPARK-18840
> URL: https://issues.apache.org/jira/browse/SPARK-18840
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently in {{HDFSCredentialProvider}}, the code logic assumes that an HDFS delegation 
> token exists. This is fine for an HDFS environment, but in some cloud 
> environments like Azure, HDFS is not required, so it throws an exception:
> {code}
> java.util.NoSuchElementException: head of empty list
> at scala.collection.immutable.Nil$.head(List.scala:337)
> at scala.collection.immutable.Nil$.head(List.scala:334)
> at 
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
> {code}
> We should also consider this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment

2016-12-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744670#comment-15744670
 ] 

Apache Spark commented on SPARK-18840:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/16265

> HDFSCredentialProvider throws exception in non-HDFS security environment
> 
>
> Key: SPARK-18840
> URL: https://issues.apache.org/jira/browse/SPARK-18840
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Saisai Shao
>Priority: Minor
>
> Currently in {{HDFSCredentialProvider}}, the code logic assumes that an HDFS delegation 
> token exists. This is fine for an HDFS environment, but in some cloud 
> environments like Azure, HDFS is not required, so it throws an exception:
> {code}
> java.util.NoSuchElementException: head of empty list
> at scala.collection.immutable.Nil$.head(List.scala:337)
> at scala.collection.immutable.Nil$.head(List.scala:334)
> at 
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
> {code}
> We should also consider this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18840) HDFSCredentialProvider throws exception in non-HDFS security environment

2016-12-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18840:


Assignee: Apache Spark

> HDFSCredentialProvider throws exception in non-HDFS security environment
> 
>
> Key: SPARK-18840
> URL: https://issues.apache.org/jira/browse/SPARK-18840
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> Currently in {{HDFSCredentialProvider}}, the code logic assumes that an HDFS delegation 
> token exists. This is fine for an HDFS environment, but in some cloud 
> environments like Azure, HDFS is not required, so it throws an exception:
> {code}
> java.util.NoSuchElementException: head of empty list
> at scala.collection.immutable.Nil$.head(List.scala:337)
> at scala.collection.immutable.Nil$.head(List.scala:334)
> at 
> org.apache.spark.deploy.yarn.Client.getTokenRenewalInterval(Client.scala:627)
> {code}
> We should also consider this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18837) Very long stage descriptions do not wrap in the UI

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18837:
--
Summary: Very long stage descriptions do not wrap in the UI  (was: It will 
not hidden if job or stage description too long)

> Very long stage descriptions do not wrap in the UI
> --
>
> Key: SPARK-18837
> URL: https://issues.apache.org/jira/browse/SPARK-18837
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Priority: Minor
> Attachments: ui-2.0.0.gif, ui-2.1.0.gif
>
>
> *previous*:
> !ui-2.0.0.gif!
> *current*:
> !ui-2.1.0.gif!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18837) It will not hidden if job or stage description too long

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18837:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

I agree that at least this needs to be wrappable.

> It will not hidden if job or stage description too long
> ---
>
> Key: SPARK-18837
> URL: https://issues.apache.org/jira/browse/SPARK-18837
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Priority: Minor
> Attachments: ui-2.0.0.gif, ui-2.1.0.gif
>
>
> *previous*:
> !ui-2.0.0.gif!
> *current*:
> !ui-2.1.0.gif!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18835) Do not expose shaded types in JavaTypeInference API

2016-12-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744600#comment-15744600
 ] 

Sean Owen commented on SPARK-18835:
---

(Ideally we should fix this for 2.1.0 because it causes the tests to 
consistently fail in at least a couple envs)

> Do not expose shaded types in JavaTypeInference API
> ---
>
> Key: SPARK-18835
> URL: https://issues.apache.org/jira/browse/SPARK-18835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> Currently, {{inferDataType(TypeToken)}} is called from a different maven 
> module, and because we shade Guava, that sometimes leads to errors (e.g. when 
> running tests using maven):
> {noformat}
> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.084 sec  
> <<< ERROR!
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)Lscala/Tuple2;
> at 
> test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
> Results :
> Tests in error: 
>   JavaUDFSuite.udf3Test:107 ยป NoSuchMethod 
> org.apache.spark.sql.catalyst.JavaTyp...
> {noformat}
> Instead, we shouldn't expose Guava types in these APIs.
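
A small sketch of the general principle, illustrative only and not Spark's actual signatures: a cross-module entry point that accepts a JDK type such as java.lang.reflect.Type keeps its JVM signature stable regardless of whether Guava is shaded, whereas exposing TypeToken ties callers to one particular Guava.

{code}
import java.lang.reflect.Type

object ShadingSketch {
  // If this parameter were com.google.common.reflect.TypeToken, the method's
  // JVM signature would change once Guava is relocated by shading, producing
  // the NoSuchMethodError seen above when another module calls it.
  def describe(tpe: Type): String = tpe.getTypeName

  def main(args: Array[String]): Unit = {
    println(describe(classOf[String]))     // java.lang.String
    println(describe(classOf[Array[Int]])) // int[]
  }
}
{code}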



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18804) Join doesn't work in Spark on Bigger tables

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-18804.
-

> Join doesn't work in Spark on Bigger tables
> ---
>
> Key: SPARK-18804
> URL: https://issues.apache.org/jira/browse/SPARK-18804
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: Gopal Nagar
>
> Hi All,
> Spark 1.6.1 has been installed on an AWS EMR 3-node cluster with 32 GB RAM 
> and 80 GB storage per node. I am trying to join two tables (1.2 GB and 900 MB) 
> with 4607818 and 14273378 rows respectively. It's running in client mode on 
> the YARN cluster manager.
> If I put a limit of 100 in the select query it works fine. But if I try to join 
> the entire data set, the query runs for 3-4 hours and finally gets terminated. I 
> can always see 18 GB free on each node.
> I have tried increasing the number of executors/cores/partitions, but it still doesn't 
> work. This has been tried in PySpark and submitted with the spark-submit command, 
> but it doesn't run. Please advise.
> Join Query 
> --
> select * FROM table1 as t1 join table2 as t2 on t1.col = t2.col limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18804) Join doesn't work in Spark on Bigger tables

2016-12-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18804.
---
Resolution: Not A Problem

(Please don't reopen if the discussion has not meaningfully changed. JIRA isn't 
for questions/discussion -- mailing lists are.) I already indicated that you 
need to investigate the basics of why the failure occurred. That detail is not here.

> Join doesn't work in Spark on Bigger tables
> ---
>
> Key: SPARK-18804
> URL: https://issues.apache.org/jira/browse/SPARK-18804
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 1.6.1
>Reporter: Gopal Nagar
>
> Hi All,
> Spark 1.6.1 has been installed on an AWS EMR 3-node cluster with 32 GB RAM 
> and 80 GB storage per node. I am trying to join two tables (1.2 GB and 900 MB) 
> with 4607818 and 14273378 rows respectively. It's running in client mode on 
> the YARN cluster manager.
> If I put a limit of 100 in the select query it works fine. But if I try to join 
> the entire data set, the query runs for 3-4 hours and finally gets terminated. I 
> can always see 18 GB free on each node.
> I have tried increasing the number of executors/cores/partitions, but it still doesn't 
> work. This has been tried in PySpark and submitted with the spark-submit command, 
> but it doesn't run. Please advise.
> Join Query 
> --
> select * FROM table1 as t1 join table2 as t2 on t1.col = t2.col limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?

2016-12-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744519#comment-15744519
 ] 

Shivaram Venkataraman commented on SPARK-18823:
---

Ah, I see your point - `withColumn` does work for this use case? I agree that 
adding this to `[<-` or `[[<-` would be a better user experience.

{code}
> df2 <- withColumn(df, "tmp1", df$eruptions)
> head(df2)
  eruptions waiting  tmp1
1 3.600  79 3.600
...
{code}
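
For comparison only, a Scala sketch of the same pattern in the underlying DataFrame API (not part of the SparkR change being discussed; the local SparkSession and toy data are assumptions): the column name passed to withColumn is just a runtime string, so a variable works directly.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object VariableColumnNameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Toy stand-in for the faithful data set used in the SparkR examples.
    val df = Seq((3.600, 79), (1.800, 54)).toDF("eruptions", "waiting")

    val myName = "tmp1"                                // column name held in a variable
    val df2 = df.withColumn(myName, col("eruptions"))  // same effect as the SparkR call above
    df2.show()

    spark.stop()
  }
}
{code}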

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
> Fix For: 2.0.2
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or whether it can be done with some function:
> Sometimes it is very important to assign something to a column whose name has to 
> be accessed through a variable. Outside of SparkR, I have always used double 
> brackets for this:
> # df could be the faithful data set as a normal data frame or data table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even by column number
> df[[2]] <- df$eruptions
> The error is not caused by the right side of the "<-" assignment operator. 
> The problem is that I can't assign to a column name using a variable or 
> column number as I do in these examples outside of Spark. It doesn't matter whether I am 
> modifying or creating the column; same problem.
> I have also tried the following, with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



<    1   2