[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638555#comment-15638555
 ] 

Cody Koeninger commented on SPARK-18258:


Sure, added. Let me know if I'm missing something or if I can clarify anything.

> Sinks need access to offset representation
> --
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset 
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not 
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as 
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for 
> the starting and ending offsets (either the offsets themselves, or the 
> SPARK-17829 string/json representation).  That would be an API change, but if 
> there's another way to map batch ids to offset representations without 
> changing the Sink api that would work as well.  
> I'm assuming we don't need the same level of access to offsets throughout a 
> job as e.g. the Kafka dstream gives, because Sinks are the main place that 
> should need them.
> After SPARK-17829 is complete and offsets have a .json method, an API for 
> this ticket might look like
> {code}
> trait Sink {
>   def addBatch(batchId: Long, data: DataFrame, start: OffsetSeq, end: OffsetSeq): Unit
> }
> {code}
> where start and end would be provided by StreamExecution.runBatch using 
> committedOffsets and availableOffsets.  
> I'm not 100% certain that the offsets in the seq could always be mapped back 
> to the correct source when restarting complicated multi-source jobs, but I 
> think it'd be sufficient.  Passing the string/json representation of the seq 
> instead of the seq itself would probably be sufficient as well, but the 
> convention of rendering a None as "-" in the json is maybe a little 
> idiosyncratic to parse, and the constant defining that is private.
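For illustration only (this is not part of the proposal quoted above, and the class,
table, and column names are made up): a rough sketch of how a JDBC-backed sink could
use such start/end parameters so that the batch output and the batch's offsets commit
atomically. It assumes the proposed offset type exposes the .json representation from
SPARK-17829; a placeholder case class stands in for it here.

{code}
import java.sql.DriverManager
import org.apache.spark.sql.DataFrame

// Stand-in for the proposed parameter type; the real one would come from Spark
// once SPARK-17829 exposes a json representation of the offsets.
case class OffsetSeq(json: String)

// Hypothetical sink: the batch output and the batch's ending offsets are written
// in one database transaction, so results and progress can never disagree.
class TransactionalJdbcSink(url: String) {
  def addBatch(batchId: Long, data: DataFrame, start: OffsetSeq, end: OffsetSeq): Unit = {
    val conn = DriverManager.getConnection(url)
    try {
      conn.setAutoCommit(false)
      // ... write the rows of `data` to the output table on this connection ...
      val stmt = conn.prepareStatement(
        "INSERT INTO sink_offsets(batch_id, end_offsets) VALUES (?, ?)")
      stmt.setLong(1, batchId)
      stmt.setString(2, end.json)  // recoverable later without the engine's own log
      stmt.executeUpdate()
      conn.commit()                // output and offsets become visible together
    } finally {
      conn.close()
    }
  }
}
{code}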






[jira] [Updated] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-18258:
---
Description: 
Transactional "exactly-once" semantics for output require storing an offset 
identifier in the same transaction as results.

The Sink.addBatch method currently only has access to batchId and data, not the 
actual offset representation.

I want to store the actual offsets, so that they are recoverable as long as the 
results are and I'm not locked in to a particular streaming engine.

I could see this being accomplished by adding parameters to Sink.addBatch for 
the starting and ending offsets (either the offsets themselves, or the 
SPARK-17829 string/json representation).  That would be an API change, but if 
there's another way to map batch ids to offset representations without changing 
the Sink api that would work as well.  

I'm assuming we don't need the same level of access to offsets throughout a job 
as e.g. the Kafka dstream gives, because Sinks are the main place that should 
need them.

After SPARK-17829 is complete and offsets have a .json method, an API for this 
ticket might look like

{code}
trait Sink {
  def addBatch(batchId: Long, data: DataFrame, start: OffsetSeq, end: OffsetSeq): Unit
}
{code}

where start and end would be provided by StreamExecution.runBatch using 
committedOffsets and availableOffsets.

I'm not 100% certain that the offsets in the seq could always be mapped back to 
the correct source when restarting complicated multi-source jobs, but I think 
it'd be sufficient.  Passing the string/json representation of the seq instead 
of the seq itself would probably be sufficient as well, but the convention of 
rendering a None as "-" in the json is maybe a little idiosyncratic to parse, 
and the constant defining that is private.

  was:
Transactional "exactly-once" semantics for output require storing an offset 
identifier in the same transaction as results.

The Sink.addBatch method currently only has access to batchId and data, not the 
actual offset representation.

I want to store the actual offsets, so that they are recoverable as long as the 
results are and I'm not locked in to a particular streaming engine.

I could see this being accomplished by adding parameters to Sink.addBatch for 
the starting and ending offsets (either the offsets themselves, or the 
SPARK-17829 string/json representation).  That would be an API change, but if 
there's another way to map batch ids to offset representations without changing 
the Sink api that would work as well.  

I'm assuming we don't need the same level of access to offsets throughout a job 
as e.g. the Kafka dstream gives, because Sinks are the main place that should 
need them.


> Sinks need access to offset representation
> --
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset 
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not 
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as 
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for 
> the starting and ending offsets (either the offsets themselves, or the 
> SPARK-17829 string/json representation).  That would be an API change, but if 
> there's another way to map batch ids to offset representations without 
> changing the Sink api that would work as well.  
> I'm assuming we don't need the same level of access to offsets throughout a 
> job as e.g. the Kafka dstream gives, because Sinks are the main place that 
> should need them.
> After SPARK-17829 is complete and offsets have a .json method, an API for 
> this ticket might look like
> {code}
> trait Sink {
>   def addBatch(batchId: Long, data: DataFrame, start: OffsetSeq, end: OffsetSeq): Unit
> }
> {code}
> where start and end would be provided by StreamExecution.runBatch using 
> committedOffsets and availableOffsets.  
> I'm not 100% certain that the offsets in the seq could always be mapped back 
> to the correct source when restarting complicated multi-source jobs, but I 
> think it'd be sufficient.  Passing the string/json representation of the seq 
> instead of the seq itself would probably be sufficient as well, but the 
> convention of rendering a None as "-" in the json is maybe a little 
> idiosyncratic to parse, and the constant defining that is private.





[jira] [Commented] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads

2016-11-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638546#comment-15638546
 ] 

Felix Cheung commented on SPARK-12757:
--

I'm seeing the same with latest master running a pipeline with GBTClassifier:

{code}
WARN Executor: 1 block locks were not released by TID = 7:
[rdd_28_0]
{code}

To reproduce, take the code sample from the ML programming guide:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel, 
GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a GBT model.
val gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and GBT in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
{code}

> Use reference counting to prevent blocks from being evicted during reads
> 
>
> Key: SPARK-12757
> URL: https://issues.apache.org/jira/browse/SPARK-12757
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> As a pre-requisite to off-heap caching of blocks, we need a mechanism to 
> prevent pages / blocks from being evicted while they are being read. With 
> on-heap objects, evicting a block while it is being read merely leads to 
> memory-accounting problems (because we assume that an evicted block is a 
> candidate for garbage-collection, which will not be true during a read), but 
> with off-heap memory this will lead to either data corruption or segmentation 
> faults.
> To address this, we should add a reference-counting mechanism to track which 
> blocks/pages are being read in order to prevent them from being evicted 
> prematurely. I propose to do this in two phases: first, add a safe, 
> conservative approach in which all BlockManager.get*() calls implicitly 
> increment the reference count of blocks and where tasks' references are 
> automatically freed upon task completion. This will be correct but may have 
> adverse performance impacts because it will prevent legitimate block 
> evictions. In phase two, we should incrementally add release() calls in order 
> to fix the eviction of unreferenced blocks. The latter change may need to 
> touch many different components, which is why I propose to do it separately 
> in order to make the changes easier to reason about and review.
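As a rough, self-contained sketch of the phase-one scheme described above (the names
and structure here are illustrative, not the actual BlockManager code): reads
implicitly pin a block, eviction is refused while a block is pinned, and all of a
task's pins are conservatively released when the task completes.

{code}
import scala.collection.mutable

// Minimal sketch of the "pin on read" idea: every get() increments a block's
// reference count, eviction is refused while the count is non-zero, and all
// pins taken by a task are dropped when it completes. Illustrative only.
class PinningBlockStore[V] {
  private val blocks     = mutable.Map.empty[String, V]
  private val refCounts  = mutable.Map.empty[String, Int]
  private val pinsByTask = mutable.Map.empty[Long, mutable.Buffer[String]]

  def put(blockId: String, value: V): Unit = synchronized { blocks(blockId) = value }

  // Reading a block implicitly pins it for the calling task.
  def get(taskId: Long, blockId: String): Option[V] = synchronized {
    val v = blocks.get(blockId)
    if (v.isDefined) {
      refCounts(blockId) = refCounts.getOrElse(blockId, 0) + 1
      pinsByTask.getOrElseUpdate(taskId, mutable.Buffer.empty) += blockId
    }
    v
  }

  // Eviction only succeeds when no task is currently reading the block.
  def tryEvict(blockId: String): Boolean = synchronized {
    if (refCounts.getOrElse(blockId, 0) > 0) false
    else { blocks.remove(blockId); true }
  }

  // Called on task completion: conservatively release every pin the task took.
  def releaseAllForTask(taskId: Long): Unit = synchronized {
    for (id <- pinsByTask.remove(taskId).getOrElse(mutable.Buffer.empty))
      refCounts(id) = refCounts.getOrElse(id, 1) - 1
  }
}
{code}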






[jira] [Created] (SPARK-18285) approxQuantile in R support multi-column

2016-11-04 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18285:


 Summary: approxQuantile in R support multi-column
 Key: SPARK-18285
 URL: https://issues.apache.org/jira/browse/SPARK-18285
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: zhengruifeng


approxQuantile in R should support multi-column.






[jira] [Commented] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2016-11-04 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638489#comment-15638489
 ] 

zhengruifeng commented on SPARK-13677:
--

It is shown in sklearn's docs 
(http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html)
that GBDT+LiR often brings higher scores than GBDT alone.

> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
> First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
> implemented in two well-known libraries:
> sklearn 
> (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
> xgboost 
> (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)
> I have implemented it in mllib:
> val features : RDD[Vector] = ...
> val model1 : RandomForestModel = ...
> val transformed1 : RDD[Vector] = model1.leaf(features)
> val model2 : GradientBoostedTreesModel = ...
> val transformed2 : RDD[Vector] = model2.leaf(features)
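To make the idea concrete, here is a minimal, self-contained sketch of the
transformation itself, using a toy tree structure (this is not the mllib API or the
PR's code; every name below is illustrative): each sample is mapped to the leaf it
lands in for every tree, and those leaf indices are one-hot encoded into a new,
concatenated feature space.

{code}
// Illustrative sketch of the proposed leaf-index + one-hot transformation.
sealed trait TreeNode
case class Leaf(leafIndex: Int) extends TreeNode
case class Split(feature: Int, threshold: Double, left: TreeNode, right: TreeNode) extends TreeNode

object LeafTransform {
  // Walk one tree and return the index of the leaf this sample falls into.
  def leafIndex(node: TreeNode, features: Array[Double]): Int = node match {
    case Leaf(i)                    => i
    case Split(f, thr, left, right) =>
      if (features(f) <= thr) leafIndex(left, features) else leafIndex(right, features)
  }

  // One-hot encode the leaf index of every tree into a single concatenated vector.
  def transform(trees: Seq[TreeNode], leavesPerTree: Seq[Int], features: Array[Double]): Array[Double] = {
    val out = Array.fill(leavesPerTree.sum)(0.0)
    var offset = 0
    trees.zip(leavesPerTree).foreach { case (tree, numLeaves) =>
      out(offset + leafIndex(tree, features)) = 1.0
      offset += numLeaves
    }
    out
  }
}
{code}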






[jira] [Commented] (SPARK-14047) GBT improvement umbrella

2016-11-04 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638475#comment-15638475
 ] 

zhengruifeng commented on SPARK-14047:
--

I personally think SPARK-15581 may be an improvement related to GBT.
Using LoR/FM on top of GBT's leaf transformation is now a mainstream approach.

> GBT improvement umbrella
> 
>
> Key: SPARK-14047
> URL: https://issues.apache.org/jira/browse/SPARK-14047
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is an umbrella for improvements to learning Gradient Boosted Trees: 
> GBTClassifier, GBTRegressor.
> Note: Aspects of GBTs which are related to individual trees should be listed 
> under [SPARK-14045].






[jira] [Comment Edited] (SPARK-14047) GBT improvement umbrella

2016-11-04 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638475#comment-15638475
 ] 

zhengruifeng edited comment on SPARK-14047 at 11/5/16 3:21 AM:
---

I personally think SPARK-13677 may be an improvement related to GBT.
Using LoR/FM on top of GBT's leaf transformation is now a mainstream approach.


was (Author: podongfeng):
I personally think SPARK-15581 may be an improvement related to GBT.
Using LoR/FM on top of GBT's leaf transformation is now a mainstream approach.

> GBT improvement umbrella
> 
>
> Key: SPARK-14047
> URL: https://issues.apache.org/jira/browse/SPARK-14047
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is an umbrella for improvements to learning Gradient Boosted Trees: 
> GBTClassifier, GBTRegressor.
> Note: Aspects of GBTs which are related to individual trees should be listed 
> under [SPARK-14045].






[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638470#comment-15638470
 ] 

Reynold Xin commented on SPARK-18258:
-

This makes sense. It's just extra information you want, so that you can see what's 
going on.

Can you sketch the API out and put a proposal in the ticket description? 
Doesn't need to be very well thought out. It will move the discussion forward.

> Sinks need access to offset representation
> --
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset 
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not 
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as 
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for 
> the starting and ending offsets (either the offsets themselves, or the 
> SPARK-17829 string/json representation).  That would be an API change, but if 
> there's another way to map batch ids to offset representations without 
> changing the Sink api that would work as well.  
> I'm assuming we don't need the same level of access to offsets throughout a 
> job as e.g. the Kafka dstream gives, because Sinks are the main place that 
> should need them.






[jira] [Commented] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2016-11-04 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638459#comment-15638459
 ] 

zhengruifeng commented on SPARK-13677:
--

Since mllib is in maintenance mode, if this feature is to be included, the 
corresponding PR will be updated to focus on ML only.

> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
> First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
> implemented in two well-known libraries:
> sklearn 
> (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
> xgboost 
> (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)
> I have implemented it in mllib:
> val features : RDD[Vector] = ...
> val model1 : RandomForestModel = ...
> val transformed1 : RDD[Vector] = model1.leaf(features)
> val model2 : GradientBoostedTreesModel = ...
> val transformed2 : RDD[Vector] = model2.leaf(features)






[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2016-11-04 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-13677:
-
Summary: Support Tree-Based Feature Transformation for ML  (was: Support 
Tree-Based Feature Transformation for mllib)

> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
> First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
> implemented in two well-known libraries:
> sklearn 
> (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
> xgboost 
> (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)
> I have implemented it in mllib:
> val features : RDD[Vector] = ...
> val model1 : RandomForestModel = ...
> val transformed1 : RDD[Vector] = model1.leaf(features)
> val model2 : GradientBoostedTreesModel = ...
> val transformed2 : RDD[Vector] = model2.leaf(features)






[jira] [Updated] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2016-11-04 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-14174:
-
Description: 
The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
mini-batches to reduce the computation time, while still attempting to optimise 
the same objective function. Mini-batches are subsets of the input data, 
randomly sampled in each training iteration. These mini-batches drastically 
reduce the amount of computation required to converge to a local solution. In 
contrast to other algorithms that reduce the convergence time of k-means, 
mini-batch k-means produces results that are generally only slightly worse than 
the standard algorithm.

I have implemented mini-batch k-means in MLlib, and the acceleration is really 
significant.
The mini-batch KMeans is named XMeans in the following lines.
{code}
val path = "/tmp/mnist8m.scale"
val data = MLUtils.loadLibSVMFile(sc, path)
val vecs = data.map(_.features).persist()

val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", seed=123l)
km.computeCost(vecs)
res0: Double = 3.317029898599564E8

val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
xm.computeCost(vecs)
res1: Double = 3.3169865959604424E8

val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
xm2.computeCost(vecs)
res2: Double = 3.317195831216454E8
{code}
All three training runs above reached the maximum number of iterations (10).
We can see that the WSSSEs are almost the same, while the training times differ 
significantly:
{code}
KMeans                              2876 sec
MiniBatch KMeans (fraction=0.1)      263 sec
MiniBatch KMeans (fraction=0.01)      90 sec
{code}

With an appropriate fraction, the bigger the dataset, the higher the speedup.

The dataset used above has 8,100,000 samples and 784 features. It can be downloaded 
here 
(https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)

Comparison of K-Means and MiniBatchKMeans in sklearn: 
http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
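For readers unfamiliar with the algorithm, a minimal, self-contained sketch of the
mini-batch update follows (illustrative only; this is not the XMeans code benchmarked
above, and all names are made up): each iteration samples a fraction of the points,
assigns them to the nearest center, and nudges each center with a per-center learning
rate that decays as the center accumulates points.

{code}
import scala.util.Random

object MiniBatchKMeansSketch {
  def distSq(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def run(data: Array[Array[Double]], k: Int, fraction: Double,
          maxIterations: Int, seed: Long): Array[Array[Double]] = {
    val rng     = new Random(seed)
    val centers = rng.shuffle(data.toSeq).take(k).map(_.clone).toArray
    val counts  = Array.fill(k)(0L)

    for (_ <- 1 to maxIterations) {
      val batch = data.filter(_ => rng.nextDouble() < fraction)  // sampled mini-batch
      for (point <- batch) {
        val j = centers.indices.minBy(i => distSq(point, centers(i)))  // nearest center
        counts(j) += 1
        val eta = 1.0 / counts(j)                                      // per-center step size
        for (d <- point.indices)
          centers(j)(d) = (1 - eta) * centers(j)(d) + eta * point(d)
      }
    }
    centers
  }
}
{code}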

  was:
The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
mini-batches to reduce the computation time, while still attempting to optimise 
the same objective function. Mini-batches are subsets of the input data, 
randomly sampled in each training iteration. These mini-batches drastically 
reduce the amount of computation required to converge to a local solution. In 
contrast to other algorithms that reduce the convergence time of k-means, 
mini-batch k-means produces results that are generally only slightly worse than 
the standard algorithm.

I have implemented mini-batch k-means in MLlib, and the acceleration is really 
significant.
The mini-batch KMeans is named XMeans in the following lines.
{code}
val path = "/tmp/mnist8m.scale"
val data = MLUtils.loadLibSVMFile(sc, path)
val vecs = data.map(_.features).persist()

val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", seed=123l)
km.computeCost(vecs)
res0: Double = 3.317029898599564E8

val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
xm.computeCost(vecs)
res1: Double = 3.3169865959604424E8

val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
xm2.computeCost(vecs)
res2: Double = 3.317195831216454E8
{code}
All three training runs above reached the maximum number of iterations (10).
We can see that the WSSSEs are almost the same, while the training times differ 
significantly:
{code}
KMeans                              2876 sec
MiniBatch KMeans (fraction=0.1)      263 sec
MiniBatch KMeans (fraction=0.01)      90 sec
{code}

With an appropriate fraction, the bigger the dataset, the higher the speedup.

The dataset used above has 8,100,000 samples and 784 features. It can be downloaded 
here 
(https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)


> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly 

[jira] [Created] (SPARK-18284) Schema of DataFrame generated from RDD is different between master and 2.0

2016-11-04 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-18284:


 Summary: Schema of DataFrame generated from RDD is different 
between master and 2.0
 Key: SPARK-18284
 URL: https://issues.apache.org/jira/browse/SPARK-18284
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki


When the following program is executed, the schema of the DataFrame is different 
between master and branch-2.0. The nullable flag of the value column should be false.

{code:java}
val df = sparkContext.parallelize(1 to 8, 1).toDF()
df.printSchema
df.filter("value > 4").count

=== master ===
root
 |-- value: integer (nullable = true)

=== branch 2.0 ===
root
 |-- value: integer (nullable = false)
{code}






[jira] [Resolved] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results

2016-11-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-18256.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15756
[https://github.com/apache/spark/pull/15756]

> Improve performance of event log replay in HistoryServer based on profiler 
> results
> --
>
> Key: SPARK-18256
> URL: https://issues.apache.org/jira/browse/SPARK-18256
> Project: Spark
>  Issue Type: Improvement
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.2.0
>
>
> Profiling event log replay in the HistoryServer reveals Json4S control flow 
> exceptions and `Utils.getFormattedClassName` calls as significant 
> bottlenecks. Eliminating these halves the time to replay long event logs.
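The actual changes are in the linked pull request; purely as an illustration of the
kind of hotspot removal this describes, memoizing a repeated reflective lookup looks
roughly like the sketch below (the object and method names are made up).

{code}
import scala.collection.mutable

// Illustrative only: cache per-class formatted names instead of recomputing
// them for every replayed event.
object ClassNameCache {
  private val cache = mutable.Map.empty[Class[_], String]

  def formattedName(obj: AnyRef): String = cache.synchronized {
    cache.getOrElseUpdate(obj.getClass, obj.getClass.getSimpleName.stripSuffix("$"))
  }
}
{code}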






[jira] [Commented] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638392#comment-15638392
 ] 

Apache Spark commented on SPARK-17748:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/15779

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
> Fix For: 2.1.0
>
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elasticnet penalties by solving the equations locally using OWL-QN solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for solving singular covariance matrices by also adding 
> L-BFGS.
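For reference, the objective such a local solver minimizes is the standard elastic-net
penalized least squares (shown here in its textbook form, not taken from the patch),
where lambda is the overall regularization strength and alpha the elastic-net mixing
parameter; alpha = 0 keeps the closed-form Cholesky path, while alpha > 0 needs an
iterative solver such as OWL-QN:

{code}
\min_{w}\; \frac{1}{2n}\sum_{i=1}^{n} \bigl(y_i - x_i^{\top} w\bigr)^2
  \;+\; \lambda \Bigl[\, \alpha \lVert w \rVert_1 \;+\; \frac{1-\alpha}{2} \lVert w \rVert_2^2 \Bigr]
{code}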






[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-11-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638280#comment-15638280
 ] 

Felix Cheung commented on SPARK-15581:
--

This is a great next step if we could get more concrete on the roadmap.

I would really like to call out R specifically though. I think there are 
several aspects we could do better:
- Keeping up to date: There are many ML algorithms whose R APIs are 3+ minor releases 
behind. How could we stay more up to date on new features or changes and yet have 
stable enough Scala APIs to work with? Alternatively, how could we make R easier for 
ML contributors to work with?
- Supporting advanced users: As of now the ML APIs in R are fairly straightforward 
and are missing any capability to configure a pipeline with an extractor, feature 
transformer, or even an evaluator. How could we make this more adaptable?


> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
> Fix For: 2.1.0
>
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose 

[jira] [Commented] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics

2016-11-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638254#comment-15638254
 ] 

Felix Cheung commented on SPARK-10523:
--

Is this still an issue? As Yanbo says, we now support string labels for 
classification and we have multiclass logistic regression in R.


> SparkR formula syntax to turn strings/factors into numerics
> ---
>
> Key: SPARK-10523
> URL: https://issues.apache.org/jira/browse/SPARK-10523
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Reporter: Vincent Warmerdam
>
> In normal (non SparkR) R the formula syntax enables strings or factors to be 
> turned into dummy variables immediately when calling a classifier. This way, 
> the following R pattern is legal and often used:
> {code}
> library(magrittr) 
> df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6))
> glm(class ~ i, family = "binomial", data = df)
> {code}
> The glm method will know that `class` is a string/factor and handles it 
> appropriately by casting it to a 0/1 array before applying any machine 
> learning. SparkR doesn't do this. 
> {code}
> > ddf <- sqlContext %>% 
>   createDataFrame(df)
> > glm(class ~ i, family = "binomial", data = ddf)
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.IllegalArgumentException: Unsupported type for label: StringType
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185)
>   at 
> org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146)
>   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>   at 
> scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46)
>   at 
> org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.refl
> {code}
> This can be fixed by doing a bit of manual labor. SparkR does accept booleans 
> as if they are integers here. 
> {code}
> > ddf <- ddf %>% 
>   withColumn("to_pred", .$class == "a") 
> > glm(to_pred ~ i, family = "binomial", data = ddf)
> {code}
> But this can become quite tedious, especially when you want models that use 
> multiple classes for classification. This is perhaps less relevant for logistic 
> regression (because it is a bit more like a one-off classification approach), 
> but it certainly is relevant if you want to use a formula for a random forest 
> and a column denotes, say, a type of flower from the iris dataset. 
> Is there a good reason why this should not be a feature of formulas in Spark? 
> I am aware of issue 8774, which looks like it is addressing a similar theme 
> but a different issue. 






[jira] [Assigned] (SPARK-18283) Add a test to make sure the default starting offset is latest

2016-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18283:


Assignee: Tathagata Das  (was: Apache Spark)

> Add a test to make sure the default starting offset is latest
> -
>
> Key: SPARK-18283
> URL: https://issues.apache.org/jira/browse/SPARK-18283
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>







[jira] [Commented] (SPARK-18283) Add a test to make sure the default starting offset is latest

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638228#comment-15638228
 ] 

Apache Spark commented on SPARK-18283:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/15778

> Add a test to make sure the default starting offset is latest
> -
>
> Key: SPARK-18283
> URL: https://issues.apache.org/jira/browse/SPARK-18283
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>







[jira] [Assigned] (SPARK-18283) Add a test to make sure the default starting offset is latest

2016-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18283:


Assignee: Apache Spark  (was: Tathagata Das)

> Add a test to make sure the default starting offset is latest
> -
>
> Key: SPARK-18283
> URL: https://issues.apache.org/jira/browse/SPARK-18283
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-18283) Add a test to make sure the default starting offset is latest

2016-11-04 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-18283:
-

 Summary: Add a test to make sure the default starting offset is 
latest
 Key: SPARK-18283
 URL: https://issues.apache.org/jira/browse/SPARK-18283
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das









[jira] [Commented] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638190#comment-15638190
 ] 

Apache Spark commented on SPARK-18282:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/15777

> Add model summaries for Python GMM and BisectingKMeans
> --
>
> Key: SPARK-18282
> URL: https://issues.apache.org/jira/browse/SPARK-18282
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> GaussianMixtureModel and BisectingKMeansModel in python do not have model 
> summaries, but they are implemented in Scala. We should add them for API 
> parity before 2.1 release. After the QA Jiras are created, this can be linked 
> as a subtask.






[jira] [Assigned] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans

2016-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18282:


Assignee: Apache Spark

> Add model summaries for Python GMM and BisectingKMeans
> --
>
> Key: SPARK-18282
> URL: https://issues.apache.org/jira/browse/SPARK-18282
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>Priority: Minor
>
> GaussianMixtureModel and BisectingKMeansModel in python do not have model 
> summaries, but they are implemented in Scala. We should add them for API 
> parity before 2.1 release. After the QA Jiras are created, this can be linked 
> as a subtask.






[jira] [Assigned] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans

2016-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18282:


Assignee: (was: Apache Spark)

> Add model summaries for Python GMM and BisectingKMeans
> --
>
> Key: SPARK-18282
> URL: https://issues.apache.org/jira/browse/SPARK-18282
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Seth Hendrickson
>Priority: Minor
>
> GaussianMixtureModel and BisectingKMeansModel in python do not have model 
> summaries, but they are implemented in Scala. We should add them for API 
> parity before 2.1 release. After the QA Jiras are created, this can be linked 
> as a subtask.






[jira] [Created] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans

2016-11-04 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18282:


 Summary: Add model summaries for Python GMM and BisectingKMeans
 Key: SPARK-18282
 URL: https://issues.apache.org/jira/browse/SPARK-18282
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Reporter: Seth Hendrickson
Priority: Minor


GaussianMixtureModel and BisectingKMeansModel in python do not have model 
summaries, but they are implemented in Scala. We should add them for API parity 
before 2.1 release. After the QA Jiras are created, this can be linked as a 
subtask.






[jira] [Commented] (SPARK-17710) ReplSuite fails with ClassCircularityError in master Maven builds

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638114#comment-15638114
 ] 

Apache Spark commented on SPARK-17710:
--

User 'weiqingy' has created a pull request for this issue:
https://github.com/apache/spark/pull/15776

> ReplSuite fails with ClassCircularityError in master Maven builds
> -
>
> Key: SPARK-17710
> URL: https://issues.apache.org/jira/browse/SPARK-17710
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Weiqing Yang
>Priority: Critical
> Fix For: 2.1.0
>
>
> The master Maven build is currently broken because ReplSuite consistently 
> fails with ClassCircularityErrors. See 
> https://spark-tests.appspot.com/jobs/spark-master-test-maven-hadoop-2.3 for a 
> timeline of the failure.
> Here's the first build where this failed: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.3/2000/
> This appears to correspond to 
> https://github.com/apache/spark/commit/6a68c5d7b4eb07e4ed6b702dd1536cd08d9bba7d
> The same tests pass in the SBT build.






[jira] [Updated] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-11-04 Thread Luke Miner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Miner updated SPARK-18281:
---
Description: 
I run the example straight out of the API docs for toLocalIterator and it gives 
a timeout exception:

{code}
from pyspark import SparkContext
sc = SparkContext()
rdd = sc.parallelize(range(10))
[x for x in rdd.toLocalIterator()]
{code}

conf file:
spark.driver.maxResultSize 6G
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
-XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory   16G
spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
spark.hadoop.fs.s3a.connection.timeout 50
spark.hadoop.fs.s3n.multipart.uploads.enabled   true
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.parquet.block.size 2147483648
spark.hadoop.parquet.enable.summary-metadata false
spark.jars.packages 
com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
spark.local.dir /raid0/spark
spark.mesos.coarse  false
spark.mesos.constraints  priority:1
spark.network.timeout   600
spark.rpc.message.maxSize 500
spark.speculation   false
spark.sql.parquet.mergeSchema   false
spark.sql.planner.externalSort  true
spark.submit.deployMode client
spark.task.cpus 1

Exception here:
{code}
---
timeout   Traceback (most recent call last)
 in ()
  2 sc = SparkContext()
  3 rdd = sc.parallelize(range(10))
> 4 [x for x in rdd.toLocalIterator()]

/foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
_load_from_socket(port, serializer)
140 try:
141 rf = sock.makefile("rb", 65536)
--> 142 for item in serializer.load_stream(rf):
143 yield item
144 finally:

/foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
load_stream(self, stream)
137 while True:
138 try:
--> 139 yield self._read_with_length(stream)
140 except EOFError:
141 return

/foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
_read_with_length(self, stream)
154 
155 def _read_with_length(self, stream):
--> 156 length = read_int(stream)
157 if length == SpecialLengths.END_OF_DATA_SECTION:
158 raise EOFError

/foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
read_int(stream)
541 
542 def read_int(stream):
--> 543 length = stream.read(4)
544 if not length:
545 raise EOFError

/usr/lib/python2.7/socket.pyc in read(self, size)
378 # fragmentation issues on many platforms.
379 try:
--> 380 data = self._sock.recv(left)
381 except error, e:
382 if e.args[0] == EINTR:

timeout: timed out
{code}



  was:
I run the example straight out of the API docs for {code}toLocalIterator{code} 
and it gives a timeout exception:

{code}
from pyspark import SparkContext
sc = SparkContext()
rdd = sc.parallelize(range(10))
[x for x in rdd.toLocalIterator()]
{code}

conf file:
spark.driver.maxResultSize 6G
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
-XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory   16G
spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
spark.hadoop.fs.s3a.connection.timeout 50
spark.hadoop.fs.s3n.multipart.uploads.enabled   true
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.parquet.block.size 2147483648
spark.hadoop.parquet.enable.summary-metadata false
spark.jars.packages 
com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
spark.local.dir /raid0/spark
spark.mesos.coarse  false
spark.mesos.constraints  priority:1
spark.network.timeout   600
spark.rpc.message.maxSize 500
spark.speculation   false
spark.sql.parquet.mergeSchema   false
spark.sql.planner.externalSort  true
spark.submit.deployMode client
spark.task.cpus 1

Exception here:
{code}
---
timeout   Traceback (most recent call last)
 in ()
  2 sc = SparkContext()
  3 rdd = sc.parallelize(range(10))
> 4 [x for x in rdd.toLocalIterator()]

/foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
_load_from_socket(port, serializer)
140 try:
141 rf = sock.makefile("rb", 65536)
--> 142 for item in serializer.load_stream(rf):
143 yield item
144 

[jira] [Created] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-11-04 Thread Luke Miner (JIRA)
Luke Miner created SPARK-18281:
--

 Summary: toLocalIterator yields time out error on pyspark2
 Key: SPARK-18281
 URL: https://issues.apache.org/jira/browse/SPARK-18281
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.1
 Environment: Ubuntu 14.04.5 LTS
Driver: AWS M4.XLARGE
Slaves: AWS M4.4.XLARGE
mesos 1.0.1
spark 2.0.1
pyspark
Reporter: Luke Miner


I run the example straight out of the API docs for {code}toLocalIterator{code} 
and it gives a timeout exception:

{code}
from pyspark import SparkContext
sc = SparkContext()
rdd = sc.parallelize(range(10))
[x for x in rdd.toLocalIterator()]
{code}

conf file:
spark.driver.maxResultSize 6G
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
-XX:+HeapDumpOnOutOfMemoryError
spark.executor.memory   16G
spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
spark.hadoop.fs.s3a.connection.timeout 50
spark.hadoop.fs.s3n.multipart.uploads.enabled   true
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.parquet.block.size 2147483648
spark.hadoop.parquet.enable.summary-metadata false
spark.jars.packages 
com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
spark.local.dir /raid0/spark
spark.mesos.coarse  false
spark.mesos.constraints  priority:1
spark.network.timeout   600
spark.rpc.message.maxSize 500
spark.speculation   false
spark.sql.parquet.mergeSchema   false
spark.sql.planner.externalSort  true
spark.submit.deployMode client
spark.task.cpus 1

Exception here:
{code}
---
timeout   Traceback (most recent call last)
 in ()
  2 sc = SparkContext()
  3 rdd = sc.parallelize(range(10))
> 4 [x for x in rdd.toLocalIterator()]

/foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
_load_from_socket(port, serializer)
140 try:
141 rf = sock.makefile("rb", 65536)
--> 142 for item in serializer.load_stream(rf):
143 yield item
144 finally:

/foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
load_stream(self, stream)
137 while True:
138 try:
--> 139 yield self._read_with_length(stream)
140 except EOFError:
141 return

/foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
_read_with_length(self, stream)
154 
155 def _read_with_length(self, stream):
--> 156 length = read_int(stream)
157 if length == SpecialLengths.END_OF_DATA_SECTION:
158 raise EOFError

/foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
read_int(stream)
541 
542 def read_int(stream):
--> 543 length = stream.read(4)
544 if not length:
545 raise EOFError

/usr/lib/python2.7/socket.pyc in read(self, size)
378 # fragmentation issues on many platforms.
379 try:
--> 380 data = self._sock.recv(left)
381 except error, e:
382 if e.args[0] == EINTR:

timeout: timed out
{code}








[jira] [Updated] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results

2016-11-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16804:

Fix Version/s: 2.0.2

> Correlated subqueries containing non-deterministic operators return incorrect 
> results
> -
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Nattavut Sutyanyong
> Fix For: 2.0.2, 2.1.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates into join 
> predicates and neglects the semantics of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.






[jira] [Updated] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result

2016-11-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17337:

Fix Version/s: 2.0.2

> Incomplete algorithm for name resolution in Catalyst parser may lead to 
> incorrect result
> ---
>
> Key: SPARK-17337
> URL: https://issues.apache.org/jira/browse/SPARK-17337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Herman van Hovell
>  Labels: correctness
> Fix For: 2.0.2, 2.1.0
>
>
> While investigating SPARK-16951, I found an incorrect-results case from a NOT 
> IN subquery. I originally thought it was an edge case; further investigation 
> found this is a more general problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637905#comment-15637905
 ] 

Apache Spark commented on SPARK-16804:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15772

> Correlated subqueries containing non-deterministic operators return incorrect 
> results
> -
>
> Key: SPARK-16804
> URL: https://issues.apache.org/jira/browse/SPARK-16804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Nattavut Sutyanyong
> Fix For: 2.1.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Correlated subqueries with LIMIT could return incorrect results. The rule 
> ResolveSubquery in the Analysis phase moves correlated predicates to join 
> predicates and neglects the semantics of the LIMIT.
> Example:
> {noformat}
> Seq(1, 2).toDF("c1").createOrReplaceTempView("t1")
> Seq(1, 2).toDF("c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT 
> 1)").show
> +---+ 
>   
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result contains both rows from T1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`

2016-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18280:


Assignee: (was: Apache Spark)

> Potential deadlock in `StandaloneSchedulerBackend.dead`
> ---
>
> Key: SPARK-18280
> URL: https://issues.apache.org/jira/browse/SPARK-18280
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>
> "StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not 
> call "SparkContext.stop" in the same thread. "SparkContext.stop" will block 
> until all RPC threads exit, if it's called inside a RPC thread, it will be 
> dead-lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637873#comment-15637873
 ] 

Apache Spark commented on SPARK-18280:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15775

> Potential deadlock in `StandaloneSchedulerBackend.dead`
> ---
>
> Key: SPARK-18280
> URL: https://issues.apache.org/jira/browse/SPARK-18280
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>
> "StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not 
> call "SparkContext.stop" in the same thread. "SparkContext.stop" will block 
> until all RPC threads exit, if it's called inside a RPC thread, it will be 
> dead-lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`

2016-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18280:


Assignee: Apache Spark

> Potential deadlock in `StandaloneSchedulerBackend.dead`
> ---
>
> Key: SPARK-18280
> URL: https://issues.apache.org/jira/browse/SPARK-18280
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> "StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not 
> call "SparkContext.stop" in the same thread. "SparkContext.stop" will block 
> until all RPC threads exit, if it's called inside a RPC thread, it will be 
> dead-lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`

2016-11-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18280:


 Summary: Potential deadlock in `StandaloneSchedulerBackend.dead`
 Key: SPARK-18280
 URL: https://issues.apache.org/jira/browse/SPARK-18280
 Project: Spark
  Issue Type: Bug
Reporter: Shixiong Zhu


"StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not 
call "SparkContext.stop" in the same thread. "SparkContext.stop" will block 
until all RPC threads exit, if it's called inside a RPC thread, it will be 
dead-lock.
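
A minimal, self-contained sketch (plain JVM executors, not Spark internals) of the deadlock pattern being described: a task running on one of a pool's own threads tries to wait for the whole pool to terminate, which can never happen while that task is still running.

{code}
import java.util.concurrent.{Executors, TimeUnit}

object SelfShutdownSketch {
  def main(args: Array[String]): Unit = {
    val rpcPool = Executors.newSingleThreadExecutor()
    rpcPool.submit(new Runnable {
      override def run(): Unit = {
        // Analogous to calling SparkContext.stop() from inside an RPC thread:
        // requesting shutdown is fine, but waiting for "all RPC threads to
        // exit" from within one of those threads cannot complete.
        rpcPool.shutdown()
        val terminated = rpcPool.awaitTermination(5, TimeUnit.SECONDS)
        println(s"pool terminated: $terminated") // false; an unbounded wait would hang forever
      }
    })
  }
}
{code}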



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`

2016-11-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18280:
-
Affects Version/s: 1.6.2
   2.0.0
   2.0.1

> Potential deadlock in `StandaloneSchedulerBackend.dead`
> ---
>
> Key: SPARK-18280
> URL: https://issues.apache.org/jira/browse/SPARK-18280
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>
> "StandaloneSchedulerBackend.dead" is called in a RPC thread, so it should not 
> call "SparkContext.stop" in the same thread. "SparkContext.stop" will block 
> until all RPC threads exit, if it's called inside a RPC thread, it will be 
> dead-lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637824#comment-15637824
 ] 

Apache Spark commented on SPARK-18189:
--

User 'seyfe' has created a pull request for this issue:
https://github.com/apache/spark/pull/15774

> task not serializable with groupByKey() + mapGroups() + map
> ---
>
> Key: SPARK-18189
> URL: https://issues.apache.org/jira/browse/SPARK-18189
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yang Yang
>Assignee: Ergin Seyfe
> Fix For: 2.0.2, 2.1.0
>
>
> just run the following code 
> {code}
> val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
> val grouped = a.groupByKey({x:(Int,Int)=>x._1})
> val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)})
> val yyy = sc.broadcast(1)
> val last = mappedGroups.rdd.map(xx=>{
>   val simpley = yyy.value
>   
>   1
> })
> {code}
> spark says Task not serializable



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18279) ML programming guide should have R examples

2016-11-04 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18279:
-
Affects Version/s: 2.1.0
 Target Version/s: 2.1.0

> ML programming guide should have R examples
> ---
>
> Key: SPARK-18279
> URL: https://issues.apache.org/jira/browse/SPARK-18279
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> http://spark.apache.org/docs/latest/ml-classification-regression.html
> for example, should have the R tab with examples.
> This should be done for all ML algorithms that have an R API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18279) ML programming guide should have R examples

2016-11-04 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-18279:
-
Description: 
http://spark.apache.org/docs/latest/ml-classification-regression.html
for example, should have the R tab with examples, just like Scala/Java/Python.

This should be done for all ML algorithms that have an R API.


  was:
http://spark.apache.org/docs/latest/ml-classification-regression.html
for example, should have the R tab with examples.

This should be done for all ML algorithms that have an R API.



> ML programming guide should have R examples
> ---
>
> Key: SPARK-18279
> URL: https://issues.apache.org/jira/browse/SPARK-18279
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>
> http://spark.apache.org/docs/latest/ml-classification-regression.html
> for example, should have the R tab with examples, just like Scala/Java/Python.
> This should be done for all ML algorithms that have an R API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18279) ML programming guide should have R examples

2016-11-04 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-18279:


 Summary: ML programming guide should have R examples
 Key: SPARK-18279
 URL: https://issues.apache.org/jira/browse/SPARK-18279
 Project: Spark
  Issue Type: Bug
  Components: ML, SparkR
Reporter: Felix Cheung


http://spark.apache.org/docs/latest/ml-classification-regression.html
for example, should have the R tab with examples.

This should be done for all ML algorithms that have an R API.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18266) Update R vignettes and programming guide for 2.1.0 release

2016-11-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637756#comment-15637756
 ] 

Felix Cheung commented on SPARK-18266:
--

Actually, I just realized the ML programming guide (not just the SparkR 
programming guide) should also talk about R APIs.
Opened https://issues.apache.org/jira/browse/SPARK-18279


> Update R vignettes and programming guide for 2.1.0 release
> --
>
> Key: SPARK-18266
> URL: https://issues.apache.org/jira/browse/SPARK-18266
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed

2016-11-04 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar closed SPARK-18273.


> DataFrameReader.load takes a lot of time to start the job if a lot of 
> file/dir paths are passed 
> --
>
> Key: SPARK-18273
> URL: https://issues.apache.org/jira/browse/SPARK-18273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> If the paths Seq parameter contains a lot of elements, then 
> DataFrameReader.load takes a lot of time starting the job as it attempts to 
> check whether each of the paths exists using fs.exists. There should be a boolean 
> configuration option to disable the existence check, passed in as a parameter 
> to the DataSource.resolveRelation call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed

2016-11-04 Thread Aniket Bhatnagar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Bhatnagar resolved SPARK-18273.
--
Resolution: Not A Problem

Glob patterns can be passed instead of full paths to reduce the number of 
paths passed to the load method.
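
For reference, a short sketch of what that looks like in practice (the paths and format below are placeholders, not taken from the report):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("glob-load-sketch").getOrCreate()

// Instead of enumerating every directory, e.g.
//   spark.read.parquet("/data/events/2016-01-01", "/data/events/2016-01-02", ...)
// a single Hadoop glob pattern covers them all with one path entry:
val df = spark.read.parquet("/data/events/2016-*")
df.count()
{code}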

> DataFrameReader.load takes a lot of time to start the job if a lot of 
> file/dir paths are passed 
> --
>
> Key: SPARK-18273
> URL: https://issues.apache.org/jira/browse/SPARK-18273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> If the paths Seq parameter contains a lot of elements, then 
> DataFrameReader.load takes a lot of time starting the job as it attempts to 
> check whether each of the paths exists using fs.exists. There should be a boolean 
> configuration option to disable the existence check, passed in as a parameter 
> to the DataSource.resolveRelation call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed

2016-11-04 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637731#comment-15637731
 ] 

Aniket Bhatnagar commented on SPARK-18273:
--

Thanks [~srowen]. I didn't realize that I could actually pass a glob pattern. Thank 
you so much.

> DataFrameReader.load takes a lot of time to start the job if a lot of 
> file/dir paths are passed 
> --
>
> Key: SPARK-18273
> URL: https://issues.apache.org/jira/browse/SPARK-18273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> If the paths Seq parameter contains a lot of elements, then 
> DataFrameReader.load takes a lot of time starting the job as it attempts to 
> check whether each of the paths exists using fs.exists. There should be a boolean 
> configuration option to disable the existence check, passed in as a parameter 
> to the DataSource.resolveRelation call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637713#comment-15637713
 ] 

Nicholas Chammas commented on SPARK-18277:
--

{quote}
If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is 
None, '')}} because {{Column}} doesn't implement the appropriate protocol for 
{{is}}.
{quote}

Ah my bad, in this case the appropriate thing to do is 
{{when(col('a.b').isNull(), '')}}. So there is a workaround available today via 
{{when()}} and {{isNull()}}.
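
A minimal Scala sketch of that workaround (column names follow the example in the description; rebuilding the struct this way is an illustration, not a general-purpose fix):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, when}

val spark = SparkSession.builder().appName("fill-struct-sketch").getOrCreate()
import spark.implicits._

// Same shape as the example in the description: a struct column `a` with a
// nullable string field `b`, plus a top-level string column `c`.
val df = Seq((Option("yeah yeah"), Option("alright")),
             (Option.empty[String], Option.empty[String]))
  .toDF("b", "c")
  .select(struct(col("b")).as("a"), col("c"))

val filled = df
  .withColumn("a", struct(when(col("a.b").isNull, "").otherwise(col("a.b")).as("b")))
  .na.fill("") // na.fill() still handles the top-level column c

filled.show()
{code}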

> na.fill() and friends should work on struct fields
> --
>
> Key: SPARK-18277
> URL: https://issues.apache.org/jira/browse/SPARK-18277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It appears that you cannot use {{fill()}} and friends to quickly modify 
> struct fields.
> For example:
> {code}
> >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
> >>> Row(a=Row(b=None), c=None)])
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> >>> df.show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   null|
> +---+---+
> >>> df.na.fill('').show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   |
> +---+---+
> {code}
> {{c}} got filled in, but {{a.b}} didn't.
> I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
> friends worked automatically on struct fields.
> As things are today, there doesn't appear to be a way to fill in null values 
> inside structs. If you try {{when()}}, you realize that you cannot do 
> {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the 
> appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, 
> '')}} it doesn't catch the null values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18276) Some ML training summaries are not copied when {{copy()}} is called.

2016-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18276:


Assignee: Apache Spark

> Some ML training summaries are not copied when {{copy()}} is called.
> 
>
> Key: SPARK-18276
> URL: https://issues.apache.org/jira/browse/SPARK-18276
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>Priority: Minor
>
> GaussianMixture, KMeans, BisectingKMeans, and GeneralizedLinearRegression 
> models do not copy their training summaries inside the {{copy}} method. In 
> contrast, Linear/Logistic regression models do. They should all be consistent.
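
A generic (non-Spark) sketch of the pattern being asked for, with all names illustrative: copy() has to carry over the optional training-time summary explicitly, otherwise it silently disappears on the copied model.

{code}
final case class TrainingSummary(iterations: Int, objective: Double)

class Model(val coefficients: Array[Double]) {
  private var summary: Option[TrainingSummary] = None

  def setSummary(s: TrainingSummary): this.type = { summary = Some(s); this }
  def hasSummary: Boolean = summary.isDefined

  def copy(): Model = {
    val copied = new Model(coefficients.clone())
    summary.foreach(copied.setSummary) // the step the listed models were missing
    copied
  }
}
{code}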



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18276) Some ML training summaries are not copied when {{copy()}} is called.

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637671#comment-15637671
 ] 

Apache Spark commented on SPARK-18276:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/15773

> Some ML training summaries are not copied when {{copy()}} is called.
> 
>
> Key: SPARK-18276
> URL: https://issues.apache.org/jira/browse/SPARK-18276
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> GaussianMixture, KMeans, BisectingKMeans, and GeneralizedLinearRegression 
> models do not copy their training summaries inside the {{copy}} method. In 
> contrast, Linear/Logistic regression models do. They should all be consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18276) Some ML training summaries are not copied when {{copy()}} is called.

2016-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18276:


Assignee: (was: Apache Spark)

> Some ML training summaries are not copied when {{copy()}} is called.
> 
>
> Key: SPARK-18276
> URL: https://issues.apache.org/jira/browse/SPARK-18276
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> GaussianMixture, KMeans, BisectingKMeans, and GeneralizedLinearRegression 
> models do not copy their training summaries inside the {{copy}} method. In 
> contrast, Linear/Logistic regression models do. They should all be consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637654#comment-15637654
 ] 

Nicholas Chammas commented on SPARK-18277:
--

Thanks for the pointer. I'll follow the discussion there.

> na.fill() and friends should work on struct fields
> --
>
> Key: SPARK-18277
> URL: https://issues.apache.org/jira/browse/SPARK-18277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It appears that you cannot use {{fill()}} and friends to quickly modify 
> struct fields.
> For example:
> {code}
> >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
> >>> Row(a=Row(b=None), c=None)])
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> >>> df.show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   null|
> +---+---+
> >>> df.na.fill('').show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   |
> +---+---+
> {code}
> {{c}} got filled in, but {{a.b}} didn't.
> I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
> friends worked automatically on struct fields.
> As things are today, there doesn't appear to be a way to fill in null values 
> inside structs. If you try {{when()}}, you realize that you cannot do 
> {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the 
> appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, 
> '')}} it doesn't catch the null values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-04 Thread Erik Erlandson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson updated SPARK-18278:
---
External issue URL: https://github.com/kubernetes/kubernetes/issues/34377
 External issue ID: #34377

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637621#comment-15637621
 ] 

Cody Koeninger commented on SPARK-18258:


So one obvious case is that if whatever store holds the checkpoint data fails or 
is corrupted, my downstream database can still be fine and have correct 
results, yet I have no way of restarting the job from a known point, because 
the batch id stored in the database is now meaningless.

Basically, I do not want to introduce another N points of failure between 
Kafka and my data store.
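
To make that concrete, here is a hedged sketch (plain JDBC, not the actual Sink API; the table names and the offsets-as-JSON column are assumptions) of what storing the offsets in the same transaction as the results would look like:

{code}
import java.sql.DriverManager

// Commits a batch's rows and its ending offsets atomically, so recovery does
// not depend on Spark's checkpoint directory or on the batch id alone.
def addBatchTransactionally(jdbcUrl: String,
                            batchId: Long,
                            rows: Seq[(String, Long)],
                            endOffsetsJson: String): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    conn.setAutoCommit(false)

    val insert = conn.prepareStatement("INSERT INTO results(k, v) VALUES (?, ?)")
    rows.foreach { case (k, v) =>
      insert.setString(1, k); insert.setLong(2, v); insert.addBatch()
    }
    insert.executeBatch()

    // The offsets (not just the batch id) ride along in the same transaction,
    // so the job can be restarted from a known point even if the checkpoint is lost.
    val progress = conn.prepareStatement(
      "UPDATE stream_progress SET batch_id = ?, end_offsets = ? WHERE sink_id = 'my-sink'")
    progress.setLong(1, batchId)
    progress.setString(2, endOffsetsJson)
    progress.executeUpdate()

    conn.commit()
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}
{code}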

> Sinks need access to offset representation
> --
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset 
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not 
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as 
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for 
> the starting and ending offsets (either the offsets themselves, or the 
> SPARK-17829 string/json representation).  That would be an API change, but if 
> there's another way to map batch ids to offset representations without 
> changing the Sink api that would work as well.  
> I'm assuming we don't need the same level of access to offsets throughout a 
> job as e.g. the Kafka dstream gives, because Sinks are the main place that 
> should need them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637622#comment-15637622
 ] 

Michael Armbrust commented on SPARK-18277:
--

We've been talking about better support for nested data for 2.2, [SPARK-16483].

> na.fill() and friends should work on struct fields
> --
>
> Key: SPARK-18277
> URL: https://issues.apache.org/jira/browse/SPARK-18277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It appears that you cannot use {{fill()}} and friends to quickly modify 
> struct fields.
> For example:
> {code}
> >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
> >>> Row(a=Row(b=None), c=None)])
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> >>> df.show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   null|
> +---+---+
> >>> df.na.fill('').show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   |
> +---+---+
> {code}
> {{c}} got filled in, but {{a.b}} didn't.
> I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
> friends worked automatically on struct fields.
> As things are today, there doesn't appear to be a way to fill in null values 
> inside structs. If you try {{when()}}, you realize that you cannot do 
> {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the 
> appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, 
> '')}} it doesn't catch the null values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14387) Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc

2016-11-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14387:
--
Target Version/s:   (was: 2.0.2)

> Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
> -
>
> Key: SPARK-14387
> URL: https://issues.apache.org/jira/browse/SPARK-14387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>
> In the master branch, I tried to run TPC-DS queries (e.g. Query27) at 200 GB 
> scale. Initially I got the following exception (as FileScanRDD has been made 
> the default in the master branch):
> {noformat}
> 16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0. 
> java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361)
> {noformat}
> When running with "spark.sql.sources.fileScan=false", the following exception is 
> thrown:
> {noformat}
> 16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING,
> java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist.
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at 
> org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124)
> at 
> 

[jira] [Comment Edited] (SPARK-18266) Update R vignettes and programming guide for 2.1.0 release

2016-11-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637604#comment-15637604
 ] 

Felix Cheung edited comment on SPARK-18266 at 11/4/16 8:38 PM:
---

I'm not sure it is, actually. If I recall there shouldn't be a lot of changes 
to the SparkR programming guide for this release - probably just add to the 
list of ML.
For vignettes, we need to add to the list the new ML functionality we are 
adding (including explanation, examples and so on).

In fact spark.logit is one of the new things we need to document. You are 
welcome to take this on or split this up to start on this.



was (Author: felixcheung):
I'm not sure it is, actually. If I recall there shouldn't be a lot of (if any?) 
changes to the SparkR programming guide for this release.
For vignettes, we need to add to the list the new ML functionality we are 
adding (including explanation, examples and so on).

In fact spark.logit is one of the new things we need to document. You are 
welcome to take this on or split this up to start on this.


> Update R vignettes and programming guide for 2.1.0 release
> --
>
> Key: SPARK-18266
> URL: https://issues.apache.org/jira/browse/SPARK-18266
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame

2016-11-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-14241:
--
Assignee: Cheng Lian

> Output of monotonically_increasing_id lacks stable relation with rows of 
> DataFrame
> --
>
> Key: SPARK-14241
> URL: https://issues.apache.org/jira/browse/SPARK-14241
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Paul Shearer
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> If you use monotonically_increasing_id() to append a column of IDs to a 
> DataFrame, the IDs do not have a stable, deterministic relationship to the 
> rows they are appended to. A given ID value can land on different rows 
> depending on what happens in the task graph:
> http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321
> From a user perspective this behavior is very unexpected, and many things one 
> would normally like to do with an ID column are in fact only possible under 
> very narrow circumstances. The function should either be made deterministic, 
> or there should be a prominent warning note in the API docs regarding its 
> behavior.
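
A hedged sketch of the zipWithIndex-based workaround referenced in the linked Stack Overflow answer: it produces a stable, contiguous row index at the cost of a detour through the RDD API (the names below are illustrative):

{code}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val spark = SparkSession.builder().appName("stable-ids-sketch").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c").toDF("x")

// zipWithIndex runs a job to count partition sizes, then assigns a contiguous
// index within the RDD's current ordering.
val withIds = spark.createDataFrame(
  df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
)
withIds.show()
{code}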



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18266) Update R vignettes and programming guide for 2.1.0 release

2016-11-04 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637604#comment-15637604
 ] 

Felix Cheung commented on SPARK-18266:
--

I'm not sure it is, actually. If I recall there shouldn't be a lot of (if any?) 
changes to the SparkR programming guide for this release.
For vignettes, we need to add to the list the new ML functionality we are 
adding (including explanation, examples and so on).

In fact spark.logit is one of the new things we need to document. You are 
welcome to take this on or split this up to start on this.


> Update R vignettes and programming guide for 2.1.0 release
> --
>
> Key: SPARK-18266
> URL: https://issues.apache.org/jira/browse/SPARK-18266
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result

2016-11-04 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-17337:
-

Assignee: Herman van Hovell

> Incomplete algorithm for name resolution in Catalyst parser may lead to 
> incorrect result
> ---
>
> Key: SPARK-17337
> URL: https://issues.apache.org/jira/browse/SPARK-17337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Herman van Hovell
>  Labels: correctness
> Fix For: 2.1.0
>
>
> While investigating SPARK-16951, I found an incorrect-results case from a NOT 
> IN subquery. I originally thought it was an edge case; further investigation 
> found this is a more general problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637600#comment-15637600
 ] 

Apache Spark commented on SPARK-17337:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15772

> Incomplete algorithm for name resolution in Catalyst parser may lead to 
> incorrect result
> ---
>
> Key: SPARK-17337
> URL: https://issues.apache.org/jira/browse/SPARK-17337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
> Fix For: 2.1.0
>
>
> While investigating SPARK-16951, I found an incorrect-results case from a NOT 
> IN subquery. I originally thought it was an edge case; further investigation 
> found this is a more general problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-04 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637602#comment-15637602
 ] 

Anirudh Ramanathan commented on SPARK-18278:


Corresponding issue in kubernetes:
https://github.com/kubernetes/kubernetes/issues/34377

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result

2016-11-04 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17337.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Incomplete algorithm for name resolution in Catalyst parser may lead to 
> incorrect result
> ---
>
> Key: SPARK-17337
> URL: https://issues.apache.org/jira/browse/SPARK-17337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Herman van Hovell
>  Labels: correctness
> Fix For: 2.1.0
>
>
> While investigating SPARK-16951, I found an incorrect-results case from a NOT 
> IN subquery. I originally thought it was an edge case; further investigation 
> found this is a more general problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637595#comment-15637595
 ] 

Michael Armbrust commented on SPARK-18258:
--

What sort of failures are you anticipating here?

> Sinks need access to offset representation
> --
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset 
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not 
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as 
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for 
> the starting and ending offsets (either the offsets themselves, or the 
> SPARK-17829 string/json representation).  That would be an API change, but if 
> there's another way to map batch ids to offset representations without 
> changing the Sink api that would work as well.  
> I'm assuming we don't need the same level of access to offsets throughout a 
> job as e.g. the Kafka dstream gives, because Sinks are the main place that 
> should need them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-04 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637591#comment-15637591
 ] 

Erik Erlandson commented on SPARK-18278:


Current prototype:
https://github.com/foxish/spark/tree/k8s-support
https://github.com/foxish/spark/pull/1

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executor lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637576#comment-15637576
 ] 

Cody Koeninger commented on SPARK-18258:


The sink doesn't have to reason about equality of the representations.

It just has to be able to store those representations, in addition to the batch 
id if necessary, so that the job can be recovered if Spark fails in a way that 
renders the batch id meaningless, or if the user wants to switch to a different 
streaming system.

> Sinks need access to offset representation
> --
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset 
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not 
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as 
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for 
> the starting and ending offsets (either the offsets themselves, or the 
> SPARK-17829 string/json representation).  That would be an API change, but if 
> there's another way to map batch ids to offset representations without 
> changing the Sink api that would work as well.  
> I'm assuming we don't need the same level of access to offsets throughout a 
> job as e.g. the Kafka dstream gives, because Sinks are the main place that 
> should need them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637566#comment-15637566
 ] 

Nicholas Chammas edited comment on SPARK-18277 at 11/4/16 8:25 PM:
---

[~marmbrus] / [~yhuai]: Is there a workaround for this available today? Also, 
do you think {{fill()}} should fit this use case down the line?


was (Author: nchammas):
[~marmbrus] / [~yhuai]: Is there is workaround for this available today? Also, 
do you think {{fill()}} should fit this use case down the line?

> na.fill() and friends should work on struct fields
> --
>
> Key: SPARK-18277
> URL: https://issues.apache.org/jira/browse/SPARK-18277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It appears that you cannot use {{fill()}} and friends to quickly modify 
> struct fields.
> For example:
> {code}
> >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
> >>> Row(a=Row(b=None), c=None)])
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> >>> df.show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   null|
> +---+---+
> >>> df.na.fill('').show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   |
> +---+---+
> {code}
> {{c}} got filled in, but {{a.b}} didn't.
> I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
> friends worked automatically on struct fields.
> As things are today, there doesn't appear to be a way to fill in null values 
> inside structs. If you try {{when()}}, you realize that you cannot do 
> {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the 
> appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, 
> '')}} it doesn't catch the null values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-04 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-18278:
--

 Summary: Support native submission of spark jobs to a kubernetes 
cluster
 Key: SPARK-18278
 URL: https://issues.apache.org/jira/browse/SPARK-18278
 Project: Spark
  Issue Type: Umbrella
  Components: Build, Deploy, Documentation, Scheduler, Spark Core
Affects Versions: 2.2.0
Reporter: Erik Erlandson


A new Apache Spark sub-project that enables native support for submitting Spark 
applications to a kubernetes cluster.   The submitted application runs in a 
driver executing on a kubernetes pod, and executor lifecycles are also managed 
as pods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-18277:
-
Description: 
It appears that you cannot use {{fill()}} and friends to quickly modify struct 
fields.

For example:

{code}
>>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
>>> Row(a=Row(b=None), c=None)])
>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 ||-- b: string (nullable = true)
 |-- c: string (nullable = true)

>>> df.show()
+---+---+
|  a|  c|
+---+---+
|[yeah yeah]|alright|
| [null]|   null|
+---+---+

>>> df.na.fill('').show()
+---+---+
|  a|  c|
+---+---+
|[yeah yeah]|alright|
| [null]|   |
+---+---+
{code}

{{c}} got filled in, but {{a.b}} didn't.

I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
friends worked automatically on struct fields.

As things are today, there doesn't appear to be a way to fill in null values 
inside structs. If you try {{when()}}, you realize that you cannot do 
{{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the 
appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, 
'')}} it doesn't catch the null values.

  was:
It appears that you cannot use {{fill()}} and friends to quickly modify struct 
fields.

For example:

{code}
>>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
>>> Row(a=Row(b=None), c=None)])
>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 ||-- b: string (nullable = true)
 |-- c: string (nullable = true)

>>> df.show()
+---+---+
|  a|  c|
+---+---+
|[yeah yeah]|alright|
| [null]|   null|
+---+---+

>>> df.na.fill('').show()
+---+---+
|  a|  c|
+---+---+
|[yeah yeah]|alright|
| [null]|   |
+---+---+
{code}

{{c}} got filled in, but {{a.b}} didn't.

I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
friends worked automatically on struct fields.

As things are today, there doesn't appear to be a way to fill in null values 
inside structs.

If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is 
None, '')}} because {{Column}} doesn't implement the appropriate protocol for 
{{is}}, and if you try {{when(col('a.b') == None, '')}} it doesn't catch the 
null values.


> na.fill() and friends should work on struct fields
> --
>
> Key: SPARK-18277
> URL: https://issues.apache.org/jira/browse/SPARK-18277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It appears that you cannot use {{fill()}} and friends to quickly modify 
> struct fields.
> For example:
> {code}
> >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
> >>> Row(a=Row(b=None), c=None)])
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> >>> df.show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   null|
> +---+---+
> >>> df.na.fill('').show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   |
> +---+---+
> {code}
> {{c}} got filled in, but {{a.b}} didn't.
> I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
> friends worked automatically on struct fields.
> As things are today, there doesn't appear to be a way to fill in null values 
> inside structs. If you try {{when()}}, you realize that you cannot do 
> {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the 
> appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, 
> '')}} it doesn't catch the null values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637566#comment-15637566
 ] 

Nicholas Chammas commented on SPARK-18277:
--

[~marmbrus] / [~yhuai]: Is there is workaround for this available today? Also, 
do you think {{fill()}} should fit this use case down the line?

> na.fill() and friends should work on struct fields
> --
>
> Key: SPARK-18277
> URL: https://issues.apache.org/jira/browse/SPARK-18277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It appears that you cannot use {{fill()}} and friends to quickly modify 
> struct fields.
> For example:
> {code}
> >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
> >>> Row(a=Row(b=None), c=None)])
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> >>> df.show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   null|
> +---+---+
> >>> df.na.fill('').show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   |
> +---+---+
> {code}
> {{c}} got filled in, but {{a.b}} didn't.
> I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
> friends worked automatically on struct fields.
> As things are today, there doesn't appear to be a way to fill in null values 
> inside structs.
> If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is 
> None, '')}} because {{Column}} doesn't implement the appropriate protocol for 
> {{is}}, and if you try {{when(col('a.b') == None, '')}} it doesn't catch the 
> null values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-18277:
-
Description: 
It appears that you cannot use {{fill()}} and friends to quickly modify struct 
fields.

For example:

{code}
>>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
>>> Row(a=Row(b=None), c=None)])
>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 ||-- b: string (nullable = true)
 |-- c: string (nullable = true)

>>> df.show()
+---+---+
|  a|  c|
+---+---+
|[yeah yeah]|alright|
| [null]|   null|
+---+---+

>>> df.na.fill('').show()
+---+---+
|  a|  c|
+---+---+
|[yeah yeah]|alright|
| [null]|   |
+---+---+
{code}

{{c}} got filled in, but {{a.b}} didn't.

I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
friends worked automatically on struct fields.

As things are today, there doesn't appear to be a way to fill in null values 
inside structs.

If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is 
None, '')}} because {{Column}} doesn't implement the appropriate protocol for 
{{is}}, and if you try {{when(col('a.b') == None, '')}} it doesn't catch the 
null values.

  was:
It appears that you cannot use {{fill()}} and friends to quickly modify struct 
fields.

For example:

{code}
>>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
>>> Row(a=Row(b=None), c=None)])
>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 ||-- b: string (nullable = true)
 |-- c: string (nullable = true)

>>> df.show()
+---+---+
|  a|  c|
+---+---+
|[yeah yeah]|alright|
| [null]|   null|
+---+---+

>>> df.na.fill('').show()
+---+---+
|  a|  c|
+---+---+
|[yeah yeah]|alright|
| [null]|   |
+---+---+
{code}

{{c}} got filled in, but {{a.b}} didn't.

I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
friends worked automatically on struct fields.

As things are today, it appears that you have to manually unpack and rebuild 
structs to do things like fill in missing field values.


> na.fill() and friends should work on struct fields
> --
>
> Key: SPARK-18277
> URL: https://issues.apache.org/jira/browse/SPARK-18277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It appears that you cannot use {{fill()}} and friends to quickly modify 
> struct fields.
> For example:
> {code}
> >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
> >>> Row(a=Row(b=None), c=None)])
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: string (nullable = true)
>  |-- c: string (nullable = true)
> >>> df.show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   null|
> +---+---+
> >>> df.na.fill('').show()
> +---+---+
> |  a|  c|
> +---+---+
> |[yeah yeah]|alright|
> | [null]|   |
> +---+---+
> {code}
> {{c}} got filled in, but {{a.b}} didn't.
> I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
> friends worked automatically on struct fields.
> As things are today, there doesn't appear to be a way to fill in null values 
> inside structs.
> If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is 
> None, '')}} because {{Column}} doesn't implement the appropriate protocol for 
> {{is}}, and if you try {{when(col('a.b') == None, '')}} it doesn't catch the 
> null values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-11-04 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637493#comment-15637493
 ] 

Yun Ni commented on SPARK-18081:


This is super helpful. Thanks!

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18277) na.fill() and friends should work on struct fields

2016-11-04 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18277:


 Summary: na.fill() and friends should work on struct fields
 Key: SPARK-18277
 URL: https://issues.apache.org/jira/browse/SPARK-18277
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Nicholas Chammas
Priority: Minor


It appears that you cannot use {{fill()}} and friends to quickly modify struct 
fields.

For example:

{code}
>>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), 
>>> Row(a=Row(b=None), c=None)])
>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 ||-- b: string (nullable = true)
 |-- c: string (nullable = true)

>>> df.show()
+---+---+
|  a|  c|
+---+---+
|[yeah yeah]|alright|
| [null]|   null|
+---+---+

>>> df.na.fill('').show()
+---+---+
|  a|  c|
+---+---+
|[yeah yeah]|alright|
| [null]|   |
+---+---+
{code}

{{c}} got filled in, but {{a.b}} didn't.

I don't know if it's "appropriate", but it would be nice if {{fill()}} and 
friends worked automatically on struct fields.

As things are today, it appears that you have to manually unpack and rebuild 
structs to do things like fill in missing field values.
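
A minimal Scala sketch of that manual rebuild, written against the schema from the example above (assumption: the same approach carries over to PySpark, where {{col('a.b').isNull()}} is the null test to use inside {{when()}}):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit, struct}

object FillStructField {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("fill-struct").getOrCreate()
    import spark.implicits._

    // Recreate the example schema: a: struct<b: string>, c: string
    val df = Seq((Some("yeah yeah"), Some("alright")), (None, None))
      .toDF("b", "c")
      .select(struct(col("b")).as("a"), col("c"))

    val filled = df
      .na.fill("", Seq("c"))                                           // top-level column: fill() works
      .withColumn("a", struct(coalesce(col("a.b"), lit("")).as("b")))  // struct field: rebuild by hand

    filled.show()
    spark.stop()
  }
}
{code}

The rebuild is tolerable for a one-field struct but becomes boilerplate for wider or nested structs, which is the motivation for the request.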



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637491#comment-15637491
 ] 

Michael Armbrust commented on SPARK-18258:
--

I agree that we don't want to lock people in, which is why a goal of 
[SPARK-17829] was to make the offset representation user-readable.

Given that the offsets are accessible in this way (we should probably document this 
and make it a long-term contract), I don't think we want to widen the Sink API to 
expose the internal details of the various sources. Exposing more than we need to 
leaks details and could lead to a more brittle system. This is why I think it's 
safer to use {{batchId}} as a proxy to achieve transactional semantics.

Consider the case where some source returns the offsets {{a: 1, b: 2}} but upon 
recovery returns {{b: 2, a: 1}}. This is a little weird, but as long as the source 
implements {{getBatch}} correctly, there is no correctness issue in that Source. 
With this proposal, however, the sink would be responsible for reasoning about 
equality of these representations; in contrast, it's trivial to reason about 
equality in the current API.
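
A minimal sketch of the batchId-as-proxy idea, assuming a hypothetical transactional store ({{TxStore}}/{{Tx}} below are invented for illustration; only the {{Sink}} trait itself is Spark's):

{code}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.execution.streaming.Sink

// Hypothetical store that can write rows and a watermark in one transaction.
trait Tx {
  def lastCommittedBatchId(): Long
  def writeRows(rows: Array[Row]): Unit
  def setLastCommittedBatchId(batchId: Long): Unit
}
trait TxStore {
  def withTransaction(body: Tx => Unit): Unit
}

class IdempotentSink(store: TxStore) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    store.withTransaction { tx =>
      // A replayed batch (e.g. after recovery) arrives with the same batchId, so it
      // can be detected and skipped without comparing offset representations.
      if (batchId > tx.lastCommittedBatchId()) {
        tx.writeRows(data.collect())
        tx.setLastCommittedBatchId(batchId)
      }
    }
  }
}
{code}

Equality and ordering on a {{Long}} batch id are trivial to reason about, which is the contrast being drawn with raw offset representations.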

> Sinks need access to offset representation
> --
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset 
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not 
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as 
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for 
> the starting and ending offsets (either the offsets themselves, or the 
> SPARK-17829 string/json representation).  That would be an API change, but if 
> there's another way to map batch ids to offset representations without 
> changing the Sink api that would work as well.  
> I'm assuming we don't need the same level of access to offsets throughout a 
> job as e.g. the Kafka dstream gives, because Sinks are the main place that 
> should need them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18276) Some ML training summaries are not copied when {{copy()}} is called.

2016-11-04 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-18276:


 Summary: Some ML training summaries are not copied when {{copy()}} 
is called.
 Key: SPARK-18276
 URL: https://issues.apache.org/jira/browse/SPARK-18276
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Seth Hendrickson
Priority: Minor


GaussianMixture, KMeans, BisectingKMeans, and GeneralizedLinearRegression 
models do not copy their training summaries inside the {{copy}} method. In 
contrast, Linear/Logistic regression models do. They should all be consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18258) Sinks need access to offset representation

2016-11-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18258:
-
Target Version/s: 2.2.0

> Sinks need access to offset representation
> --
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset 
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not 
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as 
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for 
> the starting and ending offsets (either the offsets themselves, or the 
> SPARK-17829 string/json representation).  That would be an API change, but if 
> there's another way to map batch ids to offset representations without 
> changing the Sink api that would work as well.  
> I'm assuming we don't need the same level of access to offsets throughout a 
> job as e.g. the Kafka dstream gives, because Sinks are the main place that 
> should need them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-11-04 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637424#comment-15637424
 ] 

Seth Hendrickson commented on SPARK-18081:
--

No worries, just wanted to check in to see if you had bandwidth to do it.

You can get a preview of the user guide by building the docs with Jekyll: run 
{{SKIP_API=1 jekyll build}} inside the docs directory. For more detail, please 
see [the readme|https://github.com/apache/spark/tree/master/docs].

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18197) Optimise AppendOnlyMap implementation

2016-11-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18197.
-
   Resolution: Fixed
 Assignee: Adam Roberts
Fix Version/s: 2.1.0

> Optimise AppendOnlyMap implementation
> -
>
> Key: SPARK-18197
> URL: https://issues.apache.org/jira/browse/SPARK-18197
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Adam Roberts
>Assignee: Adam Roberts
>Priority: Minor
> Fix For: 2.1.0
>
>
> This improvement works by using the cheapest comparison test first; we observed 
> a 1% performance improvement on PageRank (HiBench large) with this change.
> tprof output before the change follows; AppendOnlyMap.changeValue is where the 
> optimisation occurs:
> {code}
> PID 256059 22.86java_1fa29
> MOD  81337  7.26 JITCODE
>  SYM  11250  1.00  
> java/io/ObjectOutputStream.writeObject0(Ljava/lang/Object;Z)V_7fe098983af4
>  SYM   8053  0.72  
> org/apache/spark/util/collection/AppendOnlyMap.changeValue(Ljava/lang/Object;Lscala/Function2;)Ljava/lang/Object;_7fe098c211e8
>  SYM   5175  0.46  
> java/lang/String.equals(Ljava/lang/Object;)Z_7fe0989eb2e8
>  SYM   3616  0.32  
> org/apache/spark/util/SizeEstimator$.estimate(Ljava/lang/Object;Ljava/util/IdentityHashMap;)J_7fe098bc35a8
>  SYM   3235  0.29  
> org/apache/spark/util/collection/ExternalSorter$$anonfun$4$$anon$6.compare(Ljava/lang/Object;Ljava/lang/Object;)I_7fe098c855a8
>  SYM   3182  0.28  
> java/io/ObjectInputStream$BlockDataInputStream.readUTFBody(J)Ljava/lang/String;_7fe098980ec8
>  SYM   3111  0.28  
> org/apache/spark/util/SizeEstimator$SearchState.enqueue(Ljava/lang/Object;)V_7fe0989f2920
> {code}
> tprof after the change
> {code}
> MOD  56804  5.07 JITCODE
>  SYM   8766  0.78  
> java/io/ObjectOutputStream.writeObject0(Ljava/lang/Object;Z)V_7f0088bb2034
>  SYM   5746  0.51  
> java/io/ObjectStreamClass.lookup(Ljava/lang/Class;Z)Ljava/io/ObjectStreamClass;_7f0088944ae8
>  SYM   3378  0.30  
> java/io/ObjectInputStream.readObject0(Z)Ljava/lang/Object;_7f0088c5ed00
>  SYM   3121  0.28  
> java/io/ObjectInputStream$BlockDataInputStream.readUTFBody(J)Ljava/lang/String;_7f0088c6de08
>  SYM   2857  0.26  
> org/apache/spark/storage/BufferReleasingInputStream.read([BII)I_7f0088b3e3a8
>  SYM   2786  0.25  
> org/apache/spark/util/collection/AppendOnlyMap.changeValue(Ljava/lang/Object;Lscala/Function2;)Ljava/lang/Object;_7f008899b048
> {code}
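
An illustration of the general idea only (a sketch, not the actual patch): order the key comparison so the cheap reference-equality test runs, and can short-circuit, before the potentially expensive {{equals}} call.

{code}
object CheapestFirst {
  // eq is a reference comparison (cheap); == delegates to equals, which may walk the object.
  def sameKey(k: AnyRef, curKey: AnyRef): Boolean =
    (k eq curKey) || k == curKey

  def main(args: Array[String]): Unit = {
    val key = "pagerank-vertex-42"
    println(sameKey(key, key))              // true via the cheap eq test alone
    println(sameKey(key, new String(key)))  // true, but only after equals runs
  }
}
{code}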



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe

2016-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18260:


Assignee: Apache Spark

> from_json can throw a better exception when it can't find the column or be 
> nullSafe
> ---
>
> Key: SPARK-18260
> URL: https://issues.apache.org/jira/browse/SPARK-18260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>Priority: Blocker
>
> I got this exception:
> {code}
> SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 
> failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID 
> 74170, 10.0.138.84, executor 2): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
> {code}
> This was because the column that I called `from_json` on didn't exist for all 
> of my rows. Either from_json should be null safe, or it should fail with a 
> better error message
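
Until that happens, a possible user-side guard is to only parse non-null values. A sketch follows; the column name and schema are made up, and {{from_json}} assumes Spark 2.1+:

{code}
import org.apache.spark.sql.functions.{col, from_json, when}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// df is assumed to have a nullable string column "payload" holding JSON.
val schema = new StructType().add("id", LongType).add("name", StringType)
val parsed = df.withColumn(
  "parsed",
  when(col("payload").isNotNull, from_json(col("payload"), schema)))  // null rows stay null instead of failing
{code}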



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637290#comment-15637290
 ] 

Apache Spark commented on SPARK-18260:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15771

> from_json can throw a better exception when it can't find the column or be 
> nullSafe
> ---
>
> Key: SPARK-18260
> URL: https://issues.apache.org/jira/browse/SPARK-18260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Burak Yavuz
>Priority: Blocker
>
> I got this exception:
> {code}
> SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 
> failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID 
> 74170, 10.0.138.84, executor 2): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
> {code}
> This was because the column that I called `from_json` on didn't exist for all 
> of my rows. Either from_json should be null safe, or it should fail with a 
> better error message



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe

2016-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18260:


Assignee: (was: Apache Spark)

> from_json can throw a better exception when it can't find the column or be 
> nullSafe
> ---
>
> Key: SPARK-18260
> URL: https://issues.apache.org/jira/browse/SPARK-18260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Burak Yavuz
>Priority: Blocker
>
> I got this exception:
> {code}
> SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 
> failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID 
> 74170, 10.0.138.84, executor 2): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
> {code}
> This was because the column that I called `from_json` on didn't exist for all 
> of my rows. Either from_json should be null safe, or it should fail with a 
> better error message



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed

2016-11-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18273:
--
Priority: Minor  (was: Major)

I'm not sure it's worth the complexity. How about passing a glob parameter?
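
For example, a single glob keeps the driver from checking N explicit paths one by one (a sketch; the bucket layout and format below are illustrative):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("glob-load").getOrCreate()
// One glob pattern instead of load(path1, path2, ..., pathN): path expansion and
// existence checks happen in a single listing pass.
val df = spark.read.format("parquet").load("s3a://my-bucket/events/2016/*/*/part-*")
{code}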

> DataFrameReader.load takes a lot of time to start the job if a lot of 
> file/dir paths are passed 
> --
>
> Key: SPARK-18273
> URL: https://issues.apache.org/jira/browse/SPARK-18273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> If the paths Seq parameter contains a lot of elements, then 
> DataFrameReader.load takes a lot of time starting the job as it attempts to 
> check that each of the paths exists using fs.exists. There should be a boolean 
> configuration option to disable the check for path existence, and it should be 
> passed in as a parameter to the DataSource.resolveRelation call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18266) Update R vignettes and programming guide for 2.1.0 release

2016-11-04 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637204#comment-15637204
 ] 

Miao Wang commented on SPARK-18266:
---

[~felixcheung] Is this an umbrella JIRA?

> Update R vignettes and programming guide for 2.1.0 release
> --
>
> Key: SPARK-18266
> URL: https://issues.apache.org/jira/browse/SPARK-18266
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18193) queueStream not updated if rddQueue.add after create queueStream in Java

2016-11-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18193.
---
Resolution: Not A Problem

I looked into this further and found that it does work to add RDDs after it's 
started in Scala, but not Java. This is because the Java implementation makes a 
copy of the queue. This is consistent with the docs and examples actually. So 
I'm re-closing as not a problem; I do not believe this should be reopened.
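
A sketch of the Scala behaviour described above, assuming a local streaming context (contents are illustrative): the DStream polls the same mutable {{Queue}} instance, so RDDs enqueued after {{start()}} are still consumed.

{code}
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueAfterStart {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("queue-after-start"), Seconds(1))

    val rddQueue = new mutable.Queue[RDD[Int]]()
    ssc.queueStream(rddQueue).count().print()

    ssc.start()
    for (_ <- 1 to 30) {
      rddQueue += ssc.sparkContext.makeRDD(1 to 100)  // enqueued after start(), still picked up in Scala
    }
    ssc.awaitTerminationOrTimeout(10000)
    ssc.stop()
  }
}
{code}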

> queueStream not updated if rddQueue.add after create queueStream in Java
> 
>
> Key: SPARK-18193
> URL: https://issues.apache.org/jira/browse/SPARK-18193
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Hubert Kang
>
> Within 
> examples\src\main\java\org\apache\spark\examples\streaming\JavaQueueStream.java,
> no data is detected if the code below, which puts something into rddQueue, is 
> executed after the queueStream is created (line 65):
> for (int i = 0; i < 30; i++) {
>   rddQueue.add(ssc.sparkContext().parallelize(list));
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema

2016-11-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18065.
---
Resolution: Not A Problem

> Spark 2 allows filter/where on columns not in current schema
> 
>
> Key: SPARK-18065
> URL: https://issues.apache.org/jira/browse/SPARK-18065
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Matthew Scruggs
>Priority: Minor
>
> I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a 
> DataFrame that previously had a column, but no longer has it in its schema 
> due to a select() operation.
> In Spark 1.6.2, in spark-shell, we see that an exception is thrown when 
> attempting to filter/where using the selected-out column:
> {code:title=Spark 1.6.2}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.2
>   /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), 
> (2, "two")))).selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---++
> | id|word|
> +---++
> |  1| one|
> |  2| two|
> +---++
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
>  |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input 
> columns: [id];
> {code}
> However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds 
> (no AnalysisException) and seems to filter out data as if the column remains:
> {code:title=Spark 2.0.1}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.1
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val df1 = sc.parallelize(Seq((1, "one"), (2, 
> "two"))).toDF().selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---++
> | id|word|
> +---++
> |  1| one|
> |  2| two|
> +---++
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
>  |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> +---+
> | id|
> +---+
> |  1|
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame

2016-11-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17969.
---
Resolution: Won't Fix

> I think it's user unfriendly to process standard json file with DataFrame 
> --
>
> Key: SPARK-17969
> URL: https://issues.apache.org/jira/browse/SPARK-17969
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jianfei Wang
>Priority: Minor
>
> Currently, with the DataFrame API, we can't load a standard (multi-line) JSON 
> file directly. Maybe we can provide an overload to handle this; the logic would 
> be roughly as below:
> {code}
> val files = spark.sparkContext.wholeTextFiles("data/test.json")
> // drop the file-name half of each (path, content) pair and collapse whitespace
> // so each file becomes a single-line JSON record
> val json_rdd = files.map { case (_, content) => content.replaceAll("\\s+", "") }
> val json_df = spark.read.json(json_rdd)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17945) Writing to S3 should allow setting object metadata

2016-11-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17945.
---
Resolution: Won't Fix

> Writing to S3 should allow setting object metadata
> --
>
> Key: SPARK-17945
> URL: https://issues.apache.org/jira/browse/SPARK-17945
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Jeff Schobelock
>Priority: Minor
>
> I can't find any way to use Spark to write to S3 and set user object metadata. 
> This seems like such a simple thing that I feel I must be missing how to do it, 
> but I have yet to find anything.
> I don't know what work adding this would entail. My idea would be that there is 
> something like:
> rdd.saveAsTextFile(s3://testbucket/file).withMetadata(Map data).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-11-04 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637189#comment-15637189
 ] 

Miao Wang commented on SPARK-15784:
---

I created a new PR to implement PIC as a Transformer.

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst paser may lead to incorrect result

2016-11-04 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637186#comment-15637186
 ] 

Nattavut Sutyanyong commented on SPARK-17337:
-

Totally agreed on your approach. We should close off any potential incorrect 
results as soon as possible.

Thanks for the advice on the approach to different/competing PRs. I am relatively 
new to the community and am trying to be careful not to break any community 
etiquette.

> Incomplete algorithm for name resolution in Catalyst paser may lead to 
> incorrect result
> ---
>
> Key: SPARK-17337
> URL: https://issues.apache.org/jira/browse/SPARK-17337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> While investigating SPARK-16951, I found an incorrect results case from a NOT 
> IN subquery. I thought originally it is an edge case. Further investigation 
> found this is a more general problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst paser may lead to incorrect result

2016-11-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637167#comment-15637167
 ] 

Herman van Hovell commented on SPARK-17337:
---

[~nsyca] You are right that this is part of a larger set of problems which we 
should address, and the number of attempts made to fix it shows that it is 
non-trivial. It is not a problem to have different (competing) PRs for the same 
issue; this usually leads to better solutions. So please submit a PR if you feel 
that you have a better solution.

However, for this ticket I just want to solve the correctness issue at hand.

> Incomplete algorithm for name resolution in Catalyst paser may lead to 
> incorrect result
> ---
>
> Key: SPARK-17337
> URL: https://issues.apache.org/jira/browse/SPARK-17337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> While investigating SPARK-16951, I found an incorrect results case from a NOT 
> IN subquery. I thought originally it is an edge case. Further investigation 
> found this is a more general problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637132#comment-15637132
 ] 

Apache Spark commented on SPARK-15784:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/15770

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18128) Add support for publishing to PyPI

2016-11-04 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-18128:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-18267

> Add support for publishing to PyPI
> --
>
> Key: SPARK-18128
> URL: https://issues.apache.org/jira/browse/SPARK-18128
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: holdenk
>
> After SPARK-1267 is done we should add support for publishing to PyPI similar 
> to how we publish to maven central.
> Note: one of the open questions is what to do about package name since 
> someone has registered the package name PySpark on PyPI - we could use 
> ApachePySpark or we could try to find who registered PySpark and get 
> them to transfer it to us (since they haven't published anything so maybe 
> fine?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI

2016-11-04 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637110#comment-15637110
 ] 

holdenk commented on SPARK-18128:
-

When I e-mailed [~prabinb] earlier this week I got an out-of-office auto 
response, so if we don't hear back we should maybe ping again next week :)

> Add support for publishing to PyPI
> --
>
> Key: SPARK-18128
> URL: https://issues.apache.org/jira/browse/SPARK-18128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> After SPARK-1267 is done we should add support for publishing to PyPI similar 
> to how we publish to maven central.
> Note: one of the open questions is what to do about package name since 
> someone has registered the package name PySpark on PyPI - we could use 
> ApachePySpark or we could try to find who registered PySpark and get 
> them to transfer it to us (since they haven't published anything so maybe 
> fine?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI

2016-11-04 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637111#comment-15637111
 ] 

holdenk commented on SPARK-18128:
-

Sure

> Add support for publishing to PyPI
> --
>
> Key: SPARK-18128
> URL: https://issues.apache.org/jira/browse/SPARK-18128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> After SPARK-1267 is done we should add support for publishing to PyPI similar 
> to how we publish to maven central.
> Note: one of the open questions is what to do about package name since 
> someone has registered the package name PySpark on PyPI - we could use 
> ApachePySpark or we could try to find who registered PySpark and get 
> them to transfer it to us (since they haven't published anything so maybe 
> fine?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-11-04 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637102#comment-15637102
 ] 

Yun Ni commented on SPARK-18081:


Sorry, I was really overloaded this week. I will try my best to send a PR 
before EoW.

BTW, when I was writing ml-lsh.md, I could not find a way to get a preview. 
Do you have any guidelines for previewing the Spark docs?

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI

2016-11-04 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637104#comment-15637104
 ] 

holdenk commented on SPARK-18128:
-

Good call - publishing to the PyPI test instance has worked fine, but there might 
be a different limit in production, and/or we might be close to some file size 
limit that we will hit in the future, so it would be better to have an exemption 
in place first.

> Add support for publishing to PyPI
> --
>
> Key: SPARK-18128
> URL: https://issues.apache.org/jira/browse/SPARK-18128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> After SPARK-1267 is done we should add support for publishing to PyPI similar 
> to how we publish to maven central.
> Note: one of the open questions is what to do about package name since 
> someone has registered the package name PySpark on PyPI - we could use 
> ApachePySpark or we could try to find who registered PySpark and get 
> them to transfer it to us (since they haven't published anything so maybe 
> fine?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst paser may lead to incorrect result

2016-11-04 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17337:
--
Labels: correctness  (was: )

> Incomplete algorithm for name resolution in Catalyst paser may lead to 
> incorrect result
> ---
>
> Key: SPARK-17337
> URL: https://issues.apache.org/jira/browse/SPARK-17337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> While investigating SPARK-16951, I found an incorrect results case from a NOT 
> IN subquery. I thought originally it is an edge case. Further investigation 
> found this is a more general problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation

2016-11-04 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637099#comment-15637099
 ] 

Herman van Hovell commented on SPARK-17348:
---

Yeah, it would be nice if you could consolidate the two and put your current fix 
there as well.

> Incorrect results from subquery transformation
> --
>
> Key: SPARK-17348
> URL: https://issues.apache.org/jira/browse/SPARK-17348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> {noformat}
> Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
> Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= 
> t2.c2)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result of the above query should be an empty set. Here is an 
> explanation:
> Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when 
> T1.C1 = 1, so both rows need to be processed in the same group of the 
> aggregation process in the subquery. The result of the aggregation yields 
> MAX(T2.C1) as 2. Therefore, the result of the evaluation of the predicate 
> T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) should be an empty set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18275) Why does not use an ordered queue in takeOrdered?

2016-11-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18275.
---
Resolution: Not A Problem

Questions should go to user@

Priority queues are not necessarily sorted and in fact are generally not sorted 
because they use a heap representation internally. It's not true that inserts 
are O(log k) if you are maintaining a sorted data structure.
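
A small illustration of that point, using Scala's heap-backed {{PriorityQueue}} as a stand-in: the backing array honours only the heap invariant, so a final sort is still needed to get a fully ordered result.

{code}
import scala.collection.mutable

object HeapNotSorted {
  def main(args: Array[String]): Unit = {
    // Reverse ordering so the "smallest" element sits at the head, as in takeOrdered.
    val pq = mutable.PriorityQueue(5, 1, 4, 2, 3)(Ordering.Int.reverse)
    println(pq.toArray.mkString(", "))         // heap layout, generally not ascending
    println(pq.toArray.sorted.mkString(", "))  // 1, 2, 3, 4, 5
  }
}
{code}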

> Why does not use an ordered queue in takeOrdered?
> -
>
> Key: SPARK-18275
> URL: https://issues.apache.org/jira/browse/SPARK-18275
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: xubo245
>Priority: Minor
>
> In org.apache.spark.rdd.RDD#takeOrdered, every partition in mapRDDs is built as a 
> BoundedPriorityQueue object:
> val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
> After mapRDDs.reduce, only one queue remains, and it is a BoundedPriorityQueue. So, 
> after toArray, is it necessary to call sorted in takeOrdered? If we kept the queue 
> ordered, we would only need a reverse.
> Similarly, in org.apache.spark.util.collection.Utils#takeOrdered the leastOf method 
> also uses an unordered buffer. Why not use an ordered queue? We could insert an 
> element in O(log k), whereas the traditional quickselect algorithm takes O(k) time, 
> and we would not need a sort after selecting, saving O(k * log k).
> The leastOf being called is 
> com.google.common.collect.Ordering#leastOf(java.util.Iterator, int).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI

2016-11-04 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636944#comment-15636944
 ] 

Nicholas Chammas commented on SPARK-18128:
--

For the record: Let's also check with the PyPI admins if we need some kind of 
exemption on any file size limits.

> Add support for publishing to PyPI
> --
>
> Key: SPARK-18128
> URL: https://issues.apache.org/jira/browse/SPARK-18128
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>
> After SPARK-1267 is done we should add support for publishing to PyPI similar 
> to how we publish to maven central.
> Note: one of the open questions is what to do about package name since 
> someone has registered the package name PySpark on PyPI - we could use 
> ApachePySpark or we could try to find who registered PySpark and get 
> them to transfer it to us (since they haven't published anything so maybe 
> fine?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation

2016-11-04 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636886#comment-15636886
 ] 

Nattavut Sutyanyong commented on SPARK-17348:
-

I'd like to add a note that a piece of existing code that comes close to addressing 
this problem, but only for the scalar subquery context, can be found by searching 
for this pattern in CheckAnalysis.scala (line 126):
_"The correlated scalar subquery can only contain equality predicates:"_

Existing test coverage for that scenario is at
_SubquerySuite/non-equal correlated scalar subquery_

> Incorrect results from subquery transformation
> --
>
> Key: SPARK-17348
> URL: https://issues.apache.org/jira/browse/SPARK-17348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> {noformat}
> Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
> Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= 
> t2.c2)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result of the above query should be an empty set. Here is an 
> explanation:
> Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when 
> T1.C1 = 1, so both rows need to be processed in the same group of the 
> aggregation process in the subquery. The result of the aggregation yields 
> MAX(T2.C1) as 2. Therefore, the result of the evaluation of the predicate 
> T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) should be an empty set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2016-11-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636802#comment-15636802
 ] 

György Süveges edited comment on SPARK-4563 at 11/4/16 4:13 PM:


+1
However, this is related to SPARK_LOCAL_IP and the Spark UI, which I think is a 
similar problem.
We use AWS EMR5 (Spark 2.0 over YARN).
While developing we run Spark from IntelliJ in yarn-client mode. To connect to AWS, 
OpenVPN is used with bidirectional routing so the executors in Amazon can reach the 
driver on the local workstation. With OpenVPN there are multiple network interfaces, 
so SPARK_LOCAL_IP should be set. But it's really inconvenient to set SPARK_LOCAL_IP 
each time we connect to OpenVPN and get a different IP.
So we decided to set it automatically from code. Since SPARK_LOCAL_IP is an 
environment variable and environment variables are immutable in Java, we change the 
corresponding Java property, spark.driver.host. But as noted, spark.driver.host is 
only a second-class citizen in Spark; for some features only SPARK_LOCAL_IP is taken 
into account. For example, the Spark UI listens on the host set in SPARK_LOCAL_IP 
but ignores spark.driver.host entirely. I hope this improvement will fix that, as I 
saw in 
https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42

However, in 
[WebUI|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L139]
 on master I can still see places where only SPARK_LOCAL_IP is checked.
I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, 
shouldn't they?



was (Author: gyorgy_suve...@epam.com):
+1
However, it's related to SPARK_LOCAL_IP and Spark UI, I think it's similar.
We use AWS EMR5 (Spark 2.0 over YARN)
While developing we use spark from intellij in yarn-client mode. To connect to 
AWS, OpenVPN is applied with biderectional routing so the executors in amazon 
can access the driver on the local workstation. With OpenVPN there are multiple 
network interfaces then, so SPARK_LOCAL_IP should be set. But it's really 
inconvenient to set SPARK_LOCAL_IP each time we connect to OpenVPN and get 
different IP.
So we decided to set it automatically from code. Since SPARK_LOCAL_IP is env 
var and env vars are inmutable in java, we change the spark.driver.host 
corresponding java property. But as noticed spark.driver.host is only second 
order citizen in spark, for some features only the SPARK_LOCAL_IP is taken into 
account. For example Spark UI listens on the host set in SPARK_LOCAL_IP, but 
ignores the spark.driver.host totally. I hope this improvement will fix this 
issue, as I saw in 
https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42
I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, 
shouldn't they?
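
A sketch of the programmatic workaround described in the comment above ({{resolveVpnAddress}} is a hypothetical helper, and the master/deploy mode are assumed to come from the IDE or spark-submit configuration):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object DriverHostFromCode {
  // Placeholder only: a real implementation would look up the VPN-assigned address.
  def resolveVpnAddress(): String = "10.8.0.6"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("dev-from-ide")
      .set("spark.driver.host", resolveVpnAddress())  // set before the SparkContext is created
    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}
{code}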

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The Spark driver bind IP and advertised IP are not configurable. spark.driver.host 
> is only the bind IP. SPARK_PUBLIC_DNS does not work for the Spark driver. Allow an 
> option to set the advertised IP/hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2016-11-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636802#comment-15636802
 ] 

György Süveges edited comment on SPARK-4563 at 11/4/16 4:01 PM:


+1
However, it's related to SPARK_LOCAL_IP and Spark UI, I think it's similar.
We use AWS EMR5 (Spark 2.0 over YARN)
While developing we use spark from intellij in yarn-client mode. To connect to 
AWS, OpenVPN is applied with biderectional routing so the executors in amazon 
can access the driver on the local workstation. With OpenVPN there are multiple 
network interfaces then, so SPARK_LOCAL_IP should be set. But it's really 
inconvenient to set SPARK_LOCAL_IP each time we connect to OpenVPN and get 
different IP.
So we decided to set it automatically from code. Since SPARK_LOCAL_IP is env 
var and env vars are inmutable in java, we change the spark.driver.host 
corresponding java property. But as noticed spark.driver.host is only second 
order citizen in spark, for some features only the SPARK_LOCAL_IP is taken into 
account. For example Spark UI listens on the host set in SPARK_LOCAL_IP, but 
ignores the spark.driver.host totally. I hope this improvement will fix this 
issue, as I saw in 
https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42
I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, 
shouldn't they?


was (Author: gyorgy_suve...@epam.com):
+1
However, it's related to SPARK_LOCAL_IP and Spark UI, I think it's similar.
We use AWS EMR5 (Spark 2.0 over YARN)
While developing we use spark from intellij in yarn-client mode. To connect to 
AWS, OpenVPN is applied with biderectional routing so the executors in amazon 
can access the driver on the local workstation. With OpenVPN there are multiple 
network interfaces then, so SPARK_LOCAL_IP should be set. But it's really 
inconvenient to set SPARK_LOCAL_IP each time we connect to OpenVPN and get 
different IP.
So we decided to set it automatically from code. Since SPARK_LOCAL_IP is env 
var and env vars are inmutable in java, then we change the spark.driver.host 
corresponding java property. But as noticed spark.driver.host is only second 
order citizen in spark, for some features only the SPARK_LOCAL_IP is taken into 
account. For example Spark UI listens on the host set in SPARK_LOCAL_IP, but 
ignores the spark.driver.host totally. I hope this improvement will fix this 
issue, as I saw in 
https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42
I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, 
shouldn't they?

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The Spark driver bind IP and advertised IP are not configurable. spark.driver.host 
> is only the bind IP. SPARK_PUBLIC_DNS does not work for the Spark driver. Allow an 
> option to set the advertised IP/hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2016-11-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636802#comment-15636802
 ] 

György Süveges edited comment on SPARK-4563 at 11/4/16 4:00 PM:


+1
However, it's related to SPARK_LOCAL_IP and Spark UI, I think it's similar.
We use AWS EMR5 (Spark 2.0 over YARN)
While developing we use spark from intellij in yarn-client mode. To connect to 
AWS, OpenVPN is applied with biderectional routing so the executors in amazon 
can access the driver on the local workstation. With OpenVPN there are multiple 
network interfaces then, so SPARK_LOCAL_IP should be set. But it's really 
inconvenient to set SPARK_LOCAL_IP each time we connect to OpenVPN and get 
different IP.
So we decided to set it automatically from code. Since SPARK_LOCAL_IP is env 
var and env vars are inmutable in java, then we change the spark.driver.host 
corresponding java property. But as noticed spark.driver.host is only second 
order citizen in spark, for some features only the SPARK_LOCAL_IP is taken into 
account. For example Spark UI listens on the host set in SPARK_LOCAL_IP, but 
ignores the spark.driver.host totally. I hope this improvement will fix this 
issue, as I saw in 
https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42
I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, 
shouldn't they?


was (Author: gyorgy_suve...@epam.com):
+1
However, it's related to SPARK_LOCAL_IP and Spark UI, I think it's similar.
We use AWS EMR5 (Spark 2.0 over YARN)
While developing we use spark from intellij in yarn-client mode. To connect to 
AWS, OpenVPN is applied with biderectional routing so the executors in amazon 
can access the driver on the local workstation. With OpenVPN there are multiple 
network interfaces then, so SPARK_LOCAL_IP should be set. But it's really 
inconvenient to set SPARK_LOCAL_IP each time we connect to OpenVPN and get 
different IP.
So we decided to set it automatically from code. Since SPARK_LOCAL_IP is env 
var and env vars are inmutable in java, then we change the spark.driver.host 
corresponding java property. But as notices spark.driver.host is only second 
order citizen in spark, for some features only the SPARK_LOCAL_IP is taken into 
account. For example Spark UI listens on the host set in SPARK_LOCAL_IP, but 
ignores the spark.driver.host totally. I hope this improvement will fix this 
issue, as I saw in 
https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42
I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, 
shouldn't they?

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The Spark driver bind IP and advertised IP are not configurable. spark.driver.host 
> is only the bind IP. SPARK_PUBLIC_DNS does not work for the Spark driver. Allow an 
> option to set the advertised IP/hostname.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


