[jira] [Commented] (SPARK-18258) Sinks need access to offset representation
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638555#comment-15638555 ] Cody Koeninger commented on SPARK-18258: Sure, added, let me know if I'm missing something or can clarify. > Sinks need access to offset representation > -- > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink api that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. > After SPARK-17829 is complete and offsets have a .json method, an api for > this ticket might look like > {code} > trait Sink { > def addBatch(batchId: Long, data: DataFrame, start: OffsetSeq, end: > OffsetSeq): Unit > {code} > where start and end were provided by StreamExecution.runBatch using > committedOffsets and availableOffsets. > I'm not 100% certain that the offsets in the seq could always be mapped back > to the correct source when restarting complicated multi-source jobs, but I > think it'd be sufficient. Passing the string/json representation of the seq > instead of the seq itself would probably be sufficient as well, but the > convention of rendering a None as "-" in the json is maybe a little > idiosyncratic to parse, and the constant defining that is private. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
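To make the proposal above concrete, the following is a rough sketch of how a sink could use such an API to commit a batch's results and its offset range in one transaction. This is only an illustration under stated assumptions: it takes the JSON string form of the start/end offsets discussed above rather than OffsetSeq, and the JdbcOffsetSink class, the results and stream_offsets tables, and the row-at-a-time write path are hypothetical, not part of Spark.
{code}
// Hypothetical sketch only: assumes the addBatch(batchId, data, start, end) shape
// proposed in this ticket (with offsets passed as JSON strings) and an existing
// JDBC connection. Class and table names are illustrative.
import java.sql.Connection
import org.apache.spark.sql.DataFrame

class JdbcOffsetSink(conn: Connection) {
  def addBatch(batchId: Long, data: DataFrame, startJson: String, endJson: String): Unit = {
    conn.setAutoCommit(false)
    try {
      // Write the batch results (collected here for brevity; a real sink would
      // write per partition or through a bulk-load path).
      val insertResults = conn.prepareStatement("INSERT INTO results(value) VALUES (?)")
      data.collect().foreach { row =>
        insertResults.setString(1, row.mkString(","))
        insertResults.addBatch()
      }
      insertResults.executeBatch()

      // Store the offset range in the same transaction, so after a restart the
      // sink knows exactly which source offsets have already been committed.
      val insertOffsets = conn.prepareStatement(
        "INSERT INTO stream_offsets(batch_id, start_offsets, end_offsets) VALUES (?, ?, ?)")
      insertOffsets.setLong(1, batchId)
      insertOffsets.setString(2, startJson)
      insertOffsets.setString(3, endJson)
      insertOffsets.executeUpdate()

      conn.commit()
    } catch {
      case e: Exception =>
        conn.rollback()
        throw e
    }
  }
}
{code}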
[jira] [Updated] (SPARK-18258) Sinks need access to offset representation
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-18258: --- Description: Transactional "exactly-once" semantics for output require storing an offset identifier in the same transaction as results. The Sink.addBatch method currently only has access to batchId and data, not the actual offset representation. I want to store the actual offsets, so that they are recoverable as long as the results are and I'm not locked in to a particular streaming engine. I could see this being accomplished by adding parameters to Sink.addBatch for the starting and ending offsets (either the offsets themselves, or the SPARK-17829 string/json representation). That would be an API change, but if there's another way to map batch ids to offset representations without changing the Sink api that would work as well. I'm assuming we don't need the same level of access to offsets throughout a job as e.g. the Kafka dstream gives, because Sinks are the main place that should need them. After SPARK-17829 is complete and offsets have a .json method, an api for this ticket might look like {code} trait Sink { def addBatch(batchId: Long, data: DataFrame, start: OffsetSeq, end: OffsetSeq): Unit {code} where start and end were provided by StreamExecution.runBatch using committedOffsets and availableOffsets. I'm not 100% certain that the offsets in the seq could always be mapped back to the correct source when restarting complicated multi-source jobs, but I think it'd be sufficient. Passing the string/json representation of the seq instead of the seq itself would probably be sufficient as well, but the convention of rendering a None as "-" in the json is maybe a little idiosyncratic to parse, and the constant defining that is private. was: Transactional "exactly-once" semantics for output require storing an offset identifier in the same transaction as results. The Sink.addBatch method currently only has access to batchId and data, not the actual offset representation. I want to store the actual offsets, so that they are recoverable as long as the results are and I'm not locked in to a particular streaming engine. I could see this being accomplished by adding parameters to Sink.addBatch for the starting and ending offsets (either the offsets themselves, or the SPARK-17829 string/json representation). That would be an API change, but if there's another way to map batch ids to offset representations without changing the Sink api that would work as well. I'm assuming we don't need the same level of access to offsets throughout a job as e.g. the Kafka dstream gives, because Sinks are the main place that should need them. > Sinks need access to offset representation > -- > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). 
That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink api that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. > After SPARK-17829 is complete and offsets have a .json method, an api for > this ticket might look like > {code} > trait Sink { > def addBatch(batchId: Long, data: DataFrame, start: OffsetSeq, end: > OffsetSeq): Unit > {code} > where start and end were provided by StreamExecution.runBatch using > committedOffsets and availableOffsets. > I'm not 100% certain that the offsets in the seq could always be mapped back > to the correct source when restarting complicated multi-source jobs, but I > think it'd be sufficient. Passing the string/json representation of the seq > instead of the seq itself would probably be sufficient as well, but the > convention of rendering a None as "-" in the json is maybe a little > idiosyncratic to parse, and the constant defining that is private. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads
[ https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638546#comment-15638546 ] Felix Cheung commented on SPARK-12757: -- I'm seeing the same with latest master running a pipeline with GBTClassifier: {code} WARN Executor: 1 block locks were not released by TID = 7: [rdd_28_0] {code} to repro, take the code sample from the ml programming guide: {code} import org.apache.spark.ml.Pipeline import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier} import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer} // Load and parse the data file, converting it to a DataFrame. val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") // Index labels, adding metadata to the label column. // Fit on whole dataset to include all labels in index. val labelIndexer = new StringIndexer() .setInputCol("label") .setOutputCol("indexedLabel") .fit(data) // Automatically identify categorical features, and index them. // Set maxCategories so features with > 4 distinct values are treated as continuous. val featureIndexer = new VectorIndexer() .setInputCol("features") .setOutputCol("indexedFeatures") .setMaxCategories(4) .fit(data) // Split the data into training and test sets (30% held out for testing). val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3)) // Train a GBT model. val gbt = new GBTClassifier() .setLabelCol("indexedLabel") .setFeaturesCol("indexedFeatures") .setMaxIter(10) // Convert indexed labels back to original labels. val labelConverter = new IndexToString() .setInputCol("prediction") .setOutputCol("predictedLabel") .setLabels(labelIndexer.labels) // Chain indexers and GBT in a Pipeline. val pipeline = new Pipeline() .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter)) // Train model. This also runs the indexers. val model = pipeline.fit(trainingData) {code} > Use reference counting to prevent blocks from being evicted during reads > > > Key: SPARK-12757 > URL: https://issues.apache.org/jira/browse/SPARK-12757 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > As a pre-requisite to off-heap caching of blocks, we need a mechanism to > prevent pages / blocks from being evicted while they are being read. With > on-heap objects, evicting a block while it is being read merely leads to > memory-accounting problems (because we assume that an evicted block is a > candidate for garbage-collection, which will not be true during a read), but > with off-heap memory this will lead to either data corruption or segmentation > faults. > To address this, we should add a reference-counting mechanism to track which > blocks/pages are being read in order to prevent them from being evicted > prematurely. I propose to do this in two phases: first, add a safe, > conservative approach in which all BlockManager.get*() calls implicitly > increment the reference count of blocks and where tasks' references are > automatically freed upon task completion. This will be correct but may have > adverse performance impacts because it will prevent legitimate block > evictions. In phase two, we should incrementally add release() calls in order > to fix the eviction of unreferenced blocks. 
The latter change may need to > touch many different components, which is why I propose to do it separately > in order to make the changes easier to reason about and review. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
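As an illustration of the phase-one bookkeeping described above, here is a minimal single-JVM sketch of pin/unpin reference counting keyed by block id. The class name and the String block ids are placeholders; Spark's actual implementation lives in the block manager and also handles per-task accounting, locking, and cleanup that this sketch omits.
{code}
// Minimal sketch of reference counting for block reads; not Spark's real code.
import scala.collection.mutable

class BlockPinCounts {
  private val pins = mutable.Map.empty[String, Long]

  /** Called from get*(): record that a reader is using the block. */
  def pin(blockId: String): Unit = synchronized {
    pins(blockId) = pins.getOrElse(blockId, 0L) + 1
  }

  /** Called from release() or at task completion: drop one reference. */
  def unpin(blockId: String): Unit = synchronized {
    val remaining = pins.getOrElse(blockId, 0L) - 1
    if (remaining <= 0) pins.remove(blockId) else pins(blockId) = remaining
  }

  /** Eviction must skip blocks that still have active readers. */
  def canEvict(blockId: String): Boolean = synchronized {
    pins.getOrElse(blockId, 0L) <= 0
  }
}
{code}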
[jira] [Created] (SPARK-18285) approxQuantile in R support multi-column
zhengruifeng created SPARK-18285: Summary: approxQuantile in R support multi-column Key: SPARK-18285 URL: https://issues.apache.org/jira/browse/SPARK-18285 Project: Spark Issue Type: Improvement Components: SparkR Reporter: zhengruifeng approxQuantile in R should support multi-column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13677) Support Tree-Based Feature Transformation for ML
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638489#comment-15638489 ] zhengruifeng commented on SPARK-13677: -- Sklearn's documentation (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html) shows that GBDT+LiR often achieves a higher score than GBDT alone. > Support Tree-Based Feature Transformation for ML > > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: zhengruifeng >Priority: Minor > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > facebook(http://www.herbrich.me/papers/adclicksfacebook.pdf), and is > implemented in two famous library: > sklearn > (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py) > xgboost > (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py) > I have implement it in mllib: > val features : RDD[Vector] = ... > val model1 : RandomForestModel = ... > val transformed1 : RDD[Vector] = model1.leaf(features) > val model2 : GradientBoostedTreesModel = ... > val transformed2 : RDD[Vector] = model2.leaf(features) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
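To make the one-hot encoding of leaf indices described above concrete, here is a small sketch. It assumes the per-tree leaf indices have already been obtained (for example from the model.leaf(features) method proposed in the description); the function name, the leavesPerTree argument, and the example numbers are illustrative only.
{code}
// Sketch of the one-hot encoding step: each tree contributes a block of features,
// and the leaf the sample falls into sets exactly one feature in that block to 1.
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def oneHotLeaves(leafIndices: Array[Int], leavesPerTree: Array[Int]): Vector = {
  require(leafIndices.length == leavesPerTree.length)
  val offsets = leavesPerTree.scanLeft(0)(_ + _)   // start of each tree's feature block
  val activeIndices = leafIndices.zipWithIndex.map { case (leaf, tree) => offsets(tree) + leaf }
  Vectors.sparse(offsets.last, activeIndices, Array.fill(activeIndices.length)(1.0))
}

// Example: 3 trees with 4 leaves each, and a sample falling into leaves 2, 0 and 3:
// oneHotLeaves(Array(2, 0, 3), Array(4, 4, 4)) == (12,[2,4,11],[1.0,1.0,1.0])
// The resulting vectors can then be fed to a linear model such as LoR.
{code}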
[jira] [Commented] (SPARK-14047) GBT improvement umbrella
[ https://issues.apache.org/jira/browse/SPARK-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638475#comment-15638475 ] zhengruifeng commented on SPARK-14047: -- I personally think SPARK-15581 may be an improvement for GBT. Combining GBT's leaf transformation with LoR/FM is now a mainstream technique. > GBT improvement umbrella > > > Key: SPARK-14047 > URL: https://issues.apache.org/jira/browse/SPARK-14047 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Joseph K. Bradley > > This is an umbrella for improvements to learning Gradient Boosted Trees: > GBTClassifier, GBTRegressor. > Note: Aspects of GBTs which are related to individual trees should be listed > under [SPARK-14045]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14047) GBT improvement umbrella
[ https://issues.apache.org/jira/browse/SPARK-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638475#comment-15638475 ] zhengruifeng edited comment on SPARK-14047 at 11/5/16 3:21 AM: --- I personally think SPARK-13677 may be an improvement for GBT. Combining GBT's leaf transformation with LoR/FM is now a mainstream technique. was (Author: podongfeng): I personally think SPARK-15581 may be an improvement for GBT. Combining GBT's leaf transformation with LoR/FM is now a mainstream technique. > GBT improvement umbrella > > > Key: SPARK-14047 > URL: https://issues.apache.org/jira/browse/SPARK-14047 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Joseph K. Bradley > > This is an umbrella for improvements to learning Gradient Boosted Trees: > GBTClassifier, GBTRegressor. > Note: Aspects of GBTs which are related to individual trees should be listed > under [SPARK-14045]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18258) Sinks need access to offset representation
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638470#comment-15638470 ] Reynold Xin commented on SPARK-18258: - This makes sense. It's just extra information you want to be able to see what's going on. Can you sketch the API out and put a proposal in the ticket description? Doesn't need to be very well thought out. It will move the discussion forward. > Sinks need access to offset representation > -- > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink api that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13677) Support Tree-Based Feature Transformation for ML
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638459#comment-15638459 ] zhengruifeng commented on SPARK-13677: -- Since mllib is in maintenance status, if this feature is to be included, the corresponding PR will be updated to focus on ML only. > Support Tree-Based Feature Transformation for ML > > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: zhengruifeng >Priority: Minor > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > facebook(http://www.herbrich.me/papers/adclicksfacebook.pdf), and is > implemented in two famous library: > sklearn > (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py) > xgboost > (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py) > I have implement it in mllib: > val features : RDD[Vector] = ... > val model1 : RandomForestModel = ... > val transformed1 : RDD[Vector] = model1.leaf(features) > val model2 : GradientBoostedTreesModel = ... > val transformed2 : RDD[Vector] = model2.leaf(features) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13677) Support Tree-Based Feature Transformation for ML
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-13677: - Summary: Support Tree-Based Feature Transformation for ML (was: Support Tree-Based Feature Transformation for mllib) > Support Tree-Based Feature Transformation for ML > > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: zhengruifeng >Priority: Minor > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > facebook(http://www.herbrich.me/papers/adclicksfacebook.pdf), and is > implemented in two famous library: > sklearn > (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py) > xgboost > (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py) > I have implement it in mllib: > val features : RDD[Vector] = ... > val model1 : RandomForestModel = ... > val transformed1 : RDD[Vector] = model1.leaf(features) > val model2 : GradientBoostedTreesModel = ... > val transformed2 : RDD[Vector] = model2.leaf(features) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-14174: - Description: The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm. I have implemented mini-batch kmeans in Mllib, and the acceleration is realy significant. The MiniBatch KMeans is named XMeans in following lines. {code} val path = "/tmp/mnist8m.scale" val data = MLUtils.loadLibSVMFile(sc, path) val vecs = data.map(_.features).persist() val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l) km.computeCost(vecs) res0: Double = 3.317029898599564E8 val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l) xm.computeCost(vecs) res1: Double = 3.3169865959604424E8 val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l) xm2.computeCost(vecs) res2: Double = 3.317195831216454E8 {code} The above three training all reached the max number of iterations 10. We can see that the WSSSEs are almost the same. While their speed perfermence have significant difference: {code} KMeans2876sec MiniBatch KMeans (fraction=0.1) 263sec MiniBatch KMeans (fraction=0.01) 90sec {code} With appropriate fraction, the bigger the dataset is, the higher speedup is. The data used above have 8,100,000 samples, 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2) Comparison of the K-Means and MiniBatchKMeans on sklearn : http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py was: The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm. I have implemented mini-batch kmeans in Mllib, and the acceleration is realy significant. The MiniBatch KMeans is named XMeans in following lines. 
{code} val path = "/tmp/mnist8m.scale" val data = MLUtils.loadLibSVMFile(sc, path) val vecs = data.map(_.features).persist() val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l) km.computeCost(vecs) res0: Double = 3.317029898599564E8 val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l) xm.computeCost(vecs) res1: Double = 3.3169865959604424E8 val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l) xm2.computeCost(vecs) res2: Double = 3.317195831216454E8 {code} The above three training all reached the max number of iterations 10. We can see that the WSSSEs are almost the same. While their speed perfermence have significant difference: {code} KMeans2876sec MiniBatch KMeans (fraction=0.1) 263sec MiniBatch KMeans (fraction=0.01) 90sec {code} With appropriate fraction, the bigger the dataset is, the higher speedup is. The data used above have 8,100,000 samples, 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2) > Accelerate KMeans via Mini-Batch EM > --- > > Key: SPARK-14174 > URL: https://issues.apache.org/jira/browse/SPARK-14174 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: zhengruifeng >Priority: Minor > > The MiniBatchKMeans is a variant of the KMeans algorithm which uses > mini-batches to reduce the computation time, while still attempting to > optimise the same objective function. Mini-batches are subsets of the input > data, randomly
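For reference, here is a minimal single-machine sketch of the mini-batch update rule the description relies on (assign each sampled point to its nearest center, then nudge that center toward the point with a per-center learning rate of 1/count). It is not the distributed MLlib implementation benchmarked above; the names, random initialization, and lack of a convergence check are all simplifications.
{code}
// Single-machine sketch of mini-batch k-means (Sculley-style updates), for illustration.
import scala.util.Random

def miniBatchKMeans(points: Array[Array[Double]], k: Int, batchSize: Int,
                    iterations: Int, seed: Long = 123L): Array[Array[Double]] = {
  require(points.length >= k)
  val rng = new Random(seed)
  val dim = points.head.length
  // Initialize centers with k random points.
  val centers = rng.shuffle(points.toSeq).take(k).map(_.clone).toArray
  val counts = Array.fill(k)(0L)

  def nearest(p: Array[Double]): Int =
    centers.indices.minBy { i =>
      var d = 0.0; var j = 0
      while (j < dim) { val diff = p(j) - centers(i)(j); d += diff * diff; j += 1 }
      d
    }

  for (_ <- 0 until iterations) {
    val batch = Array.fill(batchSize)(points(rng.nextInt(points.length)))
    val assignments = batch.map(nearest)             // assign with the current centers
    batch.zip(assignments).foreach { case (p, c) =>
      counts(c) += 1
      val eta = 1.0 / counts(c)                      // per-center learning rate
      var j = 0
      while (j < dim) { centers(c)(j) += eta * (p(j) - centers(c)(j)); j += 1 }
    }
  }
  centers
}
{code}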
[jira] [Created] (SPARK-18284) Schema of DataFrame generated from RDD is different between master and 2.0
Kazuaki Ishizaki created SPARK-18284: Summary: Schema of DataFrame generated from RDD is different between master and 2.0 Key: SPARK-18284 URL: https://issues.apache.org/jira/browse/SPARK-18284 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Kazuaki Ishizaki When the following program is executed, the schema of the DataFrame differs between master and branch 2.0. The nullable flag should be false. {code:java} val df = sparkContext.parallelize(1 to 8, 1).toDF() df.printSchema df.filter("value > 4").count === master === root |-- value: integer (nullable = true) === branch 2.0 === root |-- value: integer (nullable = false) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
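The nullability difference can also be checked programmatically rather than by reading printSchema output; a small snippet, assuming a SparkSession named spark is in scope:
{code}
// Prints the nullable flag of the generated column; expected false, since the
// DataFrame comes from an RDD of primitive Ints that can never be null.
import spark.implicits._

val df = spark.sparkContext.parallelize(1 to 8, 1).toDF()
println(df.schema("value").nullable)   // branch 2.0: false; master: true (the regression)
{code}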
[jira] [Resolved] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results
[ https://issues.apache.org/jira/browse/SPARK-18256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18256. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15756 [https://github.com/apache/spark/pull/15756] > Improve performance of event log replay in HistoryServer based on profiler > results > -- > > Key: SPARK-18256 > URL: https://issues.apache.org/jira/browse/SPARK-18256 > Project: Spark > Issue Type: Improvement >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.2.0 > > > Profiling event log replay in the HistoryServer reveals Json4S control flow > exceptions and `Utils.getFormattedClassName` calls as significant > bottlenecks. Eliminating these halves the time to replay long event logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties
[ https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638392#comment-15638392 ] Apache Spark commented on SPARK-17748: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/15779 > One-pass algorithm for linear regression with L1 and elastic-net penalties > -- > > Key: SPARK-17748 > URL: https://issues.apache.org/jira/browse/SPARK-17748 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > Fix For: 2.1.0 > > > Currently linear regression uses weighted least squares to solve the normal > equations locally on the driver when the dimensionality is small (<4096). > Weighted least squares uses a Cholesky decomposition to solve the problem > with L2 regularization (which has a closed-form solution). We can support > L1/elasticnet penalties by solving the equations locally using OWL-QN solver. > Also note that Cholesky does not handle singular covariance matrices, but > L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch > can also add support for solving singular covariance matrices by also adding > L-BFGS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
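To illustrate the shape of the local problem this feature adds, here is a sketch that minimizes 0.5*w'Aw - b'w + lambda*||w||_1, where A = X'X and b = X'y are the same summary statistics the normal-equation path already aggregates. It uses plain cyclic coordinate descent with soft-thresholding rather than the OWL-QN solver the ticket proposes, purely to show what solving the L1-penalized normal equations locally looks like; the function name and defaults are assumptions.
{code}
// Illustrative local solver for the L1-penalized normal equations; not Spark's code.
def solveL1(a: Array[Array[Double]], b: Array[Double], lambda: Double,
            iterations: Int = 100): Array[Double] = {
  val n = b.length
  val w = Array.fill(n)(0.0)
  for (_ <- 0 until iterations; j <- 0 until n) {
    // Partial residual for coordinate j: b_j - sum over k != j of A_jk * w_k
    var r = b(j)
    for (k <- 0 until n if k != j) r -= a(j)(k) * w(k)
    // Soft-threshold by lambda, then scale by the diagonal entry A_jj.
    val s = math.signum(r) * math.max(math.abs(r) - lambda, 0.0)
    w(j) = if (a(j)(j) != 0.0) s / a(j)(j) else 0.0
  }
  w
}
{code}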
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638280#comment-15638280 ] Felix Cheung commented on SPARK-15581: -- This is a great next step if we could get more concrete on the roadmap. I would really like to call out R specifically though. I think there are several aspects we could do better: - Keeping up-to-date: There are many ML algorithms that are 3+ minor releases behind, how could we stay more up-to-date on new or changes and yet have stable enough Scala APIs to work with? Perhaps alternatively, how could we make R easier for ML contributors to work with? - Supporting advanced users: As of now ML APIs in R are fairly straightforward and are missing any capability to configure pipeline with extractor, feature transformer, or even evaluator. How could we make it more adaptable? > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > Fix For: 2.1.0 > > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some features. This is to avoid duplicate work. For > small features, you don't need to wait to get JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there exist no > activity on the JIRA page for a certain amount of time, the JIRA should be > released for other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. 
Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. > Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API > * Umbrella JIRA: [SPARK-4591] > h2. Persistence > * Complete persistence within MLlib > ** Python tuning (SPARK-13786) > * MLlib in R format: compatibility with other languages (SPARK-15572) > * Impose
[jira] [Commented] (SPARK-10523) SparkR formula syntax to turn strings/factors into numerics
[ https://issues.apache.org/jira/browse/SPARK-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638254#comment-15638254 ] Felix Cheung commented on SPARK-10523: -- Is this still an issue? As Yanbo says, we now support string label for classification and we have multiclass logistic regression in R. > SparkR formula syntax to turn strings/factors into numerics > --- > > Key: SPARK-10523 > URL: https://issues.apache.org/jira/browse/SPARK-10523 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Vincent Warmerdam > > In normal (non SparkR) R the formula syntax enables strings or factors to be > turned into dummy variables immediately when calling a classifier. This way, > the following R pattern is legal and often used: > {code} > library(magrittr) > df <- data.frame( class = c("a", "a", "b", "b"), i = c(1, 2, 5, 6)) > glm(class ~ i, family = "binomial", data = df) > {code} > The glm method will know that `class` is a string/factor and handles it > appropriately by casting it to a 0/1 array before applying any machine > learning. SparkR doesn't do this. > {code} > > ddf <- sqlContext %>% > createDataFrame(df) > > glm(class ~ i, family = "binomial", data = ddf) > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > java.lang.IllegalArgumentException: Unsupported type for label: StringType > at > org.apache.spark.ml.feature.RFormulaModel.transformLabel(RFormula.scala:185) > at > org.apache.spark.ml.feature.RFormulaModel.transform(RFormula.scala:150) > at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:146) > at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:134) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42) > at > scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43) > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:134) > at > org.apache.spark.ml.api.r.SparkRWrappers$.fitRModelFormula(SparkRWrappers.scala:46) > at > org.apache.spark.ml.api.r.SparkRWrappers.fitRModelFormula(SparkRWrappers.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.refl > {code} > This can be fixed by doing a bit of manual labor. SparkR does accept booleans > as if they are integers here. > {code} > > ddf <- ddf %>% > withColumn("to_pred", .$class == "a") > > glm(to_pred ~ i, family = "binomial", data = ddf) > {code} > But this can become quite tedious, especially when you want to have models > that are using multiple classes that need classification. This is perhaps > less relevant for logistic regression (because it is a bit more like a > one-off classification approach) but it certainly is relevant if you would > want to use a formula for a randomforest and a column denotes, say, a type of > flower from the iris dataset. > Is there a good reason why this should not be a feature of formulas in Spark? > I am aware of issue 8774, which looks like it is adressing a similar theme > but a different issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18283) Add a test to make sure the default starting offset is latest
[ https://issues.apache.org/jira/browse/SPARK-18283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18283: Assignee: Tathagata Das (was: Apache Spark) > Add a test to make sure the default starting offset is latest > - > > Key: SPARK-18283 > URL: https://issues.apache.org/jira/browse/SPARK-18283 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18283) Add a test to make sure the default starting offset is latest
[ https://issues.apache.org/jira/browse/SPARK-18283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638228#comment-15638228 ] Apache Spark commented on SPARK-18283: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/15778 > Add a test to make sure the default starting offset is latest > - > > Key: SPARK-18283 > URL: https://issues.apache.org/jira/browse/SPARK-18283 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18283) Add a test to make sure the default starting offset is latest
[ https://issues.apache.org/jira/browse/SPARK-18283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18283: Assignee: Apache Spark (was: Tathagata Das) > Add a test to make sure the default starting offset is latest > - > > Key: SPARK-18283 > URL: https://issues.apache.org/jira/browse/SPARK-18283 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18283) Add a test to make sure the default starting offset is latest
Tathagata Das created SPARK-18283: - Summary: Add a test to make sure the default starting offset is latest Key: SPARK-18283 URL: https://issues.apache.org/jira/browse/SPARK-18283 Project: Spark Issue Type: Sub-task Components: Structured Streaming Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638190#comment-15638190 ] Apache Spark commented on SPARK-18282: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/15777 > Add model summaries for Python GMM and BisectingKMeans > -- > > Key: SPARK-18282 > URL: https://issues.apache.org/jira/browse/SPARK-18282 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > GaussianMixtureModel and BisectingKMeansModel in python do not have model > summaries, but they are implemented in Scala. We should add them for API > parity before 2.1 release. After the QA Jiras are created, this can be linked > as a subtask. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18282: Assignee: Apache Spark > Add model summaries for Python GMM and BisectingKMeans > -- > > Key: SPARK-18282 > URL: https://issues.apache.org/jira/browse/SPARK-18282 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Seth Hendrickson >Assignee: Apache Spark >Priority: Minor > > GaussianMixtureModel and BisectingKMeansModel in python do not have model > summaries, but they are implemented in Scala. We should add them for API > parity before 2.1 release. After the QA Jiras are created, this can be linked > as a subtask. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18282: Assignee: (was: Apache Spark) > Add model summaries for Python GMM and BisectingKMeans > -- > > Key: SPARK-18282 > URL: https://issues.apache.org/jira/browse/SPARK-18282 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Seth Hendrickson >Priority: Minor > > GaussianMixtureModel and BisectingKMeansModel in python do not have model > summaries, but they are implemented in Scala. We should add them for API > parity before 2.1 release. After the QA Jiras are created, this can be linked > as a subtask. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18282) Add model summaries for Python GMM and BisectingKMeans
Seth Hendrickson created SPARK-18282: Summary: Add model summaries for Python GMM and BisectingKMeans Key: SPARK-18282 URL: https://issues.apache.org/jira/browse/SPARK-18282 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Seth Hendrickson Priority: Minor GaussianMixtureModel and BisectingKMeansModel in python do not have model summaries, but they are implemented in Scala. We should add them for API parity before 2.1 release. After the QA Jiras are created, this can be linked as a subtask. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17710) ReplSuite fails with ClassCircularityError in master Maven builds
[ https://issues.apache.org/jira/browse/SPARK-17710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15638114#comment-15638114 ] Apache Spark commented on SPARK-17710: -- User 'weiqingy' has created a pull request for this issue: https://github.com/apache/spark/pull/15776 > ReplSuite fails with ClassCircularityError in master Maven builds > - > > Key: SPARK-17710 > URL: https://issues.apache.org/jira/browse/SPARK-17710 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Weiqing Yang >Priority: Critical > Fix For: 2.1.0 > > > The master Maven build is currently broken because ReplSuite consistently > fails with ClassCircularityErrors. See > https://spark-tests.appspot.com/jobs/spark-master-test-maven-hadoop-2.3 for a > timeline of the failure. > Here's the first build where this failed: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.3/2000/ > This appears to correspond to > https://github.com/apache/spark/commit/6a68c5d7b4eb07e4ed6b702dd1536cd08d9bba7d > The same tests pass in the SBT build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18281) toLocalIterator yields time out error on pyspark2
[ https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Miner updated SPARK-18281: --- Description: I run the example straight out of the api docs for toLocalIterator and it gives a time out exception: {code} from pyspark import SparkContext sc = SparkContext() rdd = sc.parallelize(range(10)) [x for x in rdd.toLocalIterator()] {code} conf file: spark.driver.maxResultSize 6G spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError spark.executor.memory 16G spark.executor.uri foo/spark-2.0.1-bin-hadoop2.7.tgz spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem spark.hadoop.fs.s3a.buffer.dir /raid0/spark spark.hadoop.fs.s3n.buffer.dir /raid0/spark spark.hadoop.fs.s3a.connection.timeout 50 spark.hadoop.fs.s3n.multipart.uploads.enabled true spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 spark.hadoop.parquet.block.size 2147483648 spark.hadoop.parquet.enable.summary-metadatafalse spark.jars.packages com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34 spark.local.dir /raid0/spark spark.mesos.coarse false spark.mesos.constraints priority:1 spark.network.timeout 600 spark.rpc.message.maxSize500 spark.speculation false spark.sql.parquet.mergeSchema false spark.sql.planner.externalSort true spark.submit.deployMode client spark.task.cpus 1 Exception here: {code} --- timeout Traceback (most recent call last) in () 2 sc = SparkContext() 3 rdd = sc.parallelize(range(10)) > 4 [x for x in rdd.toLocalIterator()] /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in _load_from_socket(port, serializer) 140 try: 141 rf = sock.makefile("rb", 65536) --> 142 for item in serializer.load_stream(rf): 143 yield item 144 finally: /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in load_stream(self, stream) 137 while True: 138 try: --> 139 yield self._read_with_length(stream) 140 except EOFError: 141 return /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in _read_with_length(self, stream) 154 155 def _read_with_length(self, stream): --> 156 length = read_int(stream) 157 if length == SpecialLengths.END_OF_DATA_SECTION: 158 raise EOFError /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in read_int(stream) 541 542 def read_int(stream): --> 543 length = stream.read(4) 544 if not length: 545 raise EOFError /usr/lib/python2.7/socket.pyc in read(self, size) 378 # fragmentation issues on many platforms. 
379 try: --> 380 data = self._sock.recv(left) 381 except error, e: 382 if e.args[0] == EINTR: timeout: timed out {code} was: I run the example straight out of the api docs for {code}toLocalIterator{code} and it gives a time out exception: {code} from pyspark import SparkContext sc = SparkContext() rdd = sc.parallelize(range(10)) [x for x in rdd.toLocalIterator()] {code} conf file: spark.driver.maxResultSize 6G spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError spark.executor.memory 16G spark.executor.uri foo/spark-2.0.1-bin-hadoop2.7.tgz spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem spark.hadoop.fs.s3a.buffer.dir /raid0/spark spark.hadoop.fs.s3n.buffer.dir /raid0/spark spark.hadoop.fs.s3a.connection.timeout 50 spark.hadoop.fs.s3n.multipart.uploads.enabled true spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 spark.hadoop.parquet.block.size 2147483648 spark.hadoop.parquet.enable.summary-metadatafalse spark.jars.packages com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34 spark.local.dir /raid0/spark spark.mesos.coarse false spark.mesos.constraints priority:1 spark.network.timeout 600 spark.rpc.message.maxSize500 spark.speculation false spark.sql.parquet.mergeSchema false spark.sql.planner.externalSort true spark.submit.deployMode client spark.task.cpus 1 Exception here: {code} --- timeout Traceback (most recent call last) in () 2 sc = SparkContext() 3 rdd = sc.parallelize(range(10)) > 4 [x for x in rdd.toLocalIterator()] /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in _load_from_socket(port, serializer) 140 try: 141 rf = sock.makefile("rb", 65536) --> 142 for item in serializer.load_stream(rf): 143 yield item 144
[jira] [Created] (SPARK-18281) toLocalIterator yields time out error on pyspark2
Luke Miner created SPARK-18281: -- Summary: toLocalIterator yields time out error on pyspark2 Key: SPARK-18281 URL: https://issues.apache.org/jira/browse/SPARK-18281 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.1 Environment: Ubuntu 14.04.5 LTS Driver: AWS M4.XLARGE Slaves: AWS M4.4.XLARGE mesos 1.0.1 spark 2.0.1 pyspark Reporter: Luke Miner I run the example straight out of the api docs for {code}toLocalIterator{code} and it gives a time out exception: {code} from pyspark import SparkContext sc = SparkContext() rdd = sc.parallelize(range(10)) [x for x in rdd.toLocalIterator()] {code} conf file: spark.driver.maxResultSize 6G spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError spark.executor.memory 16G spark.executor.uri foo/spark-2.0.1-bin-hadoop2.7.tgz spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem spark.hadoop.fs.s3a.buffer.dir /raid0/spark spark.hadoop.fs.s3n.buffer.dir /raid0/spark spark.hadoop.fs.s3a.connection.timeout 50 spark.hadoop.fs.s3n.multipart.uploads.enabled true spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 spark.hadoop.parquet.block.size 2147483648 spark.hadoop.parquet.enable.summary-metadatafalse spark.jars.packages com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34 spark.local.dir /raid0/spark spark.mesos.coarse false spark.mesos.constraints priority:1 spark.network.timeout 600 spark.rpc.message.maxSize500 spark.speculation false spark.sql.parquet.mergeSchema false spark.sql.planner.externalSort true spark.submit.deployMode client spark.task.cpus 1 Exception here: {code} --- timeout Traceback (most recent call last) in () 2 sc = SparkContext() 3 rdd = sc.parallelize(range(10)) > 4 [x for x in rdd.toLocalIterator()] /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in _load_from_socket(port, serializer) 140 try: 141 rf = sock.makefile("rb", 65536) --> 142 for item in serializer.load_stream(rf): 143 yield item 144 finally: /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in load_stream(self, stream) 137 while True: 138 try: --> 139 yield self._read_with_length(stream) 140 except EOFError: 141 return /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in _read_with_length(self, stream) 154 155 def _read_with_length(self, stream): --> 156 length = read_int(stream) 157 if length == SpecialLengths.END_OF_DATA_SECTION: 158 raise EOFError /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in read_int(stream) 541 542 def read_int(stream): --> 543 length = stream.read(4) 544 if not length: 545 raise EOFError /usr/lib/python2.7/socket.pyc in read(self, size) 378 # fragmentation issues on many platforms. 379 try: --> 380 data = self._sock.recv(left) 381 except error, e: 382 if e.args[0] == EINTR: timeout: timed out {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16804: Fix Version/s: 2.0.2 > Correlated subqueries containing non-deterministic operators return incorrect > results > - > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Assignee: Nattavut Sutyanyong > Fix For: 2.0.2, 2.1.0 > > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst paser may lead to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17337: Fix Version/s: 2.0.2 > Incomplete algorithm for name resolution in Catalyst paser may lead to > incorrect result > --- > > Key: SPARK-17337 > URL: https://issues.apache.org/jira/browse/SPARK-17337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Assignee: Herman van Hovell > Labels: correctness > Fix For: 2.0.2, 2.1.0 > > > While investigating SPARK-16951, I found an incorrect results case from a NOT > IN subquery. I thought originally it is an edge case. Further investigation > found this is a more general problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16804) Correlated subqueries containing non-deterministic operators return incorrect results
[ https://issues.apache.org/jira/browse/SPARK-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637905#comment-15637905 ] Apache Spark commented on SPARK-16804: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/15772 > Correlated subqueries containing non-deterministic operators return incorrect > results > - > > Key: SPARK-16804 > URL: https://issues.apache.org/jira/browse/SPARK-16804 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Assignee: Nattavut Sutyanyong > Fix For: 2.1.0 > > Original Estimate: 72h > Remaining Estimate: 72h > > Correlated subqueries with LIMIT could return incorrect results. The rule > ResolveSubquery in the Analysis phase moves correlated predicates to a join > predicates and neglect the semantic of the LIMIT. > Example: > {noformat} > Seq(1, 2).toDF("c1").createOrReplaceTempView("t1") > Seq(1, 2).toDF("c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where exists (select 1 from t2 where t1.c1=t2.c2 LIMIT > 1)").show > +---+ > > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result contains both rows from T1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`
[ https://issues.apache.org/jira/browse/SPARK-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18280: Assignee: (was: Apache Spark) > Potential deadlock in `StandaloneSchedulerBackend.dead` > --- > > Key: SPARK-18280 > URL: https://issues.apache.org/jira/browse/SPARK-18280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2, 2.0.0, 2.0.1 >Reporter: Shixiong Zhu > > "StandaloneSchedulerBackend.dead" is called in an RPC thread, so it should not > call "SparkContext.stop" in the same thread. "SparkContext.stop" will block > until all RPC threads exit; if it's called inside an RPC thread, it will > deadlock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
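A minimal sketch of the avoidance pattern described above (an illustration only, not the actual Spark patch; the {{sc}} field and the logging call are assumed): the RPC thread hands the shutdown off to a separate thread instead of calling SparkContext.stop itself, so it never blocks waiting for its own exit.
{code}
// Hypothetical sketch: never call SparkContext.stop() from an RPC thread
// that must itself exit before stop() can return.
def dead(reason: String): Unit = {
  logError(s"Application has been killed. Reason: $reason")  // logging assumed
  val stopper = new Thread("stop-spark-context") {
    override def run(): Unit = sc.stop()  // sc: the backend's SparkContext (assumed)
  }
  stopper.setDaemon(true)
  stopper.start()  // the RPC thread returns immediately and can shut down
}
{code}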
[jira] [Commented] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`
[ https://issues.apache.org/jira/browse/SPARK-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637873#comment-15637873 ] Apache Spark commented on SPARK-18280: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15775 > Potential deadlock in `StandaloneSchedulerBackend.dead` > --- > > Key: SPARK-18280 > URL: https://issues.apache.org/jira/browse/SPARK-18280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2, 2.0.0, 2.0.1 >Reporter: Shixiong Zhu > > "StandaloneSchedulerBackend.dead" is called in an RPC thread, so it should not > call "SparkContext.stop" in the same thread. "SparkContext.stop" will block > until all RPC threads exit; if it's called inside an RPC thread, it will > deadlock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`
[ https://issues.apache.org/jira/browse/SPARK-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18280: Assignee: Apache Spark > Potential deadlock in `StandaloneSchedulerBackend.dead` > --- > > Key: SPARK-18280 > URL: https://issues.apache.org/jira/browse/SPARK-18280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2, 2.0.0, 2.0.1 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > "StandaloneSchedulerBackend.dead" is called in an RPC thread, so it should not > call "SparkContext.stop" in the same thread. "SparkContext.stop" will block > until all RPC threads exit; if it's called inside an RPC thread, it will > deadlock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`
Shixiong Zhu created SPARK-18280: Summary: Potential deadlock in `StandaloneSchedulerBackend.dead` Key: SPARK-18280 URL: https://issues.apache.org/jira/browse/SPARK-18280 Project: Spark Issue Type: Bug Reporter: Shixiong Zhu "StandaloneSchedulerBackend.dead" is called in an RPC thread, so it should not call "SparkContext.stop" in the same thread. "SparkContext.stop" will block until all RPC threads exit; if it's called inside an RPC thread, it will deadlock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18280) Potential deadlock in `StandaloneSchedulerBackend.dead`
[ https://issues.apache.org/jira/browse/SPARK-18280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18280: - Affects Version/s: 1.6.2 2.0.0 2.0.1 > Potential deadlock in `StandaloneSchedulerBackend.dead` > --- > > Key: SPARK-18280 > URL: https://issues.apache.org/jira/browse/SPARK-18280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2, 2.0.0, 2.0.1 >Reporter: Shixiong Zhu > > "StandaloneSchedulerBackend.dead" is called in an RPC thread, so it should not > call "SparkContext.stop" in the same thread. "SparkContext.stop" will block > until all RPC threads exit; if it's called inside an RPC thread, it will > deadlock. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18189) task not serializable with groupByKey() + mapGroups() + map
[ https://issues.apache.org/jira/browse/SPARK-18189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637824#comment-15637824 ] Apache Spark commented on SPARK-18189: -- User 'seyfe' has created a pull request for this issue: https://github.com/apache/spark/pull/15774 > task not serializable with groupByKey() + mapGroups() + map > --- > > Key: SPARK-18189 > URL: https://issues.apache.org/jira/browse/SPARK-18189 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yang Yang >Assignee: Ergin Seyfe > Fix For: 2.0.2, 2.1.0 > > > just run the following code > {code} > val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4.as[(Int,Int)] > val grouped = a.groupByKey({x:(Int,Int)=>x._1}) > val mappedGroups = grouped.mapGroups((k,x)=>{(k,1)}) > val yyy = sc.broadcast(1) > val last = mappedGroups.rdd.map(xx=>{ > val simpley = yyy.value > > 1 > }) > {code} > spark says Task not serializable -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18279) ML programming guide should have R examples
[ https://issues.apache.org/jira/browse/SPARK-18279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18279: - Affects Version/s: 2.1.0 Target Version/s: 2.1.0 > ML programming guide should have R examples > --- > > Key: SPARK-18279 > URL: https://issues.apache.org/jira/browse/SPARK-18279 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > http://spark.apache.org/docs/latest/ml-classification-regression.html > for example, should have the R tab with examples. > This should be done on all ML that has the R API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18279) ML programming guide should have R examples
[ https://issues.apache.org/jira/browse/SPARK-18279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18279: - Description: http://spark.apache.org/docs/latest/ml-classification-regression.html for example, should have the R tab with examples, just like Scala/Java/Python. This should be done on all ML that has the R API. was: http://spark.apache.org/docs/latest/ml-classification-regression.html for example, should have the R tab with examples. This should be done on all ML that has the R API. > ML programming guide should have R examples > --- > > Key: SPARK-18279 > URL: https://issues.apache.org/jira/browse/SPARK-18279 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > http://spark.apache.org/docs/latest/ml-classification-regression.html > for example, should have the R tab with examples, just like Scala/Java/Python. > This should be done on all ML that has the R API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18279) ML programming guide should have R examples
Felix Cheung created SPARK-18279: Summary: ML programming guide should have R examples Key: SPARK-18279 URL: https://issues.apache.org/jira/browse/SPARK-18279 Project: Spark Issue Type: Bug Components: ML, SparkR Reporter: Felix Cheung http://spark.apache.org/docs/latest/ml-classification-regression.html for example, should have the R tab with examples. This should be done on all ML that has the R API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18266) Update R vignettes and programming guide for 2.1.0 release
[ https://issues.apache.org/jira/browse/SPARK-18266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637756#comment-15637756 ] Felix Cheung commented on SPARK-18266: -- Actually, I just realized the ML programming guide (not just the SparkR programming guide) should also talk about R APIs. Opened https://issues.apache.org/jira/browse/SPARK-18279 > Update R vignettes and programming guide for 2.1.0 release > -- > > Key: SPARK-18266 > URL: https://issues.apache.org/jira/browse/SPARK-18266 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed
[ https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar closed SPARK-18273. > DataFrameReader.load takes a lot of time to start the job if a lot of > file/dir paths are passed > -- > > Key: SPARK-18273 > URL: https://issues.apache.org/jira/browse/SPARK-18273 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aniket Bhatnagar >Priority: Minor > > If the paths Seq parameter contains a lot of elements, then > DataFrameReader.load takes a lot of time starting the job as it attempts to > check whether each of the paths exists using fs.exists. There should be a boolean > configuration option to disable this existence check, and it should be passed in as a > parameter to the DataSource.resolveRelation call. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed
[ https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aniket Bhatnagar resolved SPARK-18273. -- Resolution: Not A Problem Glob patterns can be passed instead of full paths to reduce the number of paths passed to the load method. > DataFrameReader.load takes a lot of time to start the job if a lot of > file/dir paths are passed > -- > > Key: SPARK-18273 > URL: https://issues.apache.org/jira/browse/SPARK-18273 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aniket Bhatnagar >Priority: Minor > > If the paths Seq parameter contains a lot of elements, then > DataFrameReader.load takes a lot of time starting the job as it attempts to > check whether each of the paths exists using fs.exists. There should be a boolean > configuration option to disable this existence check, and it should be passed in as a > parameter to the DataSource.resolveRelation call. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
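For reference, a small sketch of the suggested workaround (the directory layout below is made up for illustration): one glob pattern covers what would otherwise be a very long Seq of explicit paths.
{code}
// Instead of building and passing thousands of individual paths,
// let Hadoop's glob expansion do the work with a single pattern.
val df = spark.read
  .format("parquet")
  .load("hdfs:///data/events/2016/*/*/part-*")
{code}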
[jira] [Commented] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed
[ https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637731#comment-15637731 ] Aniket Bhatnagar commented on SPARK-18273: -- Thanks [~srowen]. Didn't realize that I could actually pass a glob pattern. Thank you so much. > DataFrameReader.load takes a lot of time to start the job if a lot of > file/dir paths are passed > -- > > Key: SPARK-18273 > URL: https://issues.apache.org/jira/browse/SPARK-18273 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aniket Bhatnagar >Priority: Minor > > If the paths Seq parameter contains a lot of elements, then > DataFrameReader.load takes a lot of time starting the job as it attempts to > check whether each of the paths exists using fs.exists. There should be a boolean > configuration option to disable this existence check, and it should be passed in as a > parameter to the DataSource.resolveRelation call. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637713#comment-15637713 ] Nicholas Chammas commented on SPARK-18277: -- {quote} If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the appropriate protocol for {{is}}. {quote} Ah my bad, in this case the appropriate thing to do is {{when(col('a.b').isNull(), '')}}. So there is a workaround available today via {{when()}} and {{isNull()}}. > na.fill() and friends should work on struct fields > -- > > Key: SPARK-18277 > URL: https://issues.apache.org/jira/browse/SPARK-18277 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nicholas Chammas >Priority: Minor > > It appears that you cannot use {{fill()}} and friends to quickly modify > struct fields. > For example: > {code} > >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), > >>> Row(a=Row(b=None), c=None)]) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: string (nullable = true) > |-- c: string (nullable = true) > >>> df.show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| null| > +---+---+ > >>> df.na.fill('').show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| | > +---+---+ > {code} > {{c}} got filled in, but {{a.b}} didn't. > I don't know if it's "appropriate", but it would be nice if {{fill()}} and > friends worked automatically on struct fields. > As things are today, there doesn't appear to be a way to fill in null values > inside structs. If you try {{when()}}, you realize that you cannot do > {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the > appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, > '')}} it doesn't catch the null values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
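A minimal Scala sketch of that workaround, rebuilding the struct column with the null field replaced (column names are taken from the example in the description; this is an illustration, not a general-purpose fill):
{code}
import org.apache.spark.sql.functions.{col, lit, struct, when}

// Reconstruct the struct "a", substituting an empty string when a.b is null.
val filled = df.withColumn(
  "a",
  struct(
    when(col("a.b").isNull, lit("")).otherwise(col("a.b")).alias("b")
  )
)
{code}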
[jira] [Assigned] (SPARK-18276) Some ML training summaries are not copied when {{copy()}} is called.
[ https://issues.apache.org/jira/browse/SPARK-18276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18276: Assignee: Apache Spark > Some ML training summaries are not copied when {{copy()}} is called. > > > Key: SPARK-18276 > URL: https://issues.apache.org/jira/browse/SPARK-18276 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Assignee: Apache Spark >Priority: Minor > > GaussianMixture, KMeans, BisectingKMeans, and GeneralizedLinearRegression > models do not copy their training summaries inside the {{copy}} method. In > contrast, Linear/Logistic regression models do. They should all be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
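As a rough, self-contained illustration of the consistency being asked for (all names below are made up, not the actual MLlib classes): a model's copy should carry its optional training summary along rather than silently dropping it.
{code}
// Toy example only; the real models wrap this in the ml.Model/Params machinery.
case class TrainingSummary(loss: Double, iterations: Int)

class ToyModel(val centers: Array[Double]) {
  var summary: Option[TrainingSummary] = None

  def copy(): ToyModel = {
    val copied = new ToyModel(centers.clone())
    copied.summary = summary  // without this line the summary is lost on copy
    copied
  }
}

object ToyModelDemo extends App {
  val m = new ToyModel(Array(1.0, 2.0))
  m.summary = Some(TrainingSummary(loss = 0.1, iterations = 20))
  assert(m.copy().summary.isDefined)  // holds only because copy() propagates it
}
{code}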
[jira] [Commented] (SPARK-18276) Some ML training summaries are not copied when {{copy()}} is called.
[ https://issues.apache.org/jira/browse/SPARK-18276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637671#comment-15637671 ] Apache Spark commented on SPARK-18276: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/15773 > Some ML training summaries are not copied when {{copy()}} is called. > > > Key: SPARK-18276 > URL: https://issues.apache.org/jira/browse/SPARK-18276 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > GaussianMixture, KMeans, BisectingKMeans, and GeneralizedLinearRegression > models do not copy their training summaries inside the {{copy}} method. In > contrast, Linear/Logistic regression models do. They should all be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18276) Some ML training summaries are not copied when {{copy()}} is called.
[ https://issues.apache.org/jira/browse/SPARK-18276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18276: Assignee: (was: Apache Spark) > Some ML training summaries are not copied when {{copy()}} is called. > > > Key: SPARK-18276 > URL: https://issues.apache.org/jira/browse/SPARK-18276 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > GaussianMixture, KMeans, BisectingKMeans, and GeneralizedLinearRegression > models do not copy their training summaries inside the {{copy}} method. In > contrast, Linear/Logistic regression models do. They should all be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637654#comment-15637654 ] Nicholas Chammas commented on SPARK-18277: -- Thanks for the pointer. I'll follow the discussion there. > na.fill() and friends should work on struct fields > -- > > Key: SPARK-18277 > URL: https://issues.apache.org/jira/browse/SPARK-18277 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nicholas Chammas >Priority: Minor > > It appears that you cannot use {{fill()}} and friends to quickly modify > struct fields. > For example: > {code} > >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), > >>> Row(a=Row(b=None), c=None)]) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: string (nullable = true) > |-- c: string (nullable = true) > >>> df.show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| null| > +---+---+ > >>> df.na.fill('').show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| | > +---+---+ > {code} > {{c}} got filled in, but {{a.b}} didn't. > I don't know if it's "appropriate", but it would be nice if {{fill()}} and > friends worked automatically on struct fields. > As things are today, there doesn't appear to be a way to fill in null values > inside structs. If you try {{when()}}, you realize that you cannot do > {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the > appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, > '')}} it doesn't catch the null values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson updated SPARK-18278: --- External issue URL: https://github.com/kubernetes/kubernetes/issues/34377 External issue ID: #34377 > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Affects Versions: 2.2.0 >Reporter: Erik Erlandson > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18258) Sinks need access to offset representation
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637621#comment-15637621 ] Cody Koeninger commented on SPARK-18258: So one obvious one is that if wherever checkpoint data is being stored fails or is corrupted, my downstream database can still be fine and have correct results, yet I have no way of restarting the job from a known point because the batch id stored in the database is now meaningless. Basically, I do not want to introduce another N points of failure in between Kafka and my data store. > Sinks need access to offset representation > -- > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink api that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
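To make the requested pattern concrete, here is a hedged sketch of what a sink could do if it were handed the offset JSON along with the batch (the table names, JDBC URL, and the offsets-as-JSON parameter are assumptions for illustration, not an existing Spark API): results and offsets commit or roll back together, so recovery does not depend on Spark's checkpoint directory.
{code}
import java.sql.DriverManager

// Write one batch of results and its ending offsets in a single transaction.
def addBatchTransactionally(batchId: Long,
                            rows: Seq[(String, Long)],
                            endOffsetsJson: String): Unit = {
  val conn = DriverManager.getConnection("jdbc:postgresql://db/example")  // URL assumed
  try {
    conn.setAutoCommit(false)
    val insert = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
    rows.foreach { case (k, v) =>
      insert.setString(1, k)
      insert.setLong(2, v)
      insert.executeUpdate()
    }
    val offsets = conn.prepareStatement(
      "INSERT INTO stream_offsets (batch_id, offsets_json) VALUES (?, ?)")
    offsets.setLong(1, batchId)
    offsets.setString(2, endOffsetsJson)
    offsets.executeUpdate()
    conn.commit()  // results and offsets become visible atomically
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}
{code}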
[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637622#comment-15637622 ] Michael Armbrust commented on SPARK-18277: -- We've been talking about better support for nested data for 2.2, [SPARK-16483]. > na.fill() and friends should work on struct fields > -- > > Key: SPARK-18277 > URL: https://issues.apache.org/jira/browse/SPARK-18277 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nicholas Chammas >Priority: Minor > > It appears that you cannot use {{fill()}} and friends to quickly modify > struct fields. > For example: > {code} > >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), > >>> Row(a=Row(b=None), c=None)]) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: string (nullable = true) > |-- c: string (nullable = true) > >>> df.show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| null| > +---+---+ > >>> df.na.fill('').show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| | > +---+---+ > {code} > {{c}} got filled in, but {{a.b}} didn't. > I don't know if it's "appropriate", but it would be nice if {{fill()}} and > friends worked automatically on struct fields. > As things are today, there doesn't appear to be a way to fill in null values > inside structs. If you try {{when()}}, you realize that you cannot do > {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the > appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, > '')}} it doesn't catch the null values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14387) Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
[ https://issues.apache.org/jira/browse/SPARK-14387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14387: -- Target Version/s: (was: 2.0.2) > Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc > - > > Key: SPARK-14387 > URL: https://issues.apache.org/jira/browse/SPARK-14387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Rajesh Balamohan > > In master branch, I tried to run TPC-DS queries (e.g Query27) at 200 GB > scale. Initially I got the following exception (as FileScanRDD has been made > the default in master branch) > {noformat} > 16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0. > java.lang.IllegalArgumentException: Field "s_store_sk" does not exist. > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236) > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:59) > at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235) > at > org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410) > at > org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410) > at > org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157) > at > org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361) > {noformat} > When running with "spark.sql.sources.fileScan=false", following exception is > thrown > {noformat} > 16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing > query, currentState RUNNING, > java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist. 
> at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236) > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:59) > at > org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235) > at > org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410) > at > org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:94) > at > org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410) > at > org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317) > at > org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124) > at >
[jira] [Comment Edited] (SPARK-18266) Update R vignettes and programming guide for 2.1.0 release
[ https://issues.apache.org/jira/browse/SPARK-18266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637604#comment-15637604 ] Felix Cheung edited comment on SPARK-18266 at 11/4/16 8:38 PM: --- I'm not sure it is, actually. If I recall there shouldn't be a lot of changes to the SparkR programming guide for this release - probably just additions to the list of ML algorithms. For vignettes, we need to add to the list the new ML functionality we are adding (including explanation, examples and so on). In fact spark.logit is one of the new things we need to document. You are welcome to take this on or split this up to start on this. was (Author: felixcheung): I'm not sure it is, actually. If I recall there shouldn't be a lot of (if any?) changes to the SparkR programming guide for this release. For vignettes, we need to add to the list the new ML functionality we are adding (including explanation, examples and so on). In fact spark.logit is one of the new things we need to document. You are welcome to take this on or split this up to start on this. > Update R vignettes and programming guide for 2.1.0 release > -- > > Key: SPARK-18266 > URL: https://issues.apache.org/jira/browse/SPARK-18266 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14241: -- Assignee: Cheng Lian > Output of monotonically_increasing_id lacks stable relation with rows of > DataFrame > -- > > Key: SPARK-14241 > URL: https://issues.apache.org/jira/browse/SPARK-14241 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Paul Shearer >Assignee: Cheng Lian > Fix For: 2.0.0 > > > If you use monotonically_increasing_id() to append a column of IDs to a > DataFrame, the IDs do not have a stable, deterministic relationship to the > rows they are appended to. A given ID value can land on different rows > depending on what happens in the task graph: > http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321 > From a user perspective this behavior is very unexpected, and many things one > would normally like to do with an ID column are in fact only possible under > very narrow circumstances. The function should either be made deterministic, > or there should be a prominent warning note in the API docs regarding its > behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
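Where a stable id really is required today, one commonly suggested workaround (see the linked answer) is to assign ids with zipWithIndex and rebuild the DataFrame. A hedged sketch follows, noting that the resulting ids are only as deterministic as the upstream partitioning and ordering; {{df}} and {{spark}} are assumed to exist.
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Append a row index derived from zipWithIndex rather than
// monotonically_increasing_id().
val withId = df.rdd.zipWithIndex().map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val schemaWithId = StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
val dfWithId = spark.createDataFrame(withId, schemaWithId)
{code}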
[jira] [Commented] (SPARK-18266) Update R vignettes and programming guide for 2.1.0 release
[ https://issues.apache.org/jira/browse/SPARK-18266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637604#comment-15637604 ] Felix Cheung commented on SPARK-18266: -- I'm not sure it is, actually. If I recall there shouldn't be a lot of (if any?) changes to the SparkR programming guide for this release. For vignettes, we need to add to the list the new ML functionality we are adding (including explanation, examples and so on). In fact spark.logit is one of the new things we need to document. You are welcome to take this on or split this up to start on this. > Update R vignettes and programming guide for 2.1.0 release > -- > > Key: SPARK-18266 > URL: https://issues.apache.org/jira/browse/SPARK-18266 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell reassigned SPARK-17337: - Assignee: Herman van Hovell > Incomplete algorithm for name resolution in Catalyst parser may lead to > incorrect result > --- > > Key: SPARK-17337 > URL: https://issues.apache.org/jira/browse/SPARK-17337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Assignee: Herman van Hovell > Labels: correctness > Fix For: 2.1.0 > > > While investigating SPARK-16951, I found a case of incorrect results from a NOT > IN subquery. I originally thought it was an edge case. Further investigation > found this is a more general problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637600#comment-15637600 ] Apache Spark commented on SPARK-17337: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/15772 > Incomplete algorithm for name resolution in Catalyst parser may lead to > incorrect result > --- > > Key: SPARK-17337 > URL: https://issues.apache.org/jira/browse/SPARK-17337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Labels: correctness > Fix For: 2.1.0 > > > While investigating SPARK-16951, I found a case of incorrect results from a NOT > IN subquery. I originally thought it was an edge case. Further investigation > found this is a more general problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637602#comment-15637602 ] Anirudh Ramanathan commented on SPARK-18278: Corresponding issue in kubernetes: https://github.com/kubernetes/kubernetes/issues/34377 > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Affects Versions: 2.2.0 >Reporter: Erik Erlandson > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17337. --- Resolution: Fixed Fix Version/s: 2.1.0 > Incomplete algorithm for name resolution in Catalyst parser may lead to > incorrect result > --- > > Key: SPARK-17337 > URL: https://issues.apache.org/jira/browse/SPARK-17337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Assignee: Herman van Hovell > Labels: correctness > Fix For: 2.1.0 > > > While investigating SPARK-16951, I found a case of incorrect results from a NOT > IN subquery. I originally thought it was an edge case. Further investigation > found this is a more general problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18258) Sinks need access to offset representation
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637595#comment-15637595 ] Michael Armbrust commented on SPARK-18258: -- What sort of failures are you anticipating here? > Sinks need access to offset representation > -- > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink api that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637591#comment-15637591 ] Erik Erlandson commented on SPARK-18278: Current prototype: https://github.com/foxish/spark/tree/k8s-support https://github.com/foxish/spark/pull/1 > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Affects Versions: 2.2.0 >Reporter: Erik Erlandson > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18258) Sinks need access to offset representation
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637576#comment-15637576 ] Cody Koeninger commented on SPARK-18258: The sink doesn't have to reason about equality of the representations. It just has to be able to store those representations, in addition to the batch id if necessary, so that the job can be recovered if Spark fails in a way that renders the batch id meaningless or the user wants to switch to a different streaming system. > Sinks need access to offset representation > -- > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink api that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18277) na.fill() and friends should work on struct fields
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637566#comment-15637566 ] Nicholas Chammas edited comment on SPARK-18277 at 11/4/16 8:25 PM: --- [~marmbrus] / [~yhuai]: Is there a workaround for this available today? Also, do you think {{fill()}} should fit this use case down the line? was (Author: nchammas): [~marmbrus] / [~yhuai]: Is there is workaround for this available today? Also, do you think {{fill()}} should fit this use case down the line? > na.fill() and friends should work on struct fields > -- > > Key: SPARK-18277 > URL: https://issues.apache.org/jira/browse/SPARK-18277 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nicholas Chammas >Priority: Minor > > It appears that you cannot use {{fill()}} and friends to quickly modify > struct fields. > For example: > {code} > >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), > >>> Row(a=Row(b=None), c=None)]) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: string (nullable = true) > |-- c: string (nullable = true) > >>> df.show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| null| > +---+---+ > >>> df.na.fill('').show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| | > +---+---+ > {code} > {{c}} got filled in, but {{a.b}} didn't. > I don't know if it's "appropriate", but it would be nice if {{fill()}} and > friends worked automatically on struct fields. > As things are today, there doesn't appear to be a way to fill in null values > inside structs. If you try {{when()}}, you realize that you cannot do > {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the > appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, > '')}} it doesn't catch the null values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
Erik Erlandson created SPARK-18278: -- Summary: Support native submission of spark jobs to a kubernetes cluster Key: SPARK-18278 URL: https://issues.apache.org/jira/browse/SPARK-18278 Project: Spark Issue Type: Umbrella Components: Build, Deploy, Documentation, Scheduler, Spark Core Affects Versions: 2.2.0 Reporter: Erik Erlandson A new Apache Spark sub-project that enables native support for submitting Spark applications to a kubernetes cluster. The submitted application runs in a driver executing on a kubernetes pod, and executor lifecycles are also managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18277) na.fill() and friends should work on struct fields
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18277: - Description: It appears that you cannot use {{fill()}} and friends to quickly modify struct fields. For example: {code} >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), >>> Row(a=Row(b=None), c=None)]) >>> df.printSchema() root |-- a: struct (nullable = true) ||-- b: string (nullable = true) |-- c: string (nullable = true) >>> df.show() +---+---+ | a| c| +---+---+ |[yeah yeah]|alright| | [null]| null| +---+---+ >>> df.na.fill('').show() +---+---+ | a| c| +---+---+ |[yeah yeah]|alright| | [null]| | +---+---+ {code} {{c}} got filled in, but {{a.b}} didn't. I don't know if it's "appropriate", but it would be nice if {{fill()}} and friends worked automatically on struct fields. As things are today, there doesn't appear to be a way to fill in null values inside structs. If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the appropriate protocol for {{is}}. And if you try {{when(col('a.b') == None, '')}} it doesn't catch the null values. was: It appears that you cannot use {{fill()}} and friends to quickly modify struct fields. For example: {code} >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), >>> Row(a=Row(b=None), c=None)]) >>> df.printSchema() root |-- a: struct (nullable = true) ||-- b: string (nullable = true) |-- c: string (nullable = true) >>> df.show() +---+---+ | a| c| +---+---+ |[yeah yeah]|alright| | [null]| null| +---+---+ >>> df.na.fill('').show() +---+---+ | a| c| +---+---+ |[yeah yeah]|alright| | [null]| | +---+---+ {code} {{c}} got filled in, but {{a.b}} didn't. I don't know if it's "appropriate", but it would be nice if {{fill()}} and friends worked automatically on struct fields. As things are today, there doesn't appear to be a way to fill in null values inside structs. If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the appropriate protocol for {{is}}, and if you try {{when(col('a.b') == None, '')}} it doesn't catch the null values. > na.fill() and friends should work on struct fields > -- > > Key: SPARK-18277 > URL: https://issues.apache.org/jira/browse/SPARK-18277 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nicholas Chammas >Priority: Minor > > It appears that you cannot use {{fill()}} and friends to quickly modify > struct fields. > For example: > {code} > >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), > >>> Row(a=Row(b=None), c=None)]) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: string (nullable = true) > |-- c: string (nullable = true) > >>> df.show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| null| > +---+---+ > >>> df.na.fill('').show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| | > +---+---+ > {code} > {{c}} got filled in, but {{a.b}} didn't. > I don't know if it's "appropriate", but it would be nice if {{fill()}} and > friends worked automatically on struct fields. > As things are today, there doesn't appear to be a way to fill in null values > inside structs. If you try {{when()}}, you realize that you cannot do > {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the > appropriate protocol for {{is}}. 
And if you try {{when(col('a.b') == None, > '')}} it doesn't catch the null values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18277) na.fill() and friends should work on struct fields
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637566#comment-15637566 ] Nicholas Chammas commented on SPARK-18277: -- [~marmbrus] / [~yhuai]: Is there is workaround for this available today? Also, do you think {{fill()}} should fit this use case down the line? > na.fill() and friends should work on struct fields > -- > > Key: SPARK-18277 > URL: https://issues.apache.org/jira/browse/SPARK-18277 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nicholas Chammas >Priority: Minor > > It appears that you cannot use {{fill()}} and friends to quickly modify > struct fields. > For example: > {code} > >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), > >>> Row(a=Row(b=None), c=None)]) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: string (nullable = true) > |-- c: string (nullable = true) > >>> df.show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| null| > +---+---+ > >>> df.na.fill('').show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| | > +---+---+ > {code} > {{c}} got filled in, but {{a.b}} didn't. > I don't know if it's "appropriate", but it would be nice if {{fill()}} and > friends worked automatically on struct fields. > As things are today, there doesn't appear to be a way to fill in null values > inside structs. > If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is > None, '')}} because {{Column}} doesn't implement the appropriate protocol for > {{is}}, and if you try {{when(col('a.b') == None, '')}} it doesn't catch the > null values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18277) na.fill() and friends should work on struct fields
[ https://issues.apache.org/jira/browse/SPARK-18277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18277: - Description: It appears that you cannot use {{fill()}} and friends to quickly modify struct fields. For example: {code} >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), >>> Row(a=Row(b=None), c=None)]) >>> df.printSchema() root |-- a: struct (nullable = true) ||-- b: string (nullable = true) |-- c: string (nullable = true) >>> df.show() +---+---+ | a| c| +---+---+ |[yeah yeah]|alright| | [null]| null| +---+---+ >>> df.na.fill('').show() +---+---+ | a| c| +---+---+ |[yeah yeah]|alright| | [null]| | +---+---+ {code} {{c}} got filled in, but {{a.b}} didn't. I don't know if it's "appropriate", but it would be nice if {{fill()}} and friends worked automatically on struct fields. As things are today, there doesn't appear to be a way to fill in null values inside structs. If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is None, '')}} because {{Column}} doesn't implement the appropriate protocol for {{is}}, and if you try {{when(col('a.b') == None, '')}} it doesn't catch the null values. was: It appears that you cannot use {{fill()}} and friends to quickly modify struct fields. For example: {code} >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), >>> Row(a=Row(b=None), c=None)]) >>> df.printSchema() root |-- a: struct (nullable = true) ||-- b: string (nullable = true) |-- c: string (nullable = true) >>> df.show() +---+---+ | a| c| +---+---+ |[yeah yeah]|alright| | [null]| null| +---+---+ >>> df.na.fill('').show() +---+---+ | a| c| +---+---+ |[yeah yeah]|alright| | [null]| | +---+---+ {code} {{c}} got filled in, but {{a.b}} didn't. I don't know if it's "appropriate", but it would be nice if {{fill()}} and friends worked automatically on struct fields. As things are today, it appears that you have to manually unpack and rebuild structs to do things like fill in missing field values. > na.fill() and friends should work on struct fields > -- > > Key: SPARK-18277 > URL: https://issues.apache.org/jira/browse/SPARK-18277 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nicholas Chammas >Priority: Minor > > It appears that you cannot use {{fill()}} and friends to quickly modify > struct fields. > For example: > {code} > >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), > >>> Row(a=Row(b=None), c=None)]) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: string (nullable = true) > |-- c: string (nullable = true) > >>> df.show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| null| > +---+---+ > >>> df.na.fill('').show() > +---+---+ > | a| c| > +---+---+ > |[yeah yeah]|alright| > | [null]| | > +---+---+ > {code} > {{c}} got filled in, but {{a.b}} didn't. > I don't know if it's "appropriate", but it would be nice if {{fill()}} and > friends worked automatically on struct fields. > As things are today, there doesn't appear to be a way to fill in null values > inside structs. > If you try {{when()}}, you realize that you cannot do {{when(col('a.b') is > None, '')}} because {{Column}} doesn't implement the appropriate protocol for > {{is}}, and if you try {{when(col('a.b') == None, '')}} it doesn't catch the > null values. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637493#comment-15637493 ] Yun Ni commented on SPARK-18081: This is super helpful. Thanks! > Locality Sensitive Hashing (LSH) User Guide > --- > > Key: SPARK-18081 > URL: https://issues.apache.org/jira/browse/SPARK-18081 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yun Ni > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18277) na.fill() and friends should work on struct fields
Nicholas Chammas created SPARK-18277: Summary: na.fill() and friends should work on struct fields Key: SPARK-18277 URL: https://issues.apache.org/jira/browse/SPARK-18277 Project: Spark Issue Type: Improvement Components: SQL Reporter: Nicholas Chammas Priority: Minor It appears that you cannot use {{fill()}} and friends to quickly modify struct fields. For example: {code} >>> df = spark.createDataFrame([Row(a=Row(b='yeah yeah'), c='alright'), >>> Row(a=Row(b=None), c=None)]) >>> df.printSchema() root |-- a: struct (nullable = true) ||-- b: string (nullable = true) |-- c: string (nullable = true) >>> df.show() +---+---+ | a| c| +---+---+ |[yeah yeah]|alright| | [null]| null| +---+---+ >>> df.na.fill('').show() +---+---+ | a| c| +---+---+ |[yeah yeah]|alright| | [null]| | +---+---+ {code} {{c}} got filled in, but {{a.b}} didn't. I don't know if it's "appropriate", but it would be nice if {{fill()}} and friends worked automatically on struct fields. As things are today, it appears that you have to manually unpack and rebuild structs to do things like fill in missing field values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18258) Sinks need access to offset representation
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637491#comment-15637491 ] Michael Armbrust commented on SPARK-18258: -- I agree that we don't want to lock people in, which is why a goal of [SPARK-17829] was to make the offset representation user readable. Given that they are accessible in this way (we should probably document this, and make it a long-term contract), I don't think that we want to widen the Sink API to expose the internal details of the various sources. Exposing more than we need to leaks details and could lead to a more brittle system. This is why I think it's safer to use {{batchId}} as a proxy to achieve transactional semantics. Consider the case where some source returns the offsets: {{a: 1, b: 2}} but upon recovery it returns {{b: 2, a: 1}}. This is a little weird, but as long as they implement {{getBatch}} correctly, there are no correctness issues in this Source. However, with this proposal, the sink is now responsible for reasoning about equality of these representations. In contrast, it's trivial to reason about equality in the current API. > Sinks need access to offset representation > -- > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink api that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
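A hedged sketch of the batchId-as-proxy approach described above (it assumes Spark 2.x's internal Sink trait; the JDBC connection, commit-log table, and helper methods are illustrative assumptions, not an existing Spark sink): the batch id is recorded in the same transaction as the results, so a replayed batch can be detected and skipped after recovery.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class IdempotentJdbcSink(url: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    val conn = java.sql.DriverManager.getConnection(url)
    try {
      conn.setAutoCommit(false)
      if (!alreadyCommitted(conn, batchId)) {   // hypothetical lookup in a commit-log table
        writeRows(conn, data)                   // hypothetical write of the batch's rows
        recordCommit(conn, batchId)             // recorded in the same transaction as the rows
      }
      conn.commit()
    } finally {
      conn.close()
    }
  }

  // Helpers elided; they are assumptions for illustration only.
  private def alreadyCommitted(conn: java.sql.Connection, id: Long): Boolean = ???
  private def writeRows(conn: java.sql.Connection, df: DataFrame): Unit = ???
  private def recordCommit(conn: java.sql.Connection, id: Long): Unit = ???
}
{code}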
[jira] [Created] (SPARK-18276) Some ML training summaries are not copied when {{copy()}} is called.
Seth Hendrickson created SPARK-18276: Summary: Some ML training summaries are not copied when {{copy()}} is called. Key: SPARK-18276 URL: https://issues.apache.org/jira/browse/SPARK-18276 Project: Spark Issue Type: Improvement Components: ML Reporter: Seth Hendrickson Priority: Minor GaussianMixture, KMeans, BisectingKMeans, and GeneralizedLinearRegression models do not copy their training summaries inside the {{copy}} method. In contrast, Linear/Logistic regression models do. They should all be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18258) Sinks need access to offset representation
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18258: - Target Version/s: 2.2.0 > Sinks need access to offset representation > -- > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink api that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637424#comment-15637424 ] Seth Hendrickson commented on SPARK-18081: -- No worries, just wanted to check in to see if you had bandwidth to do it. You can get a preview of the user guide by building the docs with jekyll {{SKIP_API=1 jekyll build}} inside the docs directory. For more detail, please see [the readme|https://github.com/apache/spark/tree/master/docs] > Locality Sensitive Hashing (LSH) User Guide > --- > > Key: SPARK-18081 > URL: https://issues.apache.org/jira/browse/SPARK-18081 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yun Ni > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18197) Optimise AppendOnlyMap implementation
[ https://issues.apache.org/jira/browse/SPARK-18197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18197. - Resolution: Fixed Assignee: Adam Roberts Fix Version/s: 2.1.0 > Optimise AppendOnlyMap implementation > - > > Key: SPARK-18197 > URL: https://issues.apache.org/jira/browse/SPARK-18197 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2, 2.0.1 >Reporter: Adam Roberts >Assignee: Adam Roberts >Priority: Minor > Fix For: 2.1.0 > > > This improvement works by using the cheapest comparison test first, and we > observed a 1% performance improvement on PageRank (HiBench large) with this change. > tprof output before the change follows; AppendOnlyMap.changeValue is where > the optimisation occurs: > {code} > PID 256059 22.86java_1fa29 > MOD 81337 7.26 JITCODE > SYM 11250 1.00 > java/io/ObjectOutputStream.writeObject0(Ljava/lang/Object;Z)V_7fe098983af4 > SYM 8053 0.72 > org/apache/spark/util/collection/AppendOnlyMap.changeValue(Ljava/lang/Object;Lscala/Function2;)Ljava/lang/Object;_7fe098c211e8 > SYM 5175 0.46 > java/lang/String.equals(Ljava/lang/Object;)Z_7fe0989eb2e8 > SYM 3616 0.32 > org/apache/spark/util/SizeEstimator$.estimate(Ljava/lang/Object;Ljava/util/IdentityHashMap;)J_7fe098bc35a8 > SYM 3235 0.29 > org/apache/spark/util/collection/ExternalSorter$$anonfun$4$$anon$6.compare(Ljava/lang/Object;Ljava/lang/Object;)I_7fe098c855a8 > SYM 3182 0.28 > java/io/ObjectInputStream$BlockDataInputStream.readUTFBody(J)Ljava/lang/String;_7fe098980ec8 > SYM 3111 0.28 > org/apache/spark/util/SizeEstimator$SearchState.enqueue(Ljava/lang/Object;)V_7fe0989f2920 > {code} > tprof after the change: > {code} > MOD 56804 5.07 JITCODE > SYM 8766 0.78 > java/io/ObjectOutputStream.writeObject0(Ljava/lang/Object;Z)V_7f0088bb2034 > SYM 5746 0.51 > java/io/ObjectStreamClass.lookup(Ljava/lang/Class;Z)Ljava/io/ObjectStreamClass;_7f0088944ae8 > SYM 3378 0.30 > java/io/ObjectInputStream.readObject0(Z)Ljava/lang/Object;_7f0088c5ed00 > SYM 3121 0.28 > java/io/ObjectInputStream$BlockDataInputStream.readUTFBody(J)Ljava/lang/String;_7f0088c6de08 > SYM 2857 0.26 > org/apache/spark/storage/BufferReleasingInputStream.read([BII)I_7f0088b3e3a8 > SYM 2786 0.25 > org/apache/spark/util/collection/AppendOnlyMap.changeValue(Ljava/lang/Object;Lscala/Function2;)Ljava/lang/Object;_7f008899b048 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
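The idea of ordering guards by cost can be illustrated with a small hedged sketch (this is not the actual AppendOnlyMap code): in a probe loop, the free null check and the cheap reference-equality check run before the comparatively expensive equals() call, so the expensive path only runs when it can actually decide the outcome.

{code}
// Keys live at even indices of `data`, values at odd indices (a flat open-addressing table).
def findSlot(data: Array[AnyRef], key: AnyRef, startPos: Int): Int = {
  val capacity = data.length / 2
  var pos = startPos
  var result = -1
  while (result == -1) {
    val cur = data(2 * pos)
    if (cur == null) result = pos          // cheapest test: empty slot
    else if (cur eq key) result = pos      // cheap: reference equality
    else if (cur == key) result = pos      // most expensive: full equals()
    else pos = (pos + 1) % capacity        // keep probing
  }
  result
}
{code}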
[jira] [Assigned] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe
[ https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18260: Assignee: Apache Spark > from_json can throw a better exception when it can't find the column or be > nullSafe > --- > > Key: SPARK-18260 > URL: https://issues.apache.org/jira/browse/SPARK-18260 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Burak Yavuz >Assignee: Apache Spark >Priority: Blocker > > I got this exception: > {code} > SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 > failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID > 74170, 10.0.138.84, executor 2): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > {code} > This was because the column that I called `from_json` on didn't exist for all > of my rows. Either from_json should be null safe, or it should fail with a > better error message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
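Until from_json itself is made null safe, a hedged workaround sketch like the following may help (the DataFrame {{df}}, the {{payload}} column name, and the schema are assumptions for illustration): guarding the parse makes rows where the column is null produce a null struct instead of hitting the NullPointerException in JsonToStruct.eval.

{code}
import org.apache.spark.sql.functions.{col, from_json, when}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("id", StringType)))   // illustrative schema

// Rows whose payload is null fall through to the implicit otherwise(null).
val parsed = df.withColumn(
  "parsed",
  when(col("payload").isNotNull, from_json(col("payload"), schema)))
{code}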
[jira] [Commented] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe
[ https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637290#comment-15637290 ] Apache Spark commented on SPARK-18260: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/15771 > from_json can throw a better exception when it can't find the column or be > nullSafe > --- > > Key: SPARK-18260 > URL: https://issues.apache.org/jira/browse/SPARK-18260 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Burak Yavuz >Priority: Blocker > > I got this exception: > {code} > SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 > failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID > 74170, 10.0.138.84, executor 2): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > {code} > This was because the column that I called `from_json` on didn't exist for all > of my rows. Either from_json should be null safe, or it should fail with a > better error message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe
[ https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18260: Assignee: (was: Apache Spark) > from_json can throw a better exception when it can't find the column or be > nullSafe > --- > > Key: SPARK-18260 > URL: https://issues.apache.org/jira/browse/SPARK-18260 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Burak Yavuz >Priority: Blocker > > I got this exception: > {code} > SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 > failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID > 74170, 10.0.138.84, executor 2): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > {code} > This was because the column that I called `from_json` on didn't exist for all > of my rows. Either from_json should be null safe, or it should fail with a > better error message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18273) DataFrameReader.load takes a lot of time to start the job if a lot of file/dir paths are passed
[ https://issues.apache.org/jira/browse/SPARK-18273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18273: -- Priority: Minor (was: Major) I'm not sure it's worth the complexity. How about passing a glob parameter? > DataFrameReader.load takes a lot of time to start the job if a lot of > file/dir paths are passed > -- > > Key: SPARK-18273 > URL: https://issues.apache.org/jira/browse/SPARK-18273 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aniket Bhatnagar >Priority: Minor > > If the paths Seq parameter contains a lot of elements, then > DataFrameReader.load takes a lot of time starting the job as it attempts to > check whether each of the paths exists using fs.exists. There should be a boolean > configuration option to disable the check for path existence, and that > should be passed in as a parameter to the DataSource.resolveRelation call. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
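The glob suggestion can be sketched as follows (the bucket layout and pattern are made up for illustration): a single glob pattern lets the file system expand the matching paths, instead of the caller passing thousands of explicit paths for load() to check one by one.

{code}
// One glob instead of a Seq of thousands of concrete paths.
val df = spark.read
  .format("parquet")
  .load("s3://my-bucket/events/2016/*/part-*")
{code}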
[jira] [Commented] (SPARK-18266) Update R vignettes and programming guide for 2.1.0 release
[ https://issues.apache.org/jira/browse/SPARK-18266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637204#comment-15637204 ] Miao Wang commented on SPARK-18266: --- [~felixcheung] Is this an umbrella JIRA? > Update R vignettes and programming guide for 2.1.0 release > -- > > Key: SPARK-18266 > URL: https://issues.apache.org/jira/browse/SPARK-18266 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18193) queueStream not updated if rddQueue.add is called after queueStream is created in Java
[ https://issues.apache.org/jira/browse/SPARK-18193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18193. --- Resolution: Not A Problem I looked into this further and found that it does work to add RDDs after it's started in Scala, but not Java. This is because the Java implementation makes a copy of the queue. This is consistent with the docs and examples actually. So I'm re-closing as not a problem; I do not believe this should be reopened. > queueStream not updated if rddQueue.add is called after queueStream is created in Java > > > Key: SPARK-18193 > URL: https://issues.apache.org/jira/browse/SPARK-18193 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.1 >Reporter: Hubert Kang > > Within > examples\src\main\java\org\apache\spark\examples\streaming\JavaQueueStream.java, > no data is detected if the code below, which adds something to rddQueue, is > executed after queueStream is created (line 65). > for (int i = 0; i < 30; i++) { > rddQueue.add(ssc.sparkContext().parallelize(list)); > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
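For contrast with the Java behaviour described above, here is a hedged Scala sketch (it assumes an existing StreamingContext {{ssc}} with a batch interval already set): because the Scala API keeps a reference to the mutable queue rather than copying it, RDDs enqueued after the stream is created, even after start(), are still consumed.

{code}
import scala.collection.mutable
import org.apache.spark.rdd.RDD

val rddQueue = new mutable.Queue[RDD[Int]]()
val stream = ssc.queueStream(rddQueue)
stream.count().print()
ssc.start()

for (_ <- 1 to 30) {
  rddQueue += ssc.sparkContext.parallelize(1 to 1000)   // added after start(); still picked up
  Thread.sleep(1000)
}
{code}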
[jira] [Resolved] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema
[ https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18065. --- Resolution: Not A Problem > Spark 2 allows filter/where on columns not in current schema > > > Key: SPARK-18065 > URL: https://issues.apache.org/jira/browse/SPARK-18065 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Matthew Scruggs >Priority: Minor > > I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a > DataFrame that previously had a column, but no longer has it in its schema > due to a select() operation. > In Spark 1.6.2, in spark-shell, we see that an exception is thrown when > attempting to filter/where using the selected-out column: > {code:title=Spark 1.6.2} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.2 > /_/ > Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) > Type in expressions to have them evaluated. > Type :help for more information. > Spark context available as sc. > SQL context available as sqlContext. > scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), > (2, "two".selectExpr("_1 as id", "_2 as word") > df1: org.apache.spark.sql.DataFrame = [id: int, word: string] > scala> df1.show() > +---++ > | id|word| > +---++ > | 1| one| > | 2| two| > +---++ > scala> val df2 = df1.select("id") > df2: org.apache.spark.sql.DataFrame = [id: int] > scala> df2.printSchema() > root > |-- id: integer (nullable = false) > scala> df2.where("word = 'one'").show() > org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input > columns: [id]; > {code} > However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds > (no AnalysisException) and seems to filter out data as if the column remains: > {code:title=Spark 2.0.1} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.1 > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) > Type in expressions to have them evaluated. > Type :help for more information. > scala> val df1 = sc.parallelize(Seq((1, "one"), (2, > "two"))).toDF().selectExpr("_1 as id", "_2 as word") > df1: org.apache.spark.sql.DataFrame = [id: int, word: string] > scala> df1.show() > +---++ > | id|word| > +---++ > | 1| one| > | 2| two| > +---++ > scala> val df2 = df1.select("id") > df2: org.apache.spark.sql.DataFrame = [id: int] > scala> df2.printSchema() > root > |-- id: integer (nullable = false) > scala> df2.where("word = 'one'").show() > +---+ > | id| > +---+ > | 1| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17969. --- Resolution: Won't Fix > I think it's user unfriendly to process standard json file with DataFrame > -- > > Key: SPARK-17969 > URL: https://issues.apache.org/jira/browse/SPARK-17969 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.0.1 >Reporter: Jianfei Wang >Priority: Minor > > Currently, with the DataFrame API, we can't load a standard (multi-line) JSON file directly. Maybe we could provide an overload to handle this; the logic is roughly as below: > {code} > // read each whole file as a single (path, content) record > val files = spark.sparkContext.wholeTextFiles("data/test.json") > // drop the path and hand each whole document to the JSON reader > val jsonRdd = files.map { case (_, content) => content } > val jsonDf = spark.read.json(jsonRdd) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17945) Writing to S3 should allow setting object metadata
[ https://issues.apache.org/jira/browse/SPARK-17945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17945. --- Resolution: Won't Fix > Writing to S3 should allow setting object metadata > -- > > Key: SPARK-17945 > URL: https://issues.apache.org/jira/browse/SPARK-17945 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Jeff Schobelock >Priority: Minor > > I can't find any possible way to use Spark to write to S3 and set user object > metadata. This seems like such a simple thing that I feel I must be missing > how to do it somewhere, but I have yet to find anything. > I don't know how much work adding this would entail. My idea would be that > there is something like: > rdd.saveAsTextFile("s3://testbucket/file").withMetadata(Map<String, String> data). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637189#comment-15637189 ] Miao Wang commented on SPARK-15784: --- I created a new PR to implement PIC as a Transformer. > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637186#comment-15637186 ] Nattavut Sutyanyong commented on SPARK-17337: - Totally agreed on your approach. We should close off any potential incorrect results as soon as possible. Thanks for the advice on the approach of different/competing PRs. I am relatively new to the community and am trying to be careful not to break any of the community's etiquette. > Incomplete algorithm for name resolution in Catalyst parser may lead to > incorrect result > --- > > Key: SPARK-17337 > URL: https://issues.apache.org/jira/browse/SPARK-17337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Labels: correctness > > While investigating SPARK-16951, I found an incorrect results case from a NOT > IN subquery. I originally thought it was an edge case. Further investigation > found this is a more general problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637167#comment-15637167 ] Herman van Hovell commented on SPARK-17337: --- [~nsyca] You are right to say that this is part of a larger set of problems which we should address; the number of attempts made to fix this proves that it is non-trivial. It is not a problem to have different (competing) PRs for this problem; this usually leads to solutions. So please submit a PR if you feel that this is a better solution. However, for this ticket I just want to solve the correctness issue at hand. > Incomplete algorithm for name resolution in Catalyst parser may lead to > incorrect result > --- > > Key: SPARK-17337 > URL: https://issues.apache.org/jira/browse/SPARK-17337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Labels: correctness > > While investigating SPARK-16951, I found an incorrect results case from a NOT > IN subquery. I originally thought it was an edge case. Further investigation > found this is a more general problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637132#comment-15637132 ] Apache Spark commented on SPARK-15784: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/15770 > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18128) Add support for publishing to PyPI
[ https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-18128: Issue Type: Sub-task (was: Improvement) Parent: SPARK-18267 > Add support for publishing to PyPI > -- > > Key: SPARK-18128 > URL: https://issues.apache.org/jira/browse/SPARK-18128 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: holdenk > > After SPARK-1267 is done we should add support for publishing to PyPI similar > to how we publish to Maven Central. > Note: one of the open questions is what to do about package name since > someone has registered the package name PySpark on PyPI - we could use > ApachePySpark or we could try to find who registered PySpark and get > them to transfer it to us (since they haven't published anything so maybe > fine?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI
[ https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637110#comment-15637110 ] holdenk commented on SPARK-18128: - When I e-mailed [~prabinb] earlier this week I got an out-of-office auto-response, so if we don't hear back we should maybe ping again next week :) > Add support for publishing to PyPI > -- > > Key: SPARK-18128 > URL: https://issues.apache.org/jira/browse/SPARK-18128 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: holdenk > > After SPARK-1267 is done we should add support for publishing to PyPI similar > to how we publish to Maven Central. > Note: one of the open questions is what to do about package name since > someone has registered the package name PySpark on PyPI - we could use > ApachePySpark or we could try to find who registered PySpark and get > them to transfer it to us (since they haven't published anything so maybe > fine?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI
[ https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637111#comment-15637111 ] holdenk commented on SPARK-18128: - Sure > Add support for publishing to PyPI > -- > > Key: SPARK-18128 > URL: https://issues.apache.org/jira/browse/SPARK-18128 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: holdenk > > After SPARK-1267 is done we should add support for publishing to PyPI similar > to how we publish to Maven Central. > Note: one of the open questions is what to do about package name since > someone has registered the package name PySpark on PyPI - we could use > ApachePySpark or we could try to find who registered PySpark and get > them to transfer it to us (since they haven't published anything so maybe > fine?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637102#comment-15637102 ] Yun Ni commented on SPARK-18081: Sorry, I was really overloaded this week. I will try my best to send a PR before EoW. BTW, when I was writing ml-lsh.md, I could not find a place to get a preview. Do you have any guidelines for preview the Spark docs? > Locality Sensitive Hashing (LSH) User Guide > --- > > Key: SPARK-18081 > URL: https://issues.apache.org/jira/browse/SPARK-18081 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yun Ni > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI
[ https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637104#comment-15637104 ] holdenk commented on SPARK-18128: - Good call - publishing to the PyPI test server has worked fine, but there might be a different limit in production, and/or we might be close to some file size limit that we will hit in the future, so it would be better to have an exemption in place first. > Add support for publishing to PyPI > -- > > Key: SPARK-18128 > URL: https://issues.apache.org/jira/browse/SPARK-18128 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: holdenk > > After SPARK-1267 is done we should add support for publishing to PyPI similar > to how we publish to Maven Central. > Note: one of the open questions is what to do about package name since > someone has registered the package name PySpark on PyPI - we could use > ApachePySpark or we could try to find who registered PySpark and get > them to transfer it to us (since they haven't published anything so maybe > fine?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-17337: -- Labels: correctness (was: ) > Incomplete algorithm for name resolution in Catalyst parser may lead to > incorrect result > --- > > Key: SPARK-17337 > URL: https://issues.apache.org/jira/browse/SPARK-17337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Labels: correctness > > While investigating SPARK-16951, I found an incorrect results case from a NOT > IN subquery. I originally thought it was an edge case. Further investigation > found this is a more general problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation
[ https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637099#comment-15637099 ] Herman van Hovell commented on SPARK-17348: --- Yeah, it would be nice if you can consolidate the two, and put your current fix there as well. > Incorrect results from subquery transformation > -- > > Key: SPARK-17348 > URL: https://issues.apache.org/jira/browse/SPARK-17348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Labels: correctness > > {noformat} > Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1") > Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= > t2.c2)").show > +---+ > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result of the above query should be an empty set. Here is an > explanation: > Both rows from T2 satisfies the correlated predicate T1.C2 >= T2.C2 when > T1.C1 = 1 so both rows needs to be processed in the same group of the > aggregation process in the subquery. The result of the aggregation yields > MAX(T2.C1) as 2. Therefore, the result of the evaluation of the predicate > T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) should be an empty set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18275) Why not use an ordered queue in takeOrdered?
[ https://issues.apache.org/jira/browse/SPARK-18275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18275. --- Resolution: Not A Problem Questions should go to user@. Priority queues are not necessarily sorted and in fact are generally not sorted because they use a heap representation internally. It's not true that inserts are O(log k) if you are maintaining a sorted data structure. > Why not use an ordered queue in takeOrdered? > - > > Key: SPARK-18275 > URL: https://issues.apache.org/jira/browse/SPARK-18275 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: xubo245 >Priority: Minor > > Every partition in mapRDDs is defined as a BoundedPriorityQueue object: > val queue = new BoundedPriorityQueue[T](num)(ord.reverse) > in org.apache.spark.rdd.RDD#takeOrdered. > After mapRDDs.reduce, only one queue is returned and it is a > BoundedPriorityQueue, so after toArray, is it necessary to call sorted in > takeOrdered? If we could keep the queue ordered, we would only need to reverse it. > Similarly, in org.apache.spark.util.collection.Utils#takeOrdered, the > leastOf method also uses an unordered buffer; why not use an ordered queue? > We could insert a number in O(log k), while the traditional quickselect algorithm > takes O(k) time; also, we would not need a sort after selecting, saving O(k * > log k). > The leastOf called is > com.google.common.collect.Ordering#leastOf(java.util.Iterator, int) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
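The resolution above can be illustrated with a small hedged sketch (plain Scala, not Spark's internal code): a bounded heap keeps the k smallest elements cheaply, but a heap's iteration order is unspecified, so one final O(k log k) sort is still required.

{code}
import scala.collection.mutable

def takeSmallest(xs: Iterator[Int], k: Int): Array[Int] = {
  val heap = mutable.PriorityQueue.empty[Int]   // max-heap: the largest of the kept k sits on top
  xs.foreach { x =>
    if (heap.size < k) heap += x
    else if (x < heap.head) { heap.dequeue(); heap += x }
  }
  heap.toArray.sorted                           // heap order is not sorted order; sort once at the end
}
{code}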
[jira] [Commented] (SPARK-18128) Add support for publishing to PyPI
[ https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636944#comment-15636944 ] Nicholas Chammas commented on SPARK-18128: -- For the record: Let's also check with the PyPI admins if we need some kind of exemption on any file size limits. > Add support for publishing to PyPI > -- > > Key: SPARK-18128 > URL: https://issues.apache.org/jira/browse/SPARK-18128 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: holdenk > > After SPARK-1267 is done we should add support for publishing to PyPI similar > to how we publish to Maven Central. > Note: one of the open questions is what to do about package name since > someone has registered the package name PySpark on PyPI - we could use > ApachePySpark or we could try to find who registered PySpark and get > them to transfer it to us (since they haven't published anything so maybe > fine?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation
[ https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636886#comment-15636886 ] Nattavut Sutyanyong commented on SPARK-17348: - I'd like to note that a piece of existing code that comes close to addressing this problem, but only for the scalar subquery context, can be found by searching for this pattern in CheckAnalysis.scala (line 126): _"The correlated scalar subquery can only contain equality predicates:"_ Existing test coverage for that scenario is at _SubquerySuite/non-equal correlated scalar subquery_ > Incorrect results from subquery transformation > -- > > Key: SPARK-17348 > URL: https://issues.apache.org/jira/browse/SPARK-17348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Labels: correctness > > {noformat} > Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1") > Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= > t2.c2)").show > +---+ > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result of the above query should be an empty set. Here is an > explanation: > Both rows from T2 satisfies the correlated predicate T1.C2 >= T2.C2 when > T1.C1 = 1 so both rows needs to be processed in the same group of the > aggregation process in the subquery. The result of the aggregation yields > MAX(T2.C1) as 2. Therefore, the result of the evaluation of the predicate > T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) should be an empty set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip
[ https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636802#comment-15636802 ] György Süveges edited comment on SPARK-4563 at 11/4/16 4:13 PM: +1 However, it's related to SPARK_LOCAL_IP and Spark UI, I think it's similar. We use AWS EMR5 (Spark 2.0 over YARN) While developing we use spark from intellij in yarn-client mode. To connect to AWS, OpenVPN is applied with biderectional routing so the executors in amazon can access the driver on the local workstation. With OpenVPN there are multiple network interfaces then, so SPARK_LOCAL_IP should be set. But it's really inconvenient to set SPARK_LOCAL_IP each time we connect to OpenVPN and get different IP. So we decided to set it automatically from code. Since SPARK_LOCAL_IP is env var and env vars are inmutable in java, we change the spark.driver.host corresponding java property. But as noticed spark.driver.host is only second order citizen in spark, for some features only the SPARK_LOCAL_IP is taken into account. For example Spark UI listens on the host set in SPARK_LOCAL_IP, but ignores the spark.driver.host totally. I hope this improvement will fix this issue, as I saw in https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42 However in [WebUI|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L139] on master I can still see places where only SPARK_LOCAL_IP is checked only I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, shouldn't they? was (Author: gyorgy_suve...@epam.com): +1 However, it's related to SPARK_LOCAL_IP and Spark UI, I think it's similar. We use AWS EMR5 (Spark 2.0 over YARN) While developing we use spark from intellij in yarn-client mode. To connect to AWS, OpenVPN is applied with biderectional routing so the executors in amazon can access the driver on the local workstation. With OpenVPN there are multiple network interfaces then, so SPARK_LOCAL_IP should be set. But it's really inconvenient to set SPARK_LOCAL_IP each time we connect to OpenVPN and get different IP. So we decided to set it automatically from code. Since SPARK_LOCAL_IP is env var and env vars are inmutable in java, we change the spark.driver.host corresponding java property. But as noticed spark.driver.host is only second order citizen in spark, for some features only the SPARK_LOCAL_IP is taken into account. For example Spark UI listens on the host set in SPARK_LOCAL_IP, but ignores the spark.driver.host totally. I hope this improvement will fix this issue, as I saw in https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42 I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, shouldn't they? > Allow spark driver to bind to different ip then advertise ip > > > Key: SPARK-4563 > URL: https://issues.apache.org/jira/browse/SPARK-4563 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Long Nguyen >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.1.0 > > > Spark driver bind ip and advertise is not configurable. spark.driver.host is > only bind ip. SPARK_PUBLIC_DNS does not work for spark driver. Allow option > to set advertised ip/hostname -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
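The programmatic approach described in the comment can be sketched as follows (the {{tun0}} interface name and the site-local filter are assumptions about a typical OpenVPN setup, not a recommendation): resolve the VPN-facing address at startup and set spark.driver.host on the SparkConf, instead of exporting SPARK_LOCAL_IP by hand each time.

{code}
import java.net.NetworkInterface
import scala.collection.JavaConverters._
import org.apache.spark.SparkConf

// Pick the address of the assumed OpenVPN interface, falling back to loopback.
val vpnAddress = Option(NetworkInterface.getByName("tun0"))
  .toSeq
  .flatMap(_.getInetAddresses.asScala)
  .find(a => !a.isLoopbackAddress && a.isSiteLocalAddress)
  .map(_.getHostAddress)
  .getOrElse("127.0.0.1")

val conf = new SparkConf().set("spark.driver.host", vpnAddress)
{code}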
[jira] [Comment Edited] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip
[ https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636802#comment-15636802 ] György Süveges edited comment on SPARK-4563 at 11/4/16 4:01 PM: +1 However, it's related to SPARK_LOCAL_IP and Spark UI, I think it's similar. We use AWS EMR5 (Spark 2.0 over YARN) While developing we use spark from intellij in yarn-client mode. To connect to AWS, OpenVPN is applied with biderectional routing so the executors in amazon can access the driver on the local workstation. With OpenVPN there are multiple network interfaces then, so SPARK_LOCAL_IP should be set. But it's really inconvenient to set SPARK_LOCAL_IP each time we connect to OpenVPN and get different IP. So we decided to set it automatically from code. Since SPARK_LOCAL_IP is env var and env vars are inmutable in java, we change the spark.driver.host corresponding java property. But as noticed spark.driver.host is only second order citizen in spark, for some features only the SPARK_LOCAL_IP is taken into account. For example Spark UI listens on the host set in SPARK_LOCAL_IP, but ignores the spark.driver.host totally. I hope this improvement will fix this issue, as I saw in https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42 I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, shouldn't they? was (Author: gyorgy_suve...@epam.com): +1 However, it's related to SPARK_LOCAL_IP and Spark UI, I think it's similar. We use AWS EMR5 (Spark 2.0 over YARN) While developing we use spark from intellij in yarn-client mode. To connect to AWS, OpenVPN is applied with biderectional routing so the executors in amazon can access the driver on the local workstation. With OpenVPN there are multiple network interfaces then, so SPARK_LOCAL_IP should be set. But it's really inconvenient to set SPARK_LOCAL_IP each time we connect to OpenVPN and get different IP. So we decided to set it automatically from code. Since SPARK_LOCAL_IP is env var and env vars are inmutable in java, then we change the spark.driver.host corresponding java property. But as noticed spark.driver.host is only second order citizen in spark, for some features only the SPARK_LOCAL_IP is taken into account. For example Spark UI listens on the host set in SPARK_LOCAL_IP, but ignores the spark.driver.host totally. I hope this improvement will fix this issue, as I saw in https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42 I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, shouldn't they? > Allow spark driver to bind to different ip then advertise ip > > > Key: SPARK-4563 > URL: https://issues.apache.org/jira/browse/SPARK-4563 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Long Nguyen >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.1.0 > > > Spark driver bind ip and advertise is not configurable. spark.driver.host is > only bind ip. SPARK_PUBLIC_DNS does not work for spark driver. Allow option > to set advertised ip/hostname -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip
[ https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636802#comment-15636802 ] György Süveges edited comment on SPARK-4563 at 11/4/16 4:00 PM: +1 However, it's related to SPARK_LOCAL_IP and Spark UI, I think it's similar. We use AWS EMR5 (Spark 2.0 over YARN) While developing we use spark from intellij in yarn-client mode. To connect to AWS, OpenVPN is applied with biderectional routing so the executors in amazon can access the driver on the local workstation. With OpenVPN there are multiple network interfaces then, so SPARK_LOCAL_IP should be set. But it's really inconvenient to set SPARK_LOCAL_IP each time we connect to OpenVPN and get different IP. So we decided to set it automatically from code. Since SPARK_LOCAL_IP is env var and env vars are inmutable in java, then we change the spark.driver.host corresponding java property. But as noticed spark.driver.host is only second order citizen in spark, for some features only the SPARK_LOCAL_IP is taken into account. For example Spark UI listens on the host set in SPARK_LOCAL_IP, but ignores the spark.driver.host totally. I hope this improvement will fix this issue, as I saw in https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42 I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, shouldn't they? was (Author: gyorgy_suve...@epam.com): +1 However, it's related to SPARK_LOCAL_IP and Spark UI, I think it's similar. We use AWS EMR5 (Spark 2.0 over YARN) While developing we use spark from intellij in yarn-client mode. To connect to AWS, OpenVPN is applied with biderectional routing so the executors in amazon can access the driver on the local workstation. With OpenVPN there are multiple network interfaces then, so SPARK_LOCAL_IP should be set. But it's really inconvenient to set SPARK_LOCAL_IP each time we connect to OpenVPN and get different IP. So we decided to set it automatically from code. Since SPARK_LOCAL_IP is env var and env vars are inmutable in java, then we change the spark.driver.host corresponding java property. But as notices spark.driver.host is only second order citizen in spark, for some features only the SPARK_LOCAL_IP is taken into account. For example Spark UI listens on the host set in SPARK_LOCAL_IP, but ignores the spark.driver.host totally. I hope this improvement will fix this issue, as I saw in https://github.com/apache/spark/commit/2cd1bfa4f0c6625b0ab1dbeba2b9586b9a6a9f42 I think SPARK_LOCAL_IP and spark.driver.host should be equivalent counterparts, shouldn't they? > Allow spark driver to bind to different ip then advertise ip > > > Key: SPARK-4563 > URL: https://issues.apache.org/jira/browse/SPARK-4563 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Long Nguyen >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.1.0 > > > Spark driver bind ip and advertise is not configurable. spark.driver.host is > only bind ip. SPARK_PUBLIC_DNS does not work for spark driver. Allow option > to set advertised ip/hostname -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org