[jira] [Commented] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871427#comment-15871427
 ] 

Sean Owen commented on SPARK-14194:
---

This is hard to fix because the source of text data splits this into lines 
before it's ever seen by the CSV parser. I can imagine trying to stitch them 
back together with a transform over a window of lines, but it's going to be 
hard to do given how the plumbing works.

> spark csv reader not working properly if CSV content contains CRLF character 
> (newline) in the intermediate cell
> ---
>
> Key: SPARK-14194
> URL: https://issues.apache.org/jira/browse/SPARK-14194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.1.0
>Reporter: Kumaresh C R
>
> We have CSV content like below:
> Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\r\n
> "1", "ABCD", "XYZ", "1234", "XZ Street \r\n(CRLF character), 
> Municipality,","USA", "1234567"
> Since there is a CRLF ('\r\n') sequence in the middle of a row (to be exact, 
> in the Address column), when we execute the Spark code below, it creates a 
> dataframe with two rows (excluding the header row), which is wrong. Since we 
> have specified the quote character ("), why does it treat the character in 
> the middle of a quoted cell as a row-ending newline? This creates an issue 
> while processing the resulting dataframe.
> DataFrame df = sqlContextManager.getSqlContext().read()
>     .format("com.databricks.spark.csv")
>     .option("header", "true")
>     .option("inferSchema", "true")
>     .option("delimiter", delim)
>     .option("quote", quote)
>     .option("escape", escape)
>     .load(sourceFile);
>
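The diagnosis above can be illustrated outside Spark. A quote-aware CSV parser (Python's csv module here, purely as a stand-in) treats a newline inside a quoted field as cell data, while naive line splitting, which is effectively what a line-oriented text source does before any CSV parser runs, breaks the record in two:

```python
import csv
import io

# CSV content with a CRLF embedded inside the quoted Address field
# (field values are illustrative, loosely based on the report).
raw = ('Sl.NO,Employee_Name,Company,Address,Country,ZIP_Code\r\n'
       '"1","ABCD","XYZ","XZ Street\r\nMunicipality","USA","1234567"\r\n')

# A quote-aware parser keeps the embedded newline as field content,
# yielding exactly one header row plus one data row.
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))        # 2

# Naive line splitting breaks the quoted record into two pieces,
# which matches the two-row dataframe behavior reported here.
lines = raw.splitlines()
print(len(lines))       # 3
```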



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19646.
---
Resolution: Not A Problem

My best guess is that this is not a bug: you're reusing the objects you get 
from the API rather than cloning them. The objects are reused by the 
InputFormat, so you have to clone them if you want to retain them.
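A minimal Python sketch (hypothetical; not Hadoop or Spark code) of this object-reuse pitfall: a record reader that reuses one buffer, the way Hadoop InputFormats reuse Writable instances, hands back the same object every time, so collected references all show the final record unless each one is cloned as it is consumed:

```python
def records(data, size):
    """Yield fixed-size records from `data`, reusing ONE buffer for every
    record -- mimicking how Hadoop record readers reuse Writable objects."""
    buf = bytearray(size)
    for off in range(0, len(data), size):
        buf[:] = data[off:off + size]
        yield buf

data = bytes(range(12))  # four 3-byte records

# Keeping the yielded objects without cloning: every element aliases the
# same buffer, so all "records" end up showing the last record's bytes.
reused = list(records(data, 3))
print(len(set(bytes(r) for r in reused)))   # 1 distinct value

# Cloning each record as it is consumed preserves the distinct contents.
cloned = [bytes(r) for r in records(data, 3)]
print(len(set(cloned)))                     # 4 distinct values
```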

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Priority: Minor
>
> The Scala sc.binaryRecords replicates one record across the entire set.
> For example, I am trying to load the CIFAR binary data, where within one big 
> binary file each 3073 bytes represent a 32x32x3-byte image plus 1 byte for 
> the label. The file resides on my local filesystem.
> .take(5) returns 5 records, all the same; .collect() returns 10,000 records, 
> all the same.
> What is puzzling is that the PySpark version works perfectly, even though 
> underneath it calls the Scala implementation.
> I have tested this on 2.1.0 and 2.0.0.






[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871436#comment-15871436
 ] 

Sean Owen commented on SPARK-19644:
---

What you have described so far is not a memory leak in Spark. It's normal for 
the heap to grow until there is a reason to garbage collect, and you're not 
evidently running out of memory. You're talking about a heap change from 1.1MB 
to 1.3MB, which is trivial (is this a typo?). I'd close this unless you have a 
clearer case.

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>Priority: Critical
>  Labels: memory_leak, performance
> Attachments: heapdump.png
>
>
> I am using streaming in production for some aggregation, fetching data from 
> Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 
> bytes to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, the number of instances of class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.






[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871444#comment-15871444
 ] 

Deenbandhu Agarwal commented on SPARK-19644:


You can ignore that change in memory, but if you look at the snapshot, the 
number of instances of class scala.collection.immutable.$colon$colon has grown 
very large, and it keeps increasing over time.







[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871452#comment-15871452
 ] 

Sean Owen commented on SPARK-19644:
---

Those are just linked-list objects. Why is that count too high? If you have 
plenty of heap, Java won't bother GCing until it needs to.
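Whether those cons cells are a real leak depends on whether they remain strongly referenced after a full collection. A rough Python analogy (not JVM behavior, but the same reachability principle): a large instance count observed before a collection proves nothing, because unreferenced garbage vanishes once a collection actually runs.

```python
import gc

class Cons:
    """Stand-in for a cons cell like scala.collection.immutable.$colon$colon."""
    __slots__ = ("head", "tail")
    def __init__(self, head, tail):
        self.head, self.tail = head, tail

def live_cons():
    # Count Cons instances currently known to the collector.
    return sum(1 for o in gc.get_objects() if isinstance(o, Cons))

# Simulate many micro-batches, each building a list and then dropping it.
lst = None
for _ in range(100):
    lst = None
    for i in range(1000):
        lst = Cons(i, lst)

peak = live_cons()        # the last batch's cells are still referenced
lst = None                # drop the final reference
gc.collect()              # force a full collection
print(peak, live_cons())  # the count drops to 0; a real leak would not
```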







[jira] [Updated] (SPARK-19642) Improve the security guarantee for rest api and ui

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19642:
--
  Priority: Major  (was: Critical)
Issue Type: Task  (was: Improvement)

What are the potential holes here? I get the general idea, but is there 
anything specific? I just don't know whether this is actionable.

> Improve the security guarantee for rest api and ui
> --
>
> Key: SPARK-19642
> URL: https://issues.apache.org/jira/browse/SPARK-19642
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>
> As Spark gets more and more features, data may start leaking through other 
> places (e.g. SQL query plans, which are shown in the UI). Also, the current 
> REST API may be a security hole. Opening this JIRA to research and address 
> the potential security flaws.






[jira] [Updated] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19524:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

Yes; if anything, open a PR to improve the docs.

> newFilesOnly does not work according to docs. 
> --
>
> Key: SPARK-19524
> URL: https://issues.apache.org/jira/browse/SPARK-19524
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>Priority: Minor
>
> The docs say:
> newFilesOnly
> Should process only new files and ignore existing files in the directory
> It's not working. 
> http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files
>  says that it shouldn't work as expected. 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
>  is not at all clear about what the code is trying to do.
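For reference, the behavior the flag is meant to control can be sketched as a modification-time filter (a hypothetical, much-simplified model; the real FileInputDStream logic involves a remember window and ignore thresholds that this does not reproduce):

```python
def select_files(files, batch_time, window, new_files_only, start_time):
    """Pick files whose modification time falls inside the current window.

    `files` maps path -> mtime (seconds). With new_files_only, files
    modified before the stream started are ignored; otherwise the stream
    also looks back one full window at startup. Simplified sketch only.
    """
    threshold = start_time if new_files_only else max(0, batch_time - window)
    return sorted(p for p, mtime in files.items()
                  if threshold <= mtime <= batch_time)

files = {"old.txt": 10, "recent.txt": 95, "new.txt": 105}
print(select_files(files, 110, 60, True, 100))   # ['new.txt']
print(select_files(files, 110, 60, False, 100))  # ['new.txt', 'recent.txt']
```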






[jira] [Updated] (SPARK-19640) Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19640:
--
Target Version/s:   (was: 1.5.2)
   Fix Version/s: (was: 1.5.2)

Please read http://spark.apache.org/contributing.html first. I don't think 
we'd make any changes to 1.5 at this point, as there won't be more 1.5 
releases, so I wouldn't bother.

> Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2
> --
>
> Key: SPARK-19640
> URL: https://issues.apache.org/jira/browse/SPARK-19640
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Stephen Kinser
>Priority: Trivial
>  Labels: documentation, easyfix
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The Spark MLlib documentation for CountVectorizerModel in Spark 1.5.2 
> currently uses an import statement for a package path that does not exist:
> import org.apache.spark.ml.feature.CountVectorizer
> import org.apache.spark.mllib.util.CountVectorizerModel
> This should be revised to match Spark 1.6+:
> import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}






[jira] [Commented] (SPARK-19551) Theme for PySpark documentation could do with improving

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871475#comment-15871475
 ] 

Sean Owen commented on SPARK-19551:
---

Ping [~arthur-tacca]. If this is a doc build config change, that's easy; but 
what is the change?
If you mean someone should write a whole framework for doc themes, no, I don't 
think that's in scope for Spark.

> Theme for PySpark documentation could do with improving
> --
>
> Key: SPARK-19551
> URL: https://issues.apache.org/jira/browse/SPARK-19551
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 2.1.0
>Reporter: Arthur Tacca
>Priority: Minor
>
> I have found the Python Spark documentation hard to navigate, for two reasons:
> * Each page in the documentation is huge, because the whole of the 
> documentation is split into only a few chunks.
> * The methods of each class are not listed in a short form, so the only way 
> to look through them is to browse past the full documentation for all methods 
> (including parameter lists, examples, etc.).
> This has irritated someone enough that they have done [their own build of the 
> pyspark documentation|http://takwatanabe.me/pyspark/index.html]. In 
> comparison to the official docs they are a delight to use. But of course it 
> is not clear whether they'll be kept up to date, which is why I'm asking here 
> that the official docs are improved. Perhaps that site could be used as 
> inspiration? I don't know much about these things, but it appears that the 
> main change they have made is to switch to the "read the docs" theme.






[jira] [Resolved] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19449.
---
Resolution: Not A Problem

> Inconsistent results between ml package RandomForestClassificationModel and 
> mllib package RandomForestModel
> ---
>
> Key: SPARK-19449
> URL: https://issues.apache.org/jira/browse/SPARK-19449
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Aseem Bansal
>
> I worked on some code to convert ml package RandomForestClassificationModel 
> to mllib package RandomForestModel. It was needed because we need to make 
> predictions on the order of ms. I found that the results are inconsistent 
> although the underlying DecisionTreeModel are exactly the same. So the 
> behavior between the 2 implementations is inconsistent which should not be 
> the case.
> The below code can be used to reproduce the issue. Can run this as a simple 
> Java app as long as you have spark dependencies set up properly.
> {noformat}
> import org.apache.spark.ml.Transformer;
> import org.apache.spark.ml.classification.*;
> import org.apache.spark.ml.linalg.*;
> import org.apache.spark.ml.regression.RandomForestRegressionModel;
> import org.apache.spark.mllib.linalg.DenseVector;
> import org.apache.spark.mllib.linalg.Vector;
> import org.apache.spark.mllib.tree.configuration.Algo;
> import org.apache.spark.mllib.tree.model.DecisionTreeModel;
> import org.apache.spark.mllib.tree.model.RandomForestModel;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import scala.Enumeration;
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Random;
> abstract class Predictor {
> abstract double predict(Vector vector);
> }
> public class MainConvertModels {
> public static final int seed = 42;
> public static void main(String[] args) {
> int numRows = 1000;
> int numFeatures = 3;
> int numClasses = 2;
> double trainFraction = 0.8;
> double testFraction = 0.2;
> SparkSession spark = SparkSession.builder()
> .appName("conversion app")
> .master("local")
> .getOrCreate();
> Dataset<Row> data = getDummyData(spark, numRows, numFeatures, numClasses);
> Dataset<Row>[] splits = data.randomSplit(new double[]{trainFraction, 
> testFraction}, seed);
> Dataset<Row> trainingData = splits[0];
> Dataset<Row> testData = splits[1];
> testData.cache();
> List<Double> labels = getLabels(testData);
> List<Vector> features = getFeatures(testData);
> DecisionTreeClassifier classifier1 = new DecisionTreeClassifier();
> DecisionTreeClassificationModel model1 = 
> classifier1.fit(trainingData);
> final DecisionTreeModel convertedModel1 = 
> convertDecisionTreeModel(model1, Algo.Classification());
> RandomForestClassifier classifier = new RandomForestClassifier();
> RandomForestClassificationModel model2 = classifier.fit(trainingData);
> final RandomForestModel convertedModel2 = 
> convertRandomForestModel(model2);
> System.out.println(
> "** DecisionTreeClassifier\n" +
> "** Original **" + getInfo(model1, testData) + "\n" +
> "** New  **" + getInfo(new Predictor() {
> double predict(Vector vector) {return 
> convertedModel1.predict(vector);}
> }, labels, features) + "\n" +
> "\n" +
> "** RandomForestClassifier\n" +
> "** Original **" + getInfo(model2, testData) + "\n" +
> "** New  **" + getInfo(new Predictor() {double 
> predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, 
> features) + "\n" +
> "\n" +
> "");
> }
> static Dataset<Row> getDummyData(SparkSession spark, int numberRows, int 
> numberFeatures, int labelUpperBound) {
> StructType schema = new StructType(new StructField[]{
> new StructField("label", DataTypes.DoubleType, false, 
> Metadata.empty()),
> new StructField("features", new VectorUDT(), false, 
> Metadata.empty())
> });
> double[][] vectors = prepareData(numberRows, numberFeatures);
> Random random = new Random(seed);
> List<Row> dataTest = new ArrayList<>();
> for

[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871477#comment-15871477
 ] 

Deenbandhu Agarwal commented on SPARK-19644:


No, I had just given it a 2 GB heap, and if there were no references to them, 
a full GC should clean them up; but they are not getting cleaned, which is why 
I think there is a memory leak.







[jira] [Resolved] (SPARK-6100) Distributed linear algebra in PySpark/MLlib

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6100.
--
Resolution: Done

> Distributed linear algebra in PySpark/MLlib
> ---
>
> Key: SPARK-6100
> URL: https://issues.apache.org/jira/browse/SPARK-6100
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is an umbrella JIRA for the Python API of distributed linear algebra in 
> MLlib. The goal is to make Python API on par with the Scala/Java API. We 
> should try wrapping Scala implementations as much as possible, instead of 
> implementing them in Python.






[jira] [Updated] (SPARK-6227) PCA and SVD for PySpark

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6227:
-
Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-6100)

> PCA and SVD for PySpark
> ---
>
> Key: SPARK-6227
> URL: https://issues.apache.org/jira/browse/SPARK-6227
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.2.1
>Reporter: Julien Amelot
>Assignee: Manoj Kumar
>
> The Dimensionality Reduction techniques are not available via Python (Scala + 
> Java only).
> * Principal component analysis (PCA)
> * Singular value decomposition (SVD)
> Doc:
> http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html
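Until these are wrapped for Python, the underlying computation can be sketched locally with NumPy (a sketch of the math only, not the MLlib distributed API): the top right singular vectors of the centered data matrix are the principal components, and projecting onto them gives the reduced representation.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features

# SVD of the centered data: X_c = U @ diag(s) @ Vt.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Top-k right singular vectors = principal components (rows of Vt).
k = 2
components = Vt[:k]                    # shape (2, 5), orthonormal rows
reduced = Xc @ components.T            # shape (100, 2): PCA projection

# Relative rank-k reconstruction error; below 1 whenever k < rank.
err = np.linalg.norm(Xc - reduced @ components) / np.linalg.norm(Xc)
print(reduced.shape, err < 1.0)
```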






[jira] [Comment Edited] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871477#comment-15871477
 ] 

Deenbandhu Agarwal edited comment on SPARK-19644 at 2/17/17 9:07 AM:
-

No, I had just given 2 GB of driver memory, and if there were no references to 
them, a full GC should clean them up; but they are not getting cleaned, which 
is why I think there is a memory leak.


was (Author: deenbandhu):
No i had just given 2 GB heap and if there is not any reference to them the 
Full GC should clean them but it is not get cleaned thats why i think there is 
memory leak







[jira] [Updated] (SPARK-12042) Python API for mllib.stat.test.StreamingTest

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12042:
--
Affects Version/s: 1.6.0
   Issue Type: Improvement  (was: Sub-task)
   Parent: (was: SPARK-11937)

> Python API for mllib.stat.test.StreamingTest
> 
>
> Key: SPARK-12042
> URL: https://issues.apache.org/jira/browse/SPARK-12042
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.6.0
>Reporter: Yanbo Liang
>
> Python API for mllib.stat.test.StreamingTest.






[jira] [Resolved] (SPARK-14819) Improve the "SET" and "SET -v" command

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14819.
---
Resolution: Duplicate

> Improve the "SET" and "SET -v" command
> --
>
> Key: SPARK-14819
> URL: https://issues.apache.org/jira/browse/SPARK-14819
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Bo Meng
>
> Currently {{SET}} and {{SET -v}} commands are similar to Hive {{SET}} command 
> except the following difference:
> 1. The result is not sorted;
> 2. When using {{SET}} and {{SET -v}}, in addition to the Hive related 
> properties, it will also list all the system properties and environment 
> properties, which is very useful in some cases.
> This JIRA is trying to make the current {{SET}} command's output more 
> consistent with Hive's.






[jira] [Resolved] (SPARK-14754) Metrics as logs are not coming through slf4j

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14754.
---
Resolution: Won't Fix

> Metrics as logs are not coming through slf4j
> 
>
> Key: SPARK-14754
> URL: https://issues.apache.org/jira/browse/SPARK-14754
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.0, 1.6.1, 1.6.2
>Reporter: Monani Mihir
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Based on Codahale's metrics documentation, *Slf4jSink.scala* should include 
> a *class name* for log4j to print metrics in log files. The metric name is 
> missing from the current Slf4jSink.scala file. 
> Refer to this link: 
> https://dropwizard.github.io/metrics/3.1.0/manual/core/#man-core-reporters-slf4j






[jira] [Resolved] (SPARK-15024) NoClassDefFoundError in spark-examples due to Guava relocation

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15024.
---
Resolution: Not A Problem

Per the PR, it has actually not been reproducible in a long time, so I'm 
considering this fixed by something else.

> NoClassDefFoundError in spark-examples due to Guava relocation
> --
>
> Key: SPARK-15024
> URL: https://issues.apache.org/jira/browse/SPARK-15024
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.1, 1.6.2
> Environment: Built using:
> mvn package -DrecompileMode=all -Pbigtop-dist -Pyarn -Phadoop-2.6 -Phive 
> -Phive-thriftserver -Dprotobuf.version=2.5.0
>Reporter: Aaron Tokhy
>Priority: Minor
>  Labels: bigtop, example, guava
>
> Currently, the spark-examples submodule builds 2 main JAR files.  One is a 
> JAR file without any shaded/relocated dependencies 
> (target/spark-examples_.jar), and another is a 
> JAR file containing all of the shaded dependencies (under 
> target/scala-/spark-examples--hadoop.jar).
>   The shaded spark-examples JAR comes out to be around ~120MB and contains 
> many duplicates already found in the spark-assembly.
> Since spark-assembly shades AND relocates Guava (to org.spark-project.guava), 
> and most of its dependencies are already provided by spark-assembly, the 
> non-shaded spark-examples JAR is still unable to find Guava at runtime as no 
> relocations occur in the non shaded spark-examples.jar.
> An example of one failure (spark as built by the Bigtop distribution):
> {code}
> $ spark-submit --deploy-mode client --class 
> org.apache.spark.examples.sql.hive.HiveFromSpark 
> /usr/lib/spark/lib/spark-examples.jar
> ...
> ...
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/google/common/io/Files
>   at 
> org.apache.spark.examples.sql.hive.HiveFromSpark$.(HiveFromSpark.scala:35)
>   at 
> org.apache.spark.examples.sql.hive.HiveFromSpark$.(HiveFromSpark.scala)
>   at 
> org.apache.spark.examples.sql.hive.HiveFromSpark.main(HiveFromSpark.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: com.google.common.io.Files
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   ... 12 more
> Command exiting with ret '1'
> {code}






[jira] [Resolved] (SPARK-15992) Code cleanup mesos coarse backend offer evaluation workflow

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15992.
---
Resolution: Won't Fix

> Code cleanup mesos coarse backend offer evaluation workflow
> ---
>
> Key: SPARK-15992
> URL: https://issues.apache.org/jira/browse/SPARK-15992
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>  Labels: code-cleanup
>
> The offer acceptance workflow is a little hard to follow and not very 
> extensible for future considerations for offers. This is a patch that makes 
> the workflow a little more explicit in its handling of offer resources.
> Patch incoming






[jira] [Resolved] (SPARK-16526) Benchmarking Performance for Fast HashMap Implementations and Set Knobs

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16526.
---
Resolution: Won't Fix

> Benchmarking Performance for Fast HashMap Implementations and Set Knobs
> ---
>
> Key: SPARK-16526
> URL: https://issues.apache.org/jira/browse/SPARK-16526
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Qifan Pu
>
> Add benchmark results for two fast hashmap implementations. Set the rule on 
> how to pick between the two (or directly fall back to the 2nd-level hashmap) 
> based on benchmark results.






[jira] [Resolved] (SPARK-17605) Add option spark.usePython and spark.useR for applications that use both pyspark and sparkr

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17605.
---
Resolution: Won't Fix

> Add option spark.usePython and spark.useR for applications that use both 
> pyspark and sparkr
> ---
>
> Key: SPARK-17605
> URL: https://issues.apache.org/jira/browse/SPARK-17605
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Reporter: Jeff Zhang
>
> For applications (Zeppelin & Livy) that use both PySpark and SparkR, we have 
> to duplicate the code in SparkSubmit to figure out where the required 
> resources for PySpark & SparkR are (pyspark.zip, py4j, sparkr.zip and R files 
> in jars). It would be better to provide the options spark.usePython and 
> spark.useR, so that downstream projects that use both PySpark and SparkR in 
> one Spark application don't need to duplicate the code in SparkSubmit and can 
> just leverage these two options.






[jira] [Commented] (SPARK-19637) add to_json APIs to SQL

2017-02-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871491#comment-15871491
 ] 

Takeshi Yamamuro commented on SPARK-19637:
--

`from_json`, too. I'm not sure, but is there any reason we didn't add these 
functions to FunctionRegistry? cc: [~marmbrus]
I think it's easy to add, like: 
https://github.com/apache/spark/compare/master...maropu:SPARK-19637

> add to_json APIs to SQL
> ---
>
> Key: SPARK-19637
> URL: https://issues.apache.org/jira/browse/SPARK-19637
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>
> The method "to_json" is useful for turning a struct into a JSON string. It 
> currently doesn't work in SQL, but adding it should be trivial.






[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread BahaaEddin AlAila (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871493#comment-15871493
 ] 

BahaaEddin AlAila commented on SPARK-19646:
---

Thank you very much for the quick reply.
All I did was the following in spark-shell:
val x = sc.binaryRecords("binary_file.bin", 3073)
val t = x.take(3)
t(0)
t(1)
t(2)
// all returning the same array, even though they shouldn't be the same

In pyspark, I do the same:
x = sc.binaryRecords('binary_file.bin', 3073)
t = x.take(3)
t[0]
t[1]
t[2]
# different, legit results, verified manually as well.



> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Priority: Minor
>
> The Scala sc.binaryRecords replicates one record for the entire set.
> For example, I am trying to load the CIFAR binary data, where in one big 
> binary file each 3073-byte record represents a 32x32x3-byte image with 1 byte 
> for the label. The file resides on my local filesystem.
> .take(5) returns 5 records all the same, .collect() returns 10,000 records 
> all the same.
> What is puzzling is that the pyspark one works perfectly even though 
> underneath it is calling the scala implementation.
> I have tested this on 2.1.0 and 2.0.0






[jira] [Resolved] (SPARK-5929) Pyspark: Register a pip requirements file with spark_context

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5929.
--
Resolution: Won't Fix

> Pyspark: Register a pip requirements file with spark_context
> 
>
> Key: SPARK-5929
> URL: https://issues.apache.org/jira/browse/SPARK-5929
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: buckhx
>Priority: Minor
>
> I've been doing a lot of dependency work with shipping dependencies to 
> workers, as it is non-trivial for me to have my workers include the proper 
> dependencies in their own environments.
> To circumvent this, I added an addRequirementsFile() method that takes a pip 
> requirements file, downloads the packages, repackages them to be registered 
> with addPyFiles, and ships them to workers.
> Here is a comparison of what I've done on the Palantir fork 
> https://github.com/buckheroux/spark/compare/palantir:master...master






[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871494#comment-15871494
 ] 

Deenbandhu Agarwal commented on SPARK-19644:


And after 40-50 hours, full GC becomes so frequent that all cores of the 
machines are over-utilized, batches start to queue up in streaming, and I need 
to restart the streaming job.

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>Priority: Critical
>  Labels: memory_leak, performance
> Attachments: heapdump.png
>
>
> I am using streaming in production for some aggregation, fetching data 
> from Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 bytes 
> to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, instances of the class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.






[jira] [Resolved] (SPARK-15155) Optionally ignore default role resources

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15155.
---
Resolution: Won't Fix

> Optionally ignore default role resources
> 
>
> Key: SPARK-15155
> URL: https://issues.apache.org/jira/browse/SPARK-15155
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chris Heller
>
> SPARK-6284 added support for Mesos roles, but the framework will still accept 
> resources from both the reserved role specified in {{spark.mesos.role}} and 
> the default role {{*}}.
> I'd like to propose the addition of a new boolean property: 
> {{spark.mesos.ignoreDefaultRoleResources}}. When this property is set Spark 
> will only accept resources from the role passed in the {{spark.mesos.role}} 
> property. If {{spark.mesos.role}} has not been set, 
> {{spark.mesos.ignoreDefaultRoleResources}} has no effect.






[jira] [Resolved] (SPARK-16931) PySpark access to data-frame bucketing api

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16931.
---
Resolution: Won't Fix

> PySpark access to data-frame bucketing api
> --
>
> Key: SPARK-16931
> URL: https://issues.apache.org/jira/browse/SPARK-16931
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Greg Bowyer
>
> Attached is a patch that enables bucketing for pyspark using the dataframe 
> API.






[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871503#comment-15871503
 ] 

Sean Owen commented on SPARK-19644:
---

You only show 400MB of lists in your screen dump.
Running out of memory doesn't mean a leak. The question is what is holding on 
to the memory? This isn't a big heap to begin with.


> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>Priority: Critical
>  Labels: memory_leak, performance
> Attachments: heapdump.png
>
>
> I am using streaming in production for some aggregation, fetching data 
> from Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 bytes 
> to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, instances of the class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.






[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871515#comment-15871515
 ] 

Deenbandhu Agarwal commented on SPARK-19644:


Yes, that's right: running out of memory doesn't mean a leak, but a gradual 
increase in heap size and the inability of GC to clear the memory is a memory 
leak. Ideally the number of linked-list objects should not be increasing over 
time, and that increase suggests there is a memory leak.

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>Priority: Critical
>  Labels: memory_leak, performance
> Attachments: heapdump.png
>
>
> I am using streaming in production for some aggregation, fetching data 
> from Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 bytes 
> to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, instances of the class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.






[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871519#comment-15871519
 ] 

Sean Owen commented on SPARK-19644:
---

This is still not necessarily true in general. Your app could be retaining 
state. This is why this isn't actionable as-is.

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>Priority: Critical
>  Labels: memory_leak, performance
> Attachments: heapdump.png
>
>
> I am using streaming in production for some aggregation, fetching data 
> from Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 bytes 
> to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, instances of the class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.






[jira] [Comment Edited] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871528#comment-15871528
 ] 

Deenbandhu Agarwal edited comment on SPARK-19644 at 2/17/17 9:34 AM:
-

We are not using any state or window operations and not using any 
checkpointing, so I don't think the app is retaining state.


was (Author: deenbandhu):
We are not using any state or window operation and not using any check pointing 
so i don't think  that app is retaining state

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>Priority: Critical
>  Labels: memory_leak, performance
> Attachments: heapdump.png
>
>
> I am using streaming in production for some aggregation, fetching data 
> from Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 bytes 
> to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, instances of the class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.






[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Deenbandhu Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871528#comment-15871528
 ] 

Deenbandhu Agarwal commented on SPARK-19644:


We are not using any state or window operations and not using any 
checkpointing, so I don't think the app is retaining state.

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>Priority: Critical
>  Labels: memory_leak, performance
> Attachments: heapdump.png
>
>
> I am using streaming in production for some aggregation, fetching data 
> from Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 bytes 
> to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, instances of the class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.






[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread Hongyao Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871529#comment-15871529
 ] 

Hongyao Zhao commented on SPARK-19645:
--

It seems that this is related to the filesystem type.
None of the test cases that first stop the stream and then start it again 
throw this rename exception. 
But if you change from the local filesystem to HDFS, this exception is thrown.
Maybe we should add an overwrite option to the rename function when the 
filesystem is HDFS?

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug in the streaming-job restart process. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job. The 
> main reason is that on restart Spark recomputes the WAL offsets and generates 
> the same HDFS delta file (the latest delta file generated before the restart, 
> named "currentBatchId.delta"). In my opinion, this is a bug. If you consider 
> this a bug as well, I can fix it.






[jira] [Assigned] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19646:
-

Assignee: Sean Owen
Priority: Major  (was: Minor)

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Sean Owen
>
> The Scala sc.binaryRecords replicates one record for the entire set.
> For example, I am trying to load the CIFAR binary data, where in one big 
> binary file each 3073-byte record represents a 32x32x3-byte image with 1 byte 
> for the label. The file resides on my local filesystem.
> .take(5) returns 5 records all the same, .collect() returns 10,000 records 
> all the same.
> What is puzzling is that the pyspark one works perfectly even though 
> underneath it is calling the scala implementation.
> I have tested this on 2.1.0 and 2.0.0






[jira] [Reopened] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-19646:
---

Ah, I take it back. With that info I think this is in fact a problem. Although 
the problem is indeed due to Hadoop reusing Writables, this is not a case where 
the user is touching Writables: binaryRecords gets its byte[] from a 
BytesWritable, but the reference is the same every time, including the internal 
byte array. It needs to be copied. Simple fix.
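
The reuse-and-copy issue can be sketched without Hadoop. This is a minimal 
simulation: `BufferReuseDemo` and the in-memory records are hypothetical 
stand-ins for Hadoop's reused BytesWritable buffer, not Spark's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulates a reader that reuses one internal buffer per record,
// the way Hadoop reuses a single BytesWritable across records.
public class BufferReuseDemo {
    public static void main(String[] args) {
        byte[][] source = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
        byte[] reused = new byte[3]; // shared buffer, like BytesWritable's backing array

        List<byte[]> noCopy = new ArrayList<>();
        List<byte[]> withCopy = new ArrayList<>();
        for (byte[] record : source) {
            System.arraycopy(record, 0, reused, 0, 3); // reader overwrites the buffer
            noCopy.add(reused);                        // bug: every element aliases one array
            withCopy.add(Arrays.copyOf(reused, 3));    // fix: take a defensive copy
        }

        // Without copying, every "record" shows the last value read.
        System.out.println(Arrays.toString(noCopy.get(0)));   // [7, 8, 9]
        System.out.println(Arrays.toString(withCopy.get(0))); // [1, 2, 3]
    }
}
```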

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Priority: Minor
>
> The scala sc.binaryRecords replicates one record for the entire set.
> for example, I am trying to load the cifar binary data where in a big binary 
> file, each 3073 represents a 32x32x3 bytes image with 1 byte for the label 
> label. The file resides on my local filesystem.
> .take(5) returns 5 records all the same, .collect() returns 10,000 records 
> all the same.
> What is puzzling is that the pyspark one works perfectly even though 
> underneath it is calling the scala implementation.
> I have tested this on 2.1.0 and 2.0.0






[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871536#comment-15871536
 ] 

guifeng commented on SPARK-19645:
-

Spark's default Hadoop version is 2.2, whose rename method doesn't have an 
overwrite option, so we need to delete the file (if it exists) and then rename.
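
That delete-then-rename commit can be sketched as follows, using `java.nio` as 
a stand-in for the Hadoop FileSystem API (the file names and the `commit` 
helper are hypothetical illustrations):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of committing a temp state file when rename has no overwrite flag:
// remove any stale target first, then move the temp file into place.
public class CommitByRename {
    static void commit(Path temp, Path target) throws IOException {
        Files.deleteIfExists(target); // clear the leftover delta from the previous run
        Files.move(temp, target);     // rename into the final name
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("state");
        Path temp = Files.write(dir.resolve("temp-123"), "v2".getBytes());
        // A stale delta left behind by the job before it was restarted:
        Path target = Files.write(dir.resolve("2.delta"), "stale".getBytes());
        commit(temp, target);
        System.out.println(new String(Files.readAllBytes(target))); // v2
    }
}
```

Without the deleteIfExists step, Files.move (like a pre-2.6 Hadoop rename) 
fails because the target already exists.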

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug in the streaming-job restart process. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job. The 
> main reason is that on restart Spark recomputes the WAL offsets and generates 
> the same HDFS delta file (the latest delta file generated before the restart, 
> named "currentBatchId.delta"). In my opinion, this is a bug. If you consider 
> this a bug as well, I can fix it.






[jira] [Assigned] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19646:


Assignee: Sean Owen  (was: Apache Spark)

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Sean Owen
>
> The Scala sc.binaryRecords replicates one record for the entire set.
> For example, I am trying to load the CIFAR binary data, where in one big 
> binary file each 3073-byte record represents a 32x32x3-byte image with 1 byte 
> for the label. The file resides on my local filesystem.
> .take(5) returns 5 records all the same, .collect() returns 10,000 records 
> all the same.
> What is puzzling is that the pyspark one works perfectly even though 
> underneath it is calling the scala implementation.
> I have tested this on 2.1.0 and 2.0.0






[jira] [Assigned] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19646:


Assignee: Apache Spark  (was: Sean Owen)

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Apache Spark
>
> The Scala sc.binaryRecords replicates one record for the entire set.
> For example, I am trying to load the CIFAR binary data, where in one big 
> binary file each 3073-byte record represents a 32x32x3-byte image with 1 byte 
> for the label. The file resides on my local filesystem.
> .take(5) returns 5 records all the same, .collect() returns 10,000 records 
> all the same.
> What is puzzling is that the pyspark one works perfectly even though 
> underneath it is calling the scala implementation.
> I have tested this on 2.1.0 and 2.0.0






[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871542#comment-15871542
 ] 

Sean Owen commented on SPARK-19645:
---

Note that master requires Hadoop 2.6+ now.

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug in the streaming-job restart process. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can easily be reproduced by restarting a previous streaming job. The
> main reason is that on restart Spark recomputes the WAL offsets and generates
> the same HDFS delta file (the latest delta file generated before the restart,
> named "currentBatchId.delta"). In my opinion, this is a bug. If you consider
> it a bug as well, I can fix it.






[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871543#comment-15871543
 ] 

Apache Spark commented on SPARK-19646:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16974

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Sean Owen
>
> The Scala sc.binaryRecords replicates one record across the entire set.
> For example, I am trying to load the CIFAR binary data, where in one big
> binary file each 3073-byte record represents a 32x32x3-byte image plus 1
> byte for the label. The file resides on my local filesystem.
> .take(5) returns 5 identical records, and .collect() returns 10,000
> identical records.
> What is puzzling is that the PySpark version works perfectly even though
> underneath it calls the Scala implementation.
> I have tested this on 2.1.0 and 2.0.0.
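The symptom above is consistent with a record-buffer aliasing pitfall: Hadoop-style fixed-length record readers reuse one buffer for every record, so collecting the raw references yields N copies of the last record. The following is a minimal stdlib simulation of that pitfall, not Spark's actual code; `records_reusing_buffer` is a hypothetical stand-in for the underlying record reader.

```python
import io

def records_reusing_buffer(stream, record_len):
    """Yield fixed-length records, reusing a single bytearray the way a
    Hadoop record reader reuses its Writable. Collecting these without
    copying yields N references to the same (last) record."""
    buf = bytearray(record_len)
    while stream.readinto(buf) == record_len:
        yield buf  # the caller sees the same mutable buffer every time

data = bytes([1] * 3073) + bytes([2] * 3073) + bytes([3] * 3073)

aliased = list(records_reusing_buffer(io.BytesIO(data), 3073))
print(len(set(bytes(r) for r in aliased)))  # 1 -- every record is the last one

# Copying each record before collecting restores distinct records; the
# analogous Scala-side workaround would be to map a copy over each array.
copied = [bytes(r) for r in records_reusing_buffer(io.BytesIO(data), 3073)]
print(len(set(copied)))  # 3
```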






[jira] [Assigned] (SPARK-19533) Convert Java examples to use lambdas, Java 8 features

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19533:


Assignee: Sean Owen  (was: Apache Spark)

> Convert Java examples to use lambdas, Java 8 features
> -
>
> Key: SPARK-19533
> URL: https://issues.apache.org/jira/browse/SPARK-19533
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> As a subtask of the overall migration to Java 8, we can and probably should 
> use Java 8 lambdas to simplify the Java examples. I'm marking this as a 
> subtask in its own right because it's a pretty big change by lines.
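The conversion being described looks roughly like the following for a generic single-method interface (Spark's own `org.apache.spark.api.java.function` interfaces follow the same pattern); this is an illustrative sketch, not code from the Spark examples themselves.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LambdaExample {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "jira");

        // Java 7 style, as in the pre-migration examples: an anonymous
        // inner class implementing a single-method interface.
        Function<String, Integer> oldStyle = new Function<String, Integer>() {
            @Override
            public Integer apply(String s) {
                return s.length();
            }
        };

        // Java 8 style: the same function as a lambda
        // (or, equivalently, the method reference String::length).
        Function<String, Integer> newStyle = s -> s.length();

        List<Integer> lengths =
            words.stream().map(newStyle).collect(Collectors.toList());
        System.out.println(lengths);  // [5, 4]
    }
}
```

Because a lambda is only syntactic sugar for implementing the same interface, the converted examples still exercise the same APIs.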






[jira] [Assigned] (SPARK-19533) Convert Java examples to use lambdas, Java 8 features

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19533:


Assignee: Apache Spark  (was: Sean Owen)




[jira] [Commented] (SPARK-19533) Convert Java examples to use lambdas, Java 8 features

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871545#comment-15871545
 ] 

Apache Spark commented on SPARK-19533:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16961




[jira] [Issue Comment Deleted] (SPARK-19534) Convert Java tests to use lambdas, Java 8 features

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19534:
--
Comment: was deleted

(was: User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16961)

> Convert Java tests to use lambdas, Java 8 features
> --
>
> Key: SPARK-19534
> URL: https://issues.apache.org/jira/browse/SPARK-19534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> Likewise, Java tests can be simplified by use of Java 8 lambdas. This is a 
> significant sub-task in its own right. This shouldn't mean that 'old' APIs go 
> untested because there are no separate Java 8 APIs; it's just syntactic sugar 
> for calls to the same APIs.






[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871552#comment-15871552
 ] 

guifeng commented on SPARK-19645:
-

But Spark master doesn't set the rename options that support overwrite.




[jira] [Comment Edited] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871552#comment-15871552
 ] 

guifeng edited comment on SPARK-19645 at 2/17/17 9:50 AM:
--

But Spark master doesn't set the rename options that support overwrite, so I
think the rename will still fail.


was (Author: guifengl...@gmail.com):
But Spark master doesn't set the rename options that support overwrite.




[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread BahaaEddin AlAila (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871595#comment-15871595
 ] 

BahaaEddin AlAila commented on SPARK-19646:
---

Thank you very much for the speedy fix!




[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread BahaaEddin AlAila (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871597#comment-15871597
 ] 

BahaaEddin AlAila commented on SPARK-19646:
---

What's puzzling, though, is that I looked at PySpark's implementation of
binaryRecords, and it just calls _jsc.binaryRecords and wraps the result in a
PySpark RDD.
So, if it is indeed calling the Scala implementation, shouldn't PySpark have
the same problem?
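One plausible answer, sketched here with stdlib stand-ins rather than Spark's actual data path: every record shipped to Python is serialized first, and serialization materializes an independent copy, whereas a Scala-side collect can keep raw references to a reused buffer.

```python
import pickle

# One mutable buffer standing in for a reused Hadoop Writable.
buf = bytearray(b"record-1")

collected_refs = []     # what keeping raw references would collect
collected_via_ser = []  # what a serialize-per-record boundary would collect
for payload in (b"record-1", b"record-2", b"record-3"):
    buf[:] = payload            # the reader overwrites the same buffer
    collected_refs.append(buf)  # reference to the shared buffer
    # Crossing a serialization boundary (as records do on their way to
    # Python) snapshots the bytes into an independent object:
    collected_via_ser.append(pickle.loads(pickle.dumps(bytes(buf))))

print([bytes(r) for r in collected_refs])   # all show b'record-3'
print(collected_via_ser)                    # three distinct records
```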




[jira] [Comment Edited] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871552#comment-15871552
 ] 

guifeng edited comment on SPARK-19645 at 2/17/17 10:36 AM:
---

But Spark master doesn't set the rename options that support overwrite, so I
think the rename will still fail.
[~srowen] It still fails even with Hadoop 2.6+.


was (Author: guifengl...@gmail.com):
But Spark master doesn't set the rename options that support overwrite, so I
think the rename will still fail.




[jira] [Issue Comment Deleted] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guifeng updated SPARK-19645:

Comment: was deleted

(was: But Spark master doesn't set the rename options that support overwrite,
so I think the rename will still fail. It still fails even with Hadoop 2.6+.)




[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871627#comment-15871627
 ] 

guifeng commented on SPARK-19645:
-

But Spark master doesn't set the rename options that support overwrite, so I
think the rename will still fail. It still fails even with Hadoop 2.6+.







[jira] [Commented] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2017-02-17 Thread Mahmoud Rawas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871635#comment-15871635
 ] 

Mahmoud Rawas commented on SPARK-16920:
---

It seems that there is no N^2 complexity issue. As for the stress test, I have
added a guide on how to perform one, with some explanation of the fix; please
review the following gist and let me know if you would prefer any changes.

https://gist.github.com/mhmoudr/3681668f0ae56ca70cd95c8602f963e1

> Investigate and fix issues introduced in SPARK-15858
> 
>
> Key: SPARK-16920
> URL: https://issues.apache.org/jira/browse/SPARK-16920
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>
> There were several issues with the PR resolving SPARK-15858; my comments
> are available here:
> https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93
> The two most important issues are:
> 1. The PR did not add a stress test proving it resolved the issue it was 
> supposed to (though I have no doubt the optimization made is indeed correct).
> 2. The PR introduced quadratic prediction time in terms of the number of
> trees, where it was previously linear. Whether this causes problems for large
> numbers of trees (say, 1000) needs to be investigated, an appropriate test
> should be added, and the issue should then be fixed.
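Point 2 can be made concrete with a generic (non-Spark) sketch: computing staged ensemble predictions by re-summing all previous trees at each stage costs O(T^2) tree evaluations, while carrying a running sum costs O(T). The "trees" here are plain functions, purely for illustration; this is not a claim about the actual Spark code.

```python
# A fake ensemble of 1000 "trees", each a linear function of the input.
trees = [lambda x, i=i: 0.1 * i * x for i in range(1000)]
weights = [1.0] * len(trees)

def staged_predictions_quadratic(x):
    # Stage k recomputed from scratch: 1 + 2 + ... + T evaluations total.
    return [sum(w * t(x) for w, t in zip(weights[:k], trees[:k]))
            for k in range(1, len(trees) + 1)]

def staged_predictions_linear(x):
    # One evaluation per stage: a running sum over the trees.
    acc, out = 0.0, []
    for w, t in zip(weights, trees):
        acc += w * t(x)
        out.append(acc)
    return out

q = staged_predictions_quadratic(2.0)
lin = staged_predictions_linear(2.0)
assert all(abs(a - b) < 1e-9 for a, b in zip(q, lin))
```

Both versions produce the same staged values; only the number of tree evaluations differs, which is exactly the kind of regression a stress test over large T would catch.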






[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871680#comment-15871680
 ] 

Sean Owen commented on SPARK-19645:
---

I'm not suggesting it works with Hadoop 2.6; I'm responding to the comment 
above about lack of a rename method. It's available now.
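For context, the overwrite-capable rename referred to here is, as far as I understand, `FileContext.rename(src, dst, Options.Rename.OVERWRITE)` in Hadoop 2.x, whereas the older `FileSystem.rename` fails when the destination exists. A local-filesystem simulation of the two behaviors follows; the `rename` helper is hypothetical, not a Hadoop API.

```python
import os
import tempfile

def rename(src: str, dst: str, overwrite: bool = False) -> None:
    """Stand-in for the two Hadoop rename flavors: fail if dst exists
    (FileSystem.rename), or replace dst atomically (Options.Rename.OVERWRITE)."""
    if not overwrite and os.path.exists(dst):
        raise IOError(f"Failed to rename {src} to {dst}")
    os.replace(src, dst)

with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "2.delta")   # delta left by the previous run
    tmp = os.path.join(d, "temp-123")  # delta regenerated after restart
    with open(old, "w") as f:
        f.write("old")
    with open(tmp, "w") as f:
        f.write("new")
    try:
        rename(tmp, old)               # default semantics: fails
    except IOError as e:
        print(e)
    rename(tmp, old, overwrite=True)   # overwrite semantics: succeeds
    with open(old) as f:
        print(f.read())                # new
```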

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in product, however currently 
> there exists a bug refer to the process of streaming job restart. 
>   The following is  the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug is easy to reproduce by restarting a previous streaming job. The 
> main reason is that on restart Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file generated before 
> the restart, named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you also consider it a bug, I can fix it.






[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871678#comment-15871678
 ] 

Sean Owen commented on SPARK-19646:
---

I think it's because the array is copied anyway as it moves between the JVM 
and Python.

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Sean Owen
>
> The Scala sc.binaryRecords replicates one record across the entire set.
> For example, I am trying to load the CIFAR binary data, where in one big 
> binary file each 3073-byte record represents a 32x32x3-byte image plus 1 
> byte for the label. The file resides on my local filesystem.
> .take(5) returns 5 records that are all the same; .collect() returns 
> 10,000 records that are all the same.
> What is puzzling is that the PySpark version works perfectly, even though 
> underneath it calls the Scala implementation.
> I have tested this on 2.1.0 and 2.0.0.
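The symptom (every collected record identical) is consistent with a record reader reusing one buffer and the RDD keeping references to it, which would also explain why the copy on the Python path hides it. A minimal pure-Python sketch of that aliasing pitfall (illustrative only, not Spark code):

```python
# Sketch of the record-object-reuse pitfall that yields N identical records.
# A reader that reuses one buffer (like a Hadoop RecordReader) hands the
# caller the same object every time, so collecting without copying shows
# only the last record's bytes.
def read_records_reusing_buffer(data: bytes, record_len: int):
    buf = bytearray(record_len)          # one shared buffer, reused per record
    for i in range(0, len(data), record_len):
        buf[:] = data[i:i + record_len]  # overwrite in place
        yield buf                        # same object yielded each time

data = b"AABBCC"
bad = list(read_records_reusing_buffer(data, 2))
assert bad[0] is bad[1] is bad[2]        # all entries alias one buffer
assert bytes(bad[0]) == b"CC"            # ...showing only the last record

# Copying each record as it is produced restores the expected behavior.
good = [bytes(r) for r in read_records_reusing_buffer(data, 2)]
assert good == [b"AA", b"BB", b"CC"]
```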






[jira] [Commented] (SPARK-18091) Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit

2017-02-17 Thread Jose Soltren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871702#comment-15871702
 ] 

Jose Soltren commented on SPARK-18091:
--

FWIW, anyone pulling this fix in the future will also want 
https://github.com/apache/spark/pull/16244, or else a bunch of tests will fail.

> Deep if expressions cause Generated SpecificUnsafeProjection code to exceed 
> JVM code size limit
> ---
>
> Key: SPARK-18091
> URL: https://issues.apache.org/jira/browse/SPARK-18091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Kapil Singh
>Assignee: Kapil Singh
>Priority: Critical
> Fix For: 2.0.3, 2.1.0
>
>
> *Problem Description:*
> I have an application in which a lot of if-else decisioning is involved to 
> generate output. I'm getting following exception:
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:874)
>   at org.codehaus.janino.CodeContext.writeBranch(CodeContext.java:965)
>   at org.codehaus.janino.UnitCompiler.writeBranch(UnitCompiler.java:10261)
> *Steps to Reproduce:*
> I've come up with a unit test which I was able to run in 
> CodeGenerationSuite.scala:
> {code}
> test("split large if expressions into blocks due to JVM code size limit") {
>   val row = create_row("afafFAFFsqcategory2dadDADcategory8sasasadscategory24", 0)
>   val inputStr = 'a.string.at(0)
>   val inputIdx = 'a.int.at(1)
>   val length = 10
>   val valuesToCompareTo = for (i <- 1 to (length + 1)) yield ("category" + i)
>   val initCondition = EqualTo(RegExpExtract(inputStr, Literal("category1"), inputIdx), valuesToCompareTo(0))
>   var res: Expression = If(initCondition, Literal("category1"), Literal("NULL"))
>   var cummulativeCondition: Expression = Not(initCondition)
>   for (index <- 1 to length) {
>     val valueExtractedFromInput = RegExpExtract(inputStr, Literal("category" + (index + 1).toString), inputIdx)
>     val currComparee = If(cummulativeCondition, valueExtractedFromInput, Literal("NULL"))
>     val currCondition = EqualTo(currComparee, valuesToCompareTo(index))
>     val combinedCond = And(cummulativeCondition, currCondition)
>     res = If(combinedCond, If(combinedCond, valueExtractedFromInput, Literal("NULL")), res)
>     cummulativeCondition = And(Not(currCondition), cummulativeCondition)
>   }
>   val expressions = Seq(res)
>   val plan = GenerateUnsafeProjection.generate(expressions, true)
>   val actual = plan(row).toSeq(expressions.map(_.dataType))
>   val expected = Seq(UTF8String.fromString("category2"))
>   if (!checkResult(actual, expected)) {
>     fail(s"Incorrect Evaluation: expressions: $expressions, actual: $actual, expected: $expected")
>   }
> }
> {code}
> *Root Cause:*
> The current splitting of projection code doesn't (and can't) split the 
> generated code for an individual output column's expression, so that code 
> can grow to exceed the JVM limit.
> *Note:* This issue seems related to SPARK-14887, but I'm not sure whether 
> the root cause is the same.
>  
> *Proposed Fix:*
> The If expression should place its predicate, true-value, and false-value 
> expressions' generated code in separate methods in the codegen context and 
> call those methods instead of inlining the whole code in its own generated 
> code.
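The proposed fix amounts to method splitting in the generated code. As a language-agnostic illustration (a Python sketch, not Spark's actual codegen), a deep conditional can be built as a chain of small per-branch functions instead of one huge nested expression:

```python
# Sketch: instead of generating one huge nested conditional (which can exceed
# a per-method code-size limit), generate one small function per branch and
# chain them, mirroring the proposed per-expression method splitting.
def compile_categories(categories):
    def make_branch(category, fallback):
        def branch(value):
            # Each branch is its own small "method" calling the next one.
            return category if value == category else fallback(value)
        return branch

    chain = lambda value: "NULL"          # default when nothing matches
    for category in reversed(categories):
        chain = make_branch(category, chain)
    return chain

lookup = compile_categories([f"category{i}" for i in range(1, 12)])
assert lookup("category2") == "category2"
assert lookup("nope") == "NULL"
```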






[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871716#comment-15871716
 ] 

guifeng commented on SPARK-19645:
-

So, I think the current workaround is to delete the target file (if the 
delta file already exists) and then rename.
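The delete-then-rename workaround can be sketched against a local filesystem; the real code would use Hadoop's FileSystem API, and the paths and names below are illustrative. The explicit delete matters because, unlike POSIX rename, an HDFS rename fails when the destination already exists:

```python
import os
import tempfile

def commit_delta(temp_path: str, final_path: str) -> None:
    """Commit by delete-then-rename: if a delta file for the same batch was
    already written before the restart, remove it first so the rename can
    succeed (HDFS rename, unlike POSIX rename, fails when the destination
    exists)."""
    if os.path.exists(final_path):
        os.remove(final_path)
    os.rename(temp_path, final_path)

# Simulate a restart: the final delta file is left over, and a fresh temp
# file has just been written for the recomputed batch.
d = tempfile.mkdtemp()
final = os.path.join(d, "2.delta")        # illustrative names, not real paths
tmp = os.path.join(d, "temp-123")
with open(final, "w") as f:
    f.write("old")
with open(tmp, "w") as f:
    f.write("new")

commit_delta(tmp, final)
assert open(final).read() == "new"
assert not os.path.exists(tmp)
```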

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug in the streaming-job restart process.
>   The following is the concrete error message:
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug is easy to reproduce by restarting a previous streaming job. The 
> main reason is that on restart Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file generated before 
> the restart, named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you also consider it a bug, I can fix it.






[jira] [Created] (SPARK-19647) Spark query hive is extremely slow even the result data is small

2017-02-17 Thread wuchang (JIRA)
wuchang created SPARK-19647:
---

 Summary: Spark query hive is extremely slow even the result data 
is small
 Key: SPARK-19647
 URL: https://issues.apache.org/jira/browse/SPARK-19647
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.0.2
Reporter: wuchang
Priority: Critical


I am using Spark 2.0.0 to query a Hive table.

My SQL is:

select * from app.abtestmsg_v limit 10
Yes, I want to get the first 10 records from the view app.abtestmsg_v.

When I run this SQL in spark-shell, it is very fast, taking about 2 seconds.

But the problem comes when I try to implement this query in my Python code.

I am using Spark 2.0.0 and wrote a very simple PySpark program; the code is:

from pyspark.sql import HiveContext
from pyspark.sql.functions import *
import json
hc = HiveContext(sc)
hc.setConf("hive.exec.orc.split.strategy", "ETL")
hc.setConf("hive.security.authorization.enabled", "false")
zj_sql = 'select * from app.abtestmsg_v limit 10'
zj_df = hc.sql(zj_sql)
zj_df.collect()
From the info log I find that although I use "limit 10" to tell Spark that I 
just want the first 10 records, Spark still scans and reads all files of the 
view (in my case, the source data of this view contains 100 files, and each 
file's size is about 1G). So there are nearly 100 tasks, each task reads a 
file, and all the tasks are executed serially. It takes nearly 15 minutes to 
finish these 100 tasks, but what I want is just the first 10 records.

So I don't know what to do and what is wrong.

Could anybody give me some suggestions?






[jira] [Commented] (SPARK-19633) FileSource read from FileSink

2017-02-17 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871876#comment-15871876
 ] 

Liwei Lin commented on SPARK-19633:
---

Hi [~marmbrus], I'd like to take this if it's ok by you.

Just to confirm, you meant FileSource should also use MetadataLogFileIndex over 
InMemoryFileIndex whenever possible, right?

> FileSource read from FileSink
> -
>
> Key: SPARK-19633
> URL: https://issues.apache.org/jira/browse/SPARK-19633
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> Right now, you can't start a streaming query from a location that is being 
> written to by the file sink.






[jira] [Created] (SPARK-19648) Unable to access column containing '.' for approxQuantile function on DataFrame

2017-02-17 Thread John Compitello (JIRA)
John Compitello created SPARK-19648:
---

 Summary: Unable to access column containing '.' for approxQuantile 
function on DataFrame
 Key: SPARK-19648
 URL: https://issues.apache.org/jira/browse/SPARK-19648
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.0.2
 Environment: Running spark in an ipython prompt on Mac OSX. 
Reporter: John Compitello


It seems that the approxQuantile method does not offer any way to access a 
column with a period in its string name. I am aware of the backtick 
solution, but it does not work in this scenario.

For example, let's say I have a column named 'va.x'. Passing approxQuantile 
this string without backticks results in the following error:

'Cannot resolve column name '`va.x`' given input columns: '

Note that backticks seem to have been inserted automatically, but it cannot 
find the column name regardless.

If I do include backticks, I get a different error; an 
IllegalArgumentException is thrown as follows:

"IllegalArgumentException: 'Field "`va.x`" does not exist."



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19622:
-

Assignee: StanZhai

> Fix a http error in a paged table when using a `Go` button to search.
> -
>
> Key: SPARK-19622
> URL: https://issues.apache.org/jira/browse/SPARK-19622
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: StanZhai
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: screenshot-1.png
>
>
> The search function of the paged table is not available because we don't 
> strip the hash data from the request path.






[jira] [Resolved] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19622.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16953
[https://github.com/apache/spark/pull/16953]

> Fix a http error in a paged table when using a `Go` button to search.
> -
>
> Key: SPARK-19622
> URL: https://issues.apache.org/jira/browse/SPARK-19622
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: screenshot-1.png
>
>
> The search function of the paged table is not available because we don't 
> strip the hash data from the request path.






[jira] [Resolved] (SPARK-19647) Spark query hive is extremely slow even the result data is small

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19647.
---
Resolution: Invalid

Questions should go to u...@spark.apache.org

> Spark query hive is extremely slow even the result data is small
> -
>
> Key: SPARK-19647
> URL: https://issues.apache.org/jira/browse/SPARK-19647
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: wuchang
>Priority: Critical
>
> I am using Spark 2.0.0 to query a Hive table.
> My SQL is:
> select * from app.abtestmsg_v limit 10
> Yes, I want to get the first 10 records from the view app.abtestmsg_v.
> When I run this SQL in spark-shell, it is very fast, taking about 2 seconds.
> But the problem comes when I try to implement this query in my Python 
> code.
> I am using Spark 2.0.0 and wrote a very simple PySpark program; the code is:
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import *
> import json
> hc = HiveContext(sc)
> hc.setConf("hive.exec.orc.split.strategy", "ETL")
> hc.setConf("hive.security.authorization.enabled", "false")
> zj_sql = 'select * from app.abtestmsg_v limit 10'
> zj_df = hc.sql(zj_sql)
> zj_df.collect()
> From the info log I find that although I use "limit 10" to tell Spark that 
> I just want the first 10 records, Spark still scans and reads all files of 
> the view (in my case, the source data of this view contains 100 files, and 
> each file's size is about 1G). So there are nearly 100 tasks, each task 
> reads a file, and all the tasks are executed serially. It takes nearly 15 
> minutes to finish these 100 tasks, but what I want is just the first 10 
> records.
> So I don't know what to do and what is wrong.
> Could anybody give me some suggestions?






[jira] [Updated] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19622:
--
Fix Version/s: 2.1.1

> Fix a http error in a paged table when using a `Go` button to search.
> -
>
> Key: SPARK-19622
> URL: https://issues.apache.org/jira/browse/SPARK-19622
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: StanZhai
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
> Attachments: screenshot-1.png
>
>
> The search function of the paged table is not available because we don't 
> strip the hash data from the request path.






[jira] [Resolved] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19547.
---
Resolution: Invalid

> KafkaUtil throw 'No current assignment for partition' Exception
> ---
>
> Key: SPARK-19547
> URL: https://issues.apache.org/jira/browse/SPARK-19547
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 1.6.1
>Reporter: wuchang
>
> Below is my scala code to create spark kafka stream:
> val kafkaParams = Map[String, Object](
>   "bootstrap.servers" -> "server110:2181,server110:9092",
>   "zookeeper" -> "server110:2181",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> classOf[StringDeserializer],
>   "group.id" -> "example",
>   "auto.offset.reset" -> "latest",
>   "enable.auto.commit" -> (false: java.lang.Boolean)
> )
> val topics = Array("ABTest")
> val stream = KafkaUtils.createDirectStream[String, String](
>   ssc,
>   PreferConsistent,
>   Subscribe[String, String](topics, kafkaParams)
> )
> But after running for 10 hours, it throws this exception:
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:20,011 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:40,057 INFO  [JobGenerator] internals.AbstractCoordinator: 
> Successfully joined group example with generation 5
> 2017-02-10 10:56:40,058 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Setting newly assigned partitions [ABTest-1] for group example
> 2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error 
> generating jobs for time 148669538 ms
> java.lang.IllegalStateException: No current assignment for partition ABTest-0
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> at scala.Option.orElse(Option.scala:289)
> at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:246)
> at scala.util.

[jira] [Resolved] (SPARK-19593) Records read per each kinesis transaction

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19593.
---
Resolution: Invalid

> Records read per each kinesis transaction
> -
>
> Key: SPARK-19593
> URL: https://issues.apache.org/jira/browse/SPARK-19593
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Sarath Chandra Jiguru
>Priority: Trivial
>
> The question is related to the Spark Streaming + Kinesis integration.
> Is there a way to provide Kinesis consumer configuration, e.g., the number 
> of records read per transaction?
> Right now, even though I am eligible to read 2.8G/minute, I am restricted 
> to reading only 100MB/minute, because I am not able to increase the default 
> number of records in each transaction.
> I have raised a question in stackoverflow as well, please look into it:
> http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming
> Kinesis stream setup:
> open shards: 24
> write rate: 440K/minute
> read rate: 1.42M/minute
> read byte rate: 85 MB/minute. I am allowed to read around 2.8G/minute (24 
> shards * 2 MB * 60 seconds)
> Reference: 
> http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html
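The quoted read ceiling follows from Kinesis's documented 2 MB/s per-shard read limit, as this small check shows:

```python
# 24 open shards, each readable at 2 MB/s, over 60 seconds:
shards = 24
mb_per_shard_per_second = 2   # Kinesis per-shard read limit
read_ceiling_mb_per_minute = shards * mb_per_shard_per_second * 60
assert read_ceiling_mb_per_minute == 2880   # ~2.8 GB/minute, as quoted above
```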






[jira] [Commented] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell

2017-02-17 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871975#comment-15871975
 ] 

Dongjoon Hyun commented on SPARK-14194:
---

+1 for @srowen 's opinion.
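For context on Sean Owen's point: a quote-aware parser that consumes the raw character stream handles an embedded newline fine; the failure comes from splitting the input into lines before the CSV parser runs. A standalone Python sketch (not Spark code):

```python
import csv
import io

# A quote-aware CSV parser reading the raw character stream keeps a quoted
# field with an embedded newline inside one record...
raw = 'id,address\n"1","XZ Street\nMunicipality"\n'
rows = list(csv.reader(io.StringIO(raw)))
assert rows == [["id", "address"], ["1", "XZ Street\nMunicipality"]]

# ...whereas splitting the input into lines first (what a line-based text
# source does before the CSV parser ever sees the data) breaks that record
# in two, which is the behavior reported in this issue.
broken = [next(csv.reader([line])) for line in raw.splitlines()]
assert len(broken) == 3   # three "records" instead of two
```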

> spark csv reader not working properly if CSV content contains CRLF character 
> (newline) in the intermediate cell
> ---
>
> Key: SPARK-14194
> URL: https://issues.apache.org/jira/browse/SPARK-14194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.1.0
>Reporter: Kumaresh C R
>
> We have CSV content like below,
> Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
> "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF charater), 
> Municapality,","USA", "1234567"
> Since there is a '\n\r' character in the row middle (to be exact in the 
> Address Column), when we execute the below spark code, it tries to create the 
> dataframe with two rows (excluding header row), which is wrong. Since we have 
> specified delimiter as quote (") character , why it takes the middle 
> character as newline character ? This creates an issue while processing the 
> created dataframe.
>  DataFrame df = 
> sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv")
> .option("header", "true")
> .option("inferSchema", "true")
> .option("delimiter", delim)
> .option("quote", quote)
> .option("escape", escape)
> .load(sourceFile);
>






[jira] [Commented] (SPARK-19638) Filter pushdown not working for struct fields

2017-02-17 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872060#comment-15872060
 ] 

Nick Dimiduk commented on SPARK-19638:
--

[~maropu] I have placed a breakpoint in the ES connector's implementation of 
{{PrunedFilterScan#buildScan(Array[String], Array[Filter])}}. Here I see no 
filters for the struct columns. Indeed this is precisely where the log messages 
are produced. For this reason, I believe this to be an issue with Catalyst, not 
the connector. Perhaps you can guide me through further debugging? Thanks.

> Filter pushdown not working for struct fields
> -
>
> Key: SPARK-19638
> URL: https://issues.apache.org/jira/browse/SPARK-19638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Working with a dataset containing struct fields, and with debug logging 
> enabled in the ES connector, I'm seeing the following behavior. The 
> dataframe is created over the ES connector, and then the schema is extended 
> with a couple of column aliases, such as:
> {noformat}
> df.withColumn("f2", df("foo"))
> {noformat}
> Queries vs those alias columns work as expected for fields that are 
> non-struct members.
> {noformat}
> scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
> 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters 
> [IsNotNull(foo),EqualTo(foo,1)]
> 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL 
> [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
> {noformat}
> However, try the same with an alias over a struct field, and no filters are 
> pushed down.
> {noformat}
> scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == 
> '1'").limit(1).show
> {noformat}
> In fact, this is the case even when no alias is used at all.
> {noformat}
> scala> df.where("bar.baz == '1'").limit(1).show
> {noformat}
> Basically, pushdown for structs doesn't work at all.
> Maybe this is specific to the ES connector?






[jira] [Assigned] (SPARK-19522) --executor-memory flag doesn't work in local-cluster mode

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19522:


Assignee: Andrew Or  (was: Apache Spark)

> --executor-memory flag doesn't work in local-cluster mode
> -
>
> Key: SPARK-19522
> URL: https://issues.apache.org/jira/browse/SPARK-19522
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> bin/spark-shell --master local-cluster[2,1,2048]
> {code}
> doesn't do what you think it does. You'll end up getting executors with 1GB 
> each because that's the default. This is because the executor memory flag 
> isn't actually read in local mode.






[jira] [Commented] (SPARK-19522) --executor-memory flag doesn't work in local-cluster mode

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872109#comment-15872109
 ] 

Apache Spark commented on SPARK-19522:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/16975

> --executor-memory flag doesn't work in local-cluster mode
> -
>
> Key: SPARK-19522
> URL: https://issues.apache.org/jira/browse/SPARK-19522
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> bin/spark-shell --master local-cluster[2,1,2048]
> {code}
> doesn't do what you think it does. You'll end up getting executors with 1GB 
> each because that's the default. This is because the executor memory flag 
> isn't actually read in local mode.






[jira] [Assigned] (SPARK-19522) --executor-memory flag doesn't work in local-cluster mode

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19522:


Assignee: Apache Spark  (was: Andrew Or)

> --executor-memory flag doesn't work in local-cluster mode
> -
>
> Key: SPARK-19522
> URL: https://issues.apache.org/jira/browse/SPARK-19522
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> {code}
> bin/spark-shell --master local-cluster[2,1,2048]
> {code}
> doesn't do what you think it does. You'll end up getting executors with 1GB 
> each because that's the default. This is because the executor memory flag 
> isn't actually read in local mode.






[jira] [Resolved] (SPARK-19500) Fail to spill the aggregated hash map when radix sort is used

2017-02-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-19500.

   Resolution: Fixed
Fix Version/s: 2.2.0
   2.0.3
   2.1.1

Issue resolved by pull request 16844
[https://github.com/apache/spark/pull/16844]

> Fail to spill the aggregated hash map when radix sort is used
> -
>
> Key: SPARK-19500
> URL: https://issues.apache.org/jira/browse/SPARK-19500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.1.1, 2.0.3, 2.2.0
>
>
> Radix sort requires that at most half of the array be occupied. However, the 
> aggregated hash map has an off-by-one bug that can place one more item than half 
> of the array; when this happens, the spill fails as:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 
> in stage 10.0 failed 4 times, most recent failure: Lost task 171.3 in stage 
> 10.0 (TID 23899, 10.145.253.180, executor 24): 
> java.lang.IllegalStateException: There is no space for new record 
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:227)
>  
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:130)
>  
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:250)
>  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source) 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>  
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) 
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
>  
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) 
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) 
> at org.apache.spark.scheduler.Task.run(Task.scala:99) 
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace: 
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>  
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) 
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) 
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>  
> at scala.Option.foreach(Option.scala:257) 
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>  
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951) 
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:137)
>  
> ... 32 more 
> Caused by: java.lang.IllegalStateException: There is no space for new record 
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:227)
>  
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:130)
>  
> at 
> org.apache.spar

[jira] [Commented] (SPARK-19610) multi line support for CSV

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872270#comment-15872270
 ] 

Apache Spark commented on SPARK-19610:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16976

> multi line support for CSV
> --
>
> Key: SPARK-19610
> URL: https://issues.apache.org/jira/browse/SPARK-19610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-19610) multi line support for CSV

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19610:


Assignee: (was: Apache Spark)

> multi line support for CSV
> --
>
> Key: SPARK-19610
> URL: https://issues.apache.org/jira/browse/SPARK-19610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>







[jira] [Assigned] (SPARK-19610) multi line support for CSV

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19610:


Assignee: Apache Spark

> multi line support for CSV
> --
>
> Key: SPARK-19610
> URL: https://issues.apache.org/jira/browse/SPARK-19610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872319#comment-15872319
 ] 

Shixiong Zhu commented on SPARK-19645:
--

[~guifengl...@gmail.com]  Thanks for reporting. Could you submit a PR to fix it?

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug in the streaming job restart process. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
> The bug can easily be reproduced by restarting a previous streaming job. The 
> main reason is that on restart Spark recomputes the WAL offsets and generates 
> the same HDFS delta file (the latest delta file generated before the restart, 
> named "currentBatchId.delta"). In my opinion, this is a bug; if you agree that 
> it is a bug, I can fix it.
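The rename failure described above is an idempotency problem: after a restart, the recomputed batch targets a delta file the previous run already wrote. A minimal, hedged sketch of the fix idea, using plain java.nio paths and illustrative names rather than Spark's actual HDFSBackedStateStoreProvider code:

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// If the target "<batchId>.delta" already exists (written before the
// restart), treat the batch as committed instead of failing the rename.
def commitDelta(temp: Path, target: Path): Boolean = {
  if (Files.exists(target)) {
    Files.deleteIfExists(temp) // batch already committed by the previous run
    false                      // nothing new was committed
  } else {
    Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE)
    true
  }
}
```

Calling this twice for the same batch id leaves a single committed delta file instead of throwing on the second rename.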






[jira] [Resolved] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator

2017-02-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18986.

   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.2.0

> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its 
> iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that the 
> map's iterator is not null. However, the assertion only holds after the map 
> has been asked for its iterator. Before that, if another memory consumer asks 
> for more memory than is currently available, {{ExternalAppendOnlyMap.forceSpill}} 
> is also called. In this case, we will see a failure like this:
> {code}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:156)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info]   at 
> org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnly
> MapSuite.scala:294)
> {code}
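The failure mode above can be modeled with a toy class (illustrative names, not Spark's actual internals): the fix idea is to report that nothing was spilled when the iterator has not yet been created, rather than asserting.

```scala
// Toy model: forceSpill() is safe to call before iterator() has been used.
class SpillableMap {
  private var inMemoryIterator: Option[Iterator[Int]] = None

  def iterator: Iterator[Int] = {
    inMemoryIterator = Some(Iterator(1, 2, 3))
    inMemoryIterator.get
  }

  // Returns true if something was spilled. The buggy version asserted that
  // the iterator existed, and crashed when another memory consumer forced a
  // spill before iterator() was ever called.
  def forceSpill(): Boolean = inMemoryIterator match {
    case Some(_) => true  // spill through the existing iterator
    case None    => false // nothing to spill yet; do not assert
  }
}
```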






[jira] [Commented] (SPARK-19638) Filter pushdown not working for struct fields

2017-02-17 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872413#comment-15872413
 ] 

Nick Dimiduk commented on SPARK-19638:
--

Debugging. I'm looking at the match expression in 
[{{DataSourceStrategy#translateFilter(Expression)}}|https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L509].
 The predicate comes in as a {{EqualTo(GetStructField, Literal)}}. This doesn't 
match any of the cases. I was expecting it to step into the [{{case 
expressions.EqualTo(a: Attribute, Literal(v, t)) 
=>}}|https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L511]
 on line 511 but execution steps past this point. Upon investigation, 
{{GetStructField}} does not extend {{Attribute}}.

From this point, the {{EqualTo}} condition involving the struct field is 
dropped from the filter set pushed down to the ES connector. Thus I believe 
this is an issue in Spark, not in the connector.
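The type mismatch can be reproduced with a stripped-down model of the expression tree (simplified case classes, not Spark's real ones): because the struct accessor is not an {{Attribute}}, the {{EqualTo(a: Attribute, Literal(...))}} case never fires and the predicate is silently dropped.

```scala
sealed trait Expression
trait Attribute extends Expression
case class AttributeReference(name: String) extends Attribute
case class GetStructField(child: Expression, field: String) extends Expression
case class Literal(value: Any) extends Expression
case class EqualTo(left: Expression, right: Expression) extends Expression

// Simplified analogue of DataSourceStrategy#translateFilter: only
// top-level column references match the Attribute pattern.
def translateFilter(e: Expression): Option[String] = e match {
  case EqualTo(a: AttributeReference, Literal(v)) => Some(s"EqualTo(${a.name}, $v)")
  case _ => None // GetStructField is not an Attribute, so it falls through
}

val flat   = EqualTo(AttributeReference("foo"), Literal("1"))
val nested = EqualTo(GetStructField(AttributeReference("bar"), "baz"), Literal("1"))
```

The flat comparison translates; the nested one falls through to `None`, mirroring the dropped pushdown.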

> Filter pushdown not working for struct fields
> -
>
> Key: SPARK-19638
> URL: https://issues.apache.org/jira/browse/SPARK-19638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Working with a dataset containing struct fields, and enabling debug logging 
> in the ES connector, I'm seeing the following behavior. The dataframe is 
> created over the ES connector and then the schema is extended with a couple 
> column aliases, such as.
> {noformat}
> df.withColumn("f2", df("foo"))
> {noformat}
> Queries vs those alias columns work as expected for fields that are 
> non-struct members.
> {noformat}
> scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
> 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters 
> [IsNotNull(foo),EqualTo(foo,1)]
> 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL 
> [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
> {noformat}
> However, try the same with an alias over a struct field, and no filters are 
> pushed down.
> {noformat}
> scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == 
> '1'").limit(1).show
> {noformat}
> In fact, this is the case even when no alias is used at all.
> {noformat}
> scala> df.where("bar.baz == '1'").limit(1).show
> {noformat}
> Basically, pushdown for structs doesn't work at all.
> Maybe this is specific to the ES connector?






[jira] [Resolved] (SPARK-18285) approxQuantile in R support multi-column

2017-02-17 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-18285.
-
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.2.0

> approxQuantile in R support multi-column
> 
>
> Key: SPARK-18285
> URL: https://issues.apache.org/jira/browse/SPARK-18285
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: zhengruifeng
>Assignee: Yanbo Liang
> Fix For: 2.2.0
>
>
> approxQuantile in R should support multi-column.






[jira] [Resolved] (SPARK-19517) KafkaSource fails to initialize partition offsets

2017-02-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19517.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> KafkaSource fails to initialize partition offsets
> -
>
> Key: SPARK-19517
> URL: https://issues.apache.org/jira/browse/SPARK-19517
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Roberto Agostino Vitillo
>Priority: Blocker
> Fix For: 2.1.1, 2.2.0
>
> Attachments: SPARK-19517ProposalforfixingKafkaOffsetMetadata.pdf
>
>
> A Kafka source with many partitions can cause the check-pointing logic to 
> fail on restart. I got the following exception when trying to restart a 
> Structured Streaming app that reads from a Kafka topic with hundred 
> partitions.
> {code}
> 17/02/08 15:10:09 ERROR StreamExecution: Query [id = 
> 24e2a21a-4545-4a3e-80ea-bbe777d883ab, runId = 
> 025609c9-d59c-4de3-88b3-5d5f7eda4a66] terminated with error
> java.lang.IllegalArgumentException: Expected e.g. 
> {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got {"telemetry":{"92":302854
>   at 
> org.apache.spark.sql.kafka010.JsonUtils$.partitionOffsets(JsonUtils.scala:74)
>   at 
> org.apache.spark.sql.kafka010.KafkaSourceOffset$.apply(KafkaSourceOffset.scala:59)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:134)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:237)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets$lzycompute(KafkaSource.scala:138)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets(KafkaSource.scala:121)
>…
> {code}
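The exception above suggests the checkpointed offsets JSON was truncated mid-write. The expected shape is nested topic → partition → offset JSON, which can be sketched with a minimal hand-rolled serializer (illustrative only, not Spark's JsonUtils):

```scala
// Serialize Map(topic -> Map(partition -> offset)) into the expected shape,
// e.g. {"topicA":{"0":23,"1":-1}}. A topic with many partitions produces a
// long line; truncating it mid-write yields exactly the parse error above.
def offsetsToJson(offsets: Map[String, Map[Int, Long]]): String =
  offsets.map { case (topic, parts) =>
    val inner = parts.map { case (p, o) => s""""$p":$o""" }.mkString(",")
    s""""$topic":{$inner}"""
  }.mkString("{", ",", "}")
```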






[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-17 Thread Aaditya Ramesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872441#comment-15872441
 ] 

Aaditya Ramesh commented on SPARK-19525:


[~zsxwing] Actually, we are compressing the data in the RDDs, not the streaming 
metadata. We compress all records in a partition together and write them to our 
DFS. In our case, the snappy-compressed size of each RDD partition is around 18 
MB, with 84 partitions, for a total of 1.5 GB per RDD.
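The whole-partition compression described above can be sketched with the JDK's built-in deflater (the comment mentions snappy; java.util.zip is used here only to keep the sketch dependency-free):

```scala
import java.util.zip.{Deflater, Inflater}

// Compress all of a partition's serialized records as a single block,
// so the codec sees cross-record redundancy.
def compress(bytes: Array[Byte]): Array[Byte] = {
  val deflater = new Deflater()
  deflater.setInput(bytes)
  deflater.finish()
  val buf = new Array[Byte](bytes.length * 2 + 64)
  val n = deflater.deflate(buf)
  deflater.end()
  buf.take(n)
}

def decompress(bytes: Array[Byte], originalLen: Int): Array[Byte] = {
  val inflater = new Inflater()
  inflater.setInput(bytes)
  val out = new Array[Byte](originalLen)
  inflater.inflate(out)
  inflater.end()
  out
}
```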

> Enable Compression of Spark Streaming Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.






[jira] [Assigned] (SPARK-19517) KafkaSource fails to initialize partition offsets

2017-02-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-19517:


Assignee: Roberto Agostino Vitillo

> KafkaSource fails to initialize partition offsets
> -
>
> Key: SPARK-19517
> URL: https://issues.apache.org/jira/browse/SPARK-19517
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Roberto Agostino Vitillo
>Assignee: Roberto Agostino Vitillo
>Priority: Blocker
> Fix For: 2.1.1, 2.2.0
>
> Attachments: SPARK-19517ProposalforfixingKafkaOffsetMetadata.pdf
>
>
> A Kafka source with many partitions can cause the check-pointing logic to 
> fail on restart. I got the following exception when trying to restart a 
> Structured Streaming app that reads from a Kafka topic with hundred 
> partitions.
> {code}
> 17/02/08 15:10:09 ERROR StreamExecution: Query [id = 
> 24e2a21a-4545-4a3e-80ea-bbe777d883ab, runId = 
> 025609c9-d59c-4de3-88b3-5d5f7eda4a66] terminated with error
> java.lang.IllegalArgumentException: Expected e.g. 
> {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got {"telemetry":{"92":302854
>   at 
> org.apache.spark.sql.kafka010.JsonUtils$.partitionOffsets(JsonUtils.scala:74)
>   at 
> org.apache.spark.sql.kafka010.KafkaSourceOffset$.apply(KafkaSourceOffset.scala:59)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:134)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:237)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets$lzycompute(KafkaSource.scala:138)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets(KafkaSource.scala:121)
>…
> {code}






[jira] [Commented] (SPARK-19637) add to_json APIs to SQL

2017-02-17 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872470#comment-15872470
 ] 

Michael Armbrust commented on SPARK-19637:
--

From JSON is harder because the second argument is a StructType.  We could 
consider accepting a string in the DDL format for declaring a table's schema 
(i.e. {{a: Int, b: struct...}}).
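A toy parser for the flat case of such a schema string (hypothetical helper; a real parser would also need to handle nested {{struct<...>}} types, whose commas and colons break this naive split):

```scala
// Parse "a: Int, b: String" into (name, type) pairs. Flat fields only.
def parseFlatSchema(ddl: String): Seq[(String, String)] =
  ddl.split(",").toSeq.map { field =>
    val Array(name, tpe) = field.split(":").map(_.trim)
    (name, tpe)
  }
```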

> add to_json APIs to SQL
> ---
>
> Key: SPARK-19637
> URL: https://issues.apache.org/jira/browse/SPARK-19637
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>
> The method "to_json" is a useful method in turning a struct into a json 
> string. It currently doesn't work in SQL, but adding it should be trivial.






[jira] [Closed] (SPARK-19640) Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2

2017-02-17 Thread Stephen Kinser (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Kinser closed SPARK-19640.
--
Resolution: Won't Fix

> Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2
> --
>
> Key: SPARK-19640
> URL: https://issues.apache.org/jira/browse/SPARK-19640
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Stephen Kinser
>Priority: Trivial
>  Labels: documentation, easyfix
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Spark MLlib documentation for CountVectorizerModel in Spark 1.5.2 currently 
> uses an import statement with a package path that does not exist: 
> import org.apache.spark.ml.feature.CountVectorizer
> import org.apache.spark.mllib.util.CountVectorizerModel
> This should be revised to match what Spark 1.6+ uses:
> import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}






[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-17 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872652#comment-15872652
 ] 

Shixiong Zhu commented on SPARK-19525:
--

Hm, Spark should support compression for data in RDDs. Which code path did you 
find that does not compress data?

> Enable Compression of Spark Streaming Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.






[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872672#comment-15872672
 ] 

Shixiong Zhu commented on SPARK-19644:
--

[~deenbandhu] Could you check the GC root, please? These objects are from Scala 
reflection. Did you run the job in Spark shell?

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>Priority: Critical
>  Labels: memory_leak, performance
> Attachments: heapdump.png
>
>
> I am using streaming in production for some aggregation, fetching data 
> from Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1,161,216 bytes 
> to 1,397,760 bytes over a period of six hours.
> After 50 hours of processing, the number of instances of class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is 
> huge. 
> I think this is a clear case of a memory leak.






[jira] [Commented] (SPARK-19649) Spark YARN client throws exception if job succeeds and max-completed-applications=0

2017-02-17 Thread Joshua Caplan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872695#comment-15872695
 ] 

Joshua Caplan commented on SPARK-19649:
---

Hadoop encountered the same situation and fixed it in their client.

> Spark YARN client throws exception if job succeeds and 
> max-completed-applications=0
> ---
>
> Key: SPARK-19649
> URL: https://issues.apache.org/jira/browse/SPARK-19649
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3
> Environment: EMR release label 4.8.x
>Reporter: Joshua Caplan
>Priority: Minor
>
> I believe the patch in SPARK-3877 created a new race condition between YARN 
> and the Spark client.
> I typically configure YARN not to keep *any* recent jobs in memory, as some 
> of my jobs get pretty large.
> {code}
> yarn-site yarn.resourcemanager.max-completed-applications 0
> {code}
> The once-per-second call to getApplicationReport may thus encounter a RUNNING 


[jira] [Created] (SPARK-19649) Spark YARN client throws exception if job succeeds and max-completed-applications=0

2017-02-17 Thread Joshua Caplan (JIRA)
Joshua Caplan created SPARK-19649:
-

 Summary: Spark YARN client throws exception if job succeeds and 
max-completed-applications=0
 Key: SPARK-19649
 URL: https://issues.apache.org/jira/browse/SPARK-19649
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.6.3
 Environment: EMR release label 4.8.x
Reporter: Joshua Caplan
Priority: Minor


I have configured YARN not to keep *any* recent jobs in memory, as some of my 
jobs get pretty large.

{code}
yarn-site   yarn.resourcemanager.max-completed-applications 0
{code}

The once-per-second call to getApplicationReport may thus see the application 
as RUNNING on one poll and find it already gone on the next, and so report a 
false negative.
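The race can be sketched as a small state machine over successive application 
reports. This is an illustrative model only (the {{Report}} states and 
{{finalVerdict}} helper below are hypothetical, not YARN or Spark API): once a 
RUNNING report has been seen, a later not-found report is ambiguous, and 
treating it as a kill is exactly the false negative described above.

```scala
object YarnRaceSketch {
  // Simplified stand-ins for what getApplicationReport can return.
  sealed trait Report
  case object Running extends Report
  case object NotFound extends Report
  final case class Finished(success: Boolean) extends Report

  // Poll reports until a terminal verdict can be given.
  def finalVerdict(reports: Iterator[Report]): String = {
    var sawRunning = false
    var verdict: Option[String] = None
    while (reports.hasNext && verdict.isEmpty) {
      reports.next() match {
        case Finished(true)  => verdict = Some("SUCCEEDED")
        case Finished(false) => verdict = Some("FAILED")
        case NotFound if sawRunning =>
          // Ambiguous: with max-completed-applications=0 the RM may have
          // evicted a SUCCEEDED app immediately, or the app was killed.
          // Reporting "killed" here is the false negative.
          verdict = Some("UNKNOWN (completed or killed)")
        case NotFound => verdict = Some("NOT_FOUND")
        case Running  => sawRunning = true
      }
    }
    verdict.getOrElse("UNKNOWN")
  }
}
```

The client currently collapses the ambiguous branch into "killed"; the sketch 
just makes that branch visible.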

(typical) Executor log:
{code}
17/01/09 19:31:23 INFO ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
17/01/09 19:31:23 INFO SparkContext: Invoking stop() from shutdown hook
17/01/09 19:31:24 INFO SparkUI: Stopped Spark web UI at http://10.0.0.168:37046
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Shutting down all executors
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Asking each executor to 
shut down
17/01/09 19:31:24 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/01/09 19:31:24 INFO MemoryStore: MemoryStore cleared
17/01/09 19:31:24 INFO BlockManager: BlockManager stopped
17/01/09 19:31:24 INFO BlockManagerMaster: BlockManagerMaster stopped
17/01/09 19:31:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
17/01/09 19:31:24 INFO SparkContext: Successfully stopped SparkContext
17/01/09 19:31:24 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
17/01/09 19:31:24 INFO AMRMClientImpl: Waiting for application to be 
successfully unregistered.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
{code}

Client log:
{code}
17/01/09 19:31:23 INFO Client: Application report for 
application_1483983939941_0056 (state: RUNNING)
17/01/09 19:31:24 ERROR Client: Application application_1483983939941_0056 not 
found.
Exception in thread "main" org.apache.spark.SparkException: Application 
application_1483983939941_0056 is killed
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1038)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}






[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-17 Thread Aaditya Ramesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872698#comment-15872698
 ] 

Aaditya Ramesh commented on SPARK-19525:


We are suggesting compressing the data only when writing the checkpoint, not 
in memory. This is not happening right now: we just serialize the elements in 
the partition one by one and add them to the serialization stream, as seen in 
{{ReliableCheckpointRDD.writePartitionToCheckpointFile}}:

{code}
val fileOutputStream = if (blockSize < 0) {
  fs.create(tempOutputPath, false, bufferSize)
} else {
  // This is mainly for testing purpose
  fs.create(tempOutputPath, false, bufferSize,
    fs.getDefaultReplication(fs.getWorkingDirectory), blockSize)
}
val serializer = env.serializer.newInstance()
val serializeStream = serializer.serializeStream(fileOutputStream)
Utils.tryWithSafeFinally {
  serializeStream.writeAll(iterator)
} {
  serializeStream.close()
}
{code}

As you can see, we don't do any compression after the serialization step. In 
our patch, we just use a CompressionCodec to wrap the serialization stream in 
the codec's output stream, and do the corresponding unwrapping on the read 
path.
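A minimal, self-contained sketch of that wrapping, with {{java.util.zip}} GZIP 
streams standing in for Spark's CompressionCodec (the object and method names 
here are illustrative, not the actual patch):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object CompressedCheckpointSketch {
  // Write path: wrap the raw output stream in a compression stream
  // *before* handing it to the serializer, mirroring how the patch
  // wraps the checkpoint file stream in the codec's output stream.
  def write(records: Seq[String]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(new GZIPOutputStream(bytes))
    try records.foreach(out.writeObject) finally out.close()
    bytes.toByteArray
  }

  // Read path: apply the same wrapping in reverse order.
  def read(data: Array[Byte], n: Int): Seq[String] = {
    val in = new ObjectInputStream(
      new GZIPInputStream(new ByteArrayInputStream(data)))
    try (1 to n).map(_ => in.readObject().asInstanceOf[String]) finally in.close()
  }
}
```

The only change relative to the current code path is the extra stream wrapped 
around the file output stream on write, and around the input stream on read.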

> Enable Compression of Spark Streaming Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.






[jira] [Updated] (SPARK-19649) Spark YARN client throws exception if job succeeds and max-completed-applications=0

2017-02-17 Thread Joshua Caplan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Caplan updated SPARK-19649:
--
Description: 
I believe the patch in SPARK-3877 created a new race condition between YARN and 
the Spark client.

I typically configure YARN not to keep *any* recent jobs in memory, as some of 
my jobs get pretty large.

{code}
yarn-site   yarn.resourcemanager.max-completed-applications 0
{code}

The once-per-second call to getApplicationReport may thus see the application 
as RUNNING on one poll and find it already gone on the next, and so report a 
false negative.

(typical) Executor log:
{code}
17/01/09 19:31:23 INFO ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
17/01/09 19:31:23 INFO SparkContext: Invoking stop() from shutdown hook
17/01/09 19:31:24 INFO SparkUI: Stopped Spark web UI at http://10.0.0.168:37046
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Shutting down all executors
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Asking each executor to 
shut down
17/01/09 19:31:24 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/01/09 19:31:24 INFO MemoryStore: MemoryStore cleared
17/01/09 19:31:24 INFO BlockManager: BlockManager stopped
17/01/09 19:31:24 INFO BlockManagerMaster: BlockManagerMaster stopped
17/01/09 19:31:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
17/01/09 19:31:24 INFO SparkContext: Successfully stopped SparkContext
17/01/09 19:31:24 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
17/01/09 19:31:24 INFO AMRMClientImpl: Waiting for application to be 
successfully unregistered.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
{code}

Client log:
{code}
17/01/09 19:31:23 INFO Client: Application report for 
application_1483983939941_0056 (state: RUNNING)
17/01/09 19:31:24 ERROR Client: Application application_1483983939941_0056 not 
found.
Exception in thread "main" org.apache.spark.SparkException: Application 
application_1483983939941_0056 is killed
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1038)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

  was:
I have configured YARN not to keep *any* recent jobs in memory, as some of my 
jobs get pretty large.

{code}
yarn-site   yarn.resourcemanager.max-completed-applications 0
{code}

The once-per-second call to getApplicationReport may thus encounter a RUNNING 
application followed by a not found application, and report a false negative.

(typical) Executor log:
{code}
17/01/09 19:31:23 INFO ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
17/01/09 19:31:23 INFO SparkContext: Invoking stop() from shutdown hook
17/01/09 19:31:24 INFO SparkUI: Stopped Spark web UI at http://10.0.0.168:37046
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Shutting down all executors
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Asking each executor to 
shut down
17/01/09 19:31:24 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/01/09 19:31:24 INFO MemoryStore: MemoryStore cleared
17/01/09 19:31:24 INFO BlockManager: BlockManager stopped
17/01/09 19:31:24 INFO BlockManagerMaster: BlockManagerMaster stopped
17/01/09 19:31:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
17/01/09 19:31:24 INFO SparkContext: Successfully stopped SparkContext
17/01/09 19:31:24 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
17/01/09 19:31:24 INFO AMRMClientImpl: Waiting for application to be 

[jira] [Updated] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19525:
-
Component/s: (was: Structured Streaming)
 Spark Core

> Enable Compression of Spark Streaming Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.






[jira] [Updated] (SPARK-19525) Enable Compression of RDD Checkpoints

2017-02-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19525:
-
Summary: Enable Compression of RDD Checkpoints  (was: Enable Compression of 
Spark Streaming Checkpoints)

> Enable Compression of RDD Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.






[jira] [Commented] (SPARK-19525) Enable Compression of RDD Checkpoints

2017-02-17 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872717#comment-15872717
 ] 

Shixiong Zhu commented on SPARK-19525:
--

I see. This is RDD checkpointing. Sounds like a good idea.

> Enable Compression of RDD Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.






[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails

2017-02-17 Thread Joshua Caplan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872720#comment-15872720
 ] 

Joshua Caplan commented on SPARK-3877:
--

Done, as SPARK-19649.

> The exit code of spark-submit is still 0 when an yarn application fails
> ---
>
> Key: SPARK-3877
> URL: https://issues.apache.org/jira/browse/SPARK-3877
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>  Labels: yarn
> Fix For: 1.1.1, 1.2.0
>
>
> When a YARN application fails (yarn-cluster mode), the exit code of 
> spark-submit is still 0. This makes it hard to write automated scripts that 
> run Spark jobs on YARN, because the failure cannot be detected in those 
> scripts.






[jira] [Commented] (SPARK-19525) Enable Compression of RDD Checkpoints

2017-02-17 Thread Aaditya Ramesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872728#comment-15872728
 ] 

Aaditya Ramesh commented on SPARK-19525:


Great, I will get the patch ready.

> Enable Compression of RDD Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.






[jira] [Created] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job

2017-02-17 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-19650:
--

 Summary: Metastore-only operations shouldn't trigger a spark job
 Key: SPARK-19650
 URL: https://issues.apache.org/jira/browse/SPARK-19650
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Sameer Agarwal


We currently trigger a Spark job even for simple metastore operations ({{SHOW 
TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}}, etc.). Even though these 
operations otherwise execute on the driver, the job prevents a user from 
running them on a driver-only cluster.






[jira] [Assigned] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job

2017-02-17 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-19650:
-

Assignee: Sameer Agarwal

> Metastore-only operations shouldn't trigger a spark job
> ---
>
> Key: SPARK-19650
> URL: https://issues.apache.org/jira/browse/SPARK-19650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>
> We currently trigger a Spark job even for simple metastore operations 
> ({{SHOW TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}}, etc.). Even though 
> these operations otherwise execute on the driver, the job prevents a user 
> from running them on a driver-only cluster.






[jira] [Created] (SPARK-19651) ParallelCollectionRDD.colect should not issuse a Spark job

2017-02-17 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19651:
---

 Summary: ParallelCollectionRDD.colect should not issuse a Spark job
 Key: SPARK-19651
 URL: https://issues.apache.org/jira/browse/SPARK-19651
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Updated] (SPARK-19651) ParallelCollectionRDD.collect should not issuse a Spark job

2017-02-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19651:

Summary: ParallelCollectionRDD.collect should not issuse a Spark job  (was: 
ParallelCollectionRDD.colect should not issuse a Spark job)

> ParallelCollectionRDD.collect should not issuse a Spark job
> ---
>
> Key: SPARK-19651
> URL: https://issues.apache.org/jira/browse/SPARK-19651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Updated] (SPARK-19651) ParallelCollectionRDD.collect should not issue a Spark job

2017-02-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19651:

Summary: ParallelCollectionRDD.collect should not issue a Spark job  (was: 
ParallelCollectionRDD.collect should not issuse a Spark job)

> ParallelCollectionRDD.collect should not issue a Spark job
> --
>
> Key: SPARK-19651
> URL: https://issues.apache.org/jira/browse/SPARK-19651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-19651) ParallelCollectionRDD.collect should not issue a Spark job

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19651:


Assignee: Apache Spark  (was: Wenchen Fan)

> ParallelCollectionRDD.collect should not issue a Spark job
> --
>
> Key: SPARK-19651
> URL: https://issues.apache.org/jira/browse/SPARK-19651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>






