[jira] [Comment Edited] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094607#comment-16094607 ] Aseem Bansal edited comment on SPARK-21483 at 7/20/17 12:29 PM: Some pseudocode to show what I am trying to achieve {code:java} class MyTransformer implements Serializable { public FeaturesAndLabel transform(RawData rawData) { //Some logic which creates Features and Labels from raw data. Raw data is just a Java bean //FeaturesAndLabel is a bean which contains a SparseVector as features, and a double as label } } {code} {code:java} Dataset<RawData> dataset = //read from somewhere and create Dataset of RawData bean Dataset<FeaturesAndLabel> featuresAndLabels = dataset.transform(new MyTransformer()::transform) //use features and labels for machine learning {code} was (Author: anshbansal): Some pseudocode to show what I am trying to achieve {code:java} class MyTransformer implements Serializable { public FeaturesAndLabel transform(RawData rawData) { //Some logic which creates Features and Labels from raw data //FeaturesAndLabel is a bean which contains a SparseVector as features, and a double as label } } {code} {code:java} Dataset<RawData> dataset = //read from somewhere and create Dataset of RawData bean Dataset<FeaturesAndLabel> featuresAndLabels = dataset.transform(new MyTransformer()::transform) //use features and labels for machine learning {code} > Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
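For concreteness, one way to realize the pseudocode above today is to make the result bean fully bean-compliant by hand, using a double[] where the non-compliant Vector would go, and to run the row-wise step through Dataset.map with an explicit bean encoder. This is a sketch under those assumptions, not a recommended Spark API; FeaturesAndLabel, RawData, and MyTransformer are the hypothetical classes from the comment above.

{code:java}
import java.io.Serializable;

// Bean-compliant stand-in: double[] replaces ml.linalg.Vector,
// which Encoders.bean cannot currently handle.
public class FeaturesAndLabel implements Serializable {
    private double[] features;
    private double label;

    public FeaturesAndLabel() {} // Encoders.bean needs a no-arg constructor

    public double[] getFeatures() { return features; }
    public void setFeatures(double[] features) { this.features = features; }
    public double getLabel() { return label; }
    public void setLabel(double label) { this.label = label; }
}
{code}

The row-wise step would then go through map rather than transform, since map is the per-record operation that takes an encoder:

{code:java}
// RawData and MyTransformer are the hypothetical beans from the thread.
Dataset<FeaturesAndLabel> featuresAndLabels = dataset.map(
        (MapFunction<RawData, FeaturesAndLabel>) new MyTransformer()::transform,
        Encoders.bean(FeaturesAndLabel.class));
{code}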
[jira] [Commented] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094607#comment-16094607 ] Aseem Bansal commented on SPARK-21483: -- Some pseudocode to show what I am trying to achieve {code:java} class MyTransformer implements Serializable { public FeaturesAndLabel transform(RawData rawData) { //Some logic which creates Features and Labels from raw data //FeaturesAndLabel is a bean which contains a SparseVector as features, and a double as label } } {code} {code:java} Dataset<RawData> dataset = //read from somewhere and create Dataset of RawData bean Dataset<FeaturesAndLabel> featuresAndLabels = dataset.transform(new MyTransformer()::transform) //use features and labels for machine learning {code} > Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094314#comment-16094314 ] Aseem Bansal edited comment on SPARK-21483 at 7/20/17 9:11 AM: --- No, it does not. Can you give a link to what you are referring to? And I am not using Spark SQL. I am using Dataset's transformations only. was (Author: anshbansal): Now it does not. Can you give a link to what you are referring to? And I am not using Spark SQL. I am using Dataset's transformations only. > Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21482) Make LabeledPoint bean-compliant so it can be used in Encoders.bean(LabeledPoint.class)
[ https://issues.apache.org/jira/browse/SPARK-21482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094315#comment-16094315 ] Aseem Bansal commented on SPARK-21482: -- There is a LabeledPoint in the new ml API too: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.feature.LabeledPoint I am able to work around this by using my own class. But I thought the ML package was supposed to be used with the Dataset API. That's why I am saying it should support this. > Make LabeledPoint bean-compliant so it can be used in > Encoders.bean(LabeledPoint.class) > --- > > Key: SPARK-21482 > URL: https://issues.apache.org/jira/browse/SPARK-21482 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The LabeledPoint class is currently not bean-compliant as per Spark > https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.feature.LabeledPoint > This makes it impossible to create a LabeledPoint via a dataset.transform. It > should be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
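For reference, ml.feature.LabeledPoint is a Scala case class (label, features) with no no-arg constructor and no getter/setter pairs, which is why Encoders.bean rejects it. A minimal sketch of the kind of hand-rolled workaround class mentioned above, assuming a double[] stands in for the Vector features:

{code:java}
import java.io.Serializable;

// Hypothetical bean-compliant substitute for ml.feature.LabeledPoint;
// the class name and the double[] features field are assumptions.
public class BeanLabeledPoint implements Serializable {
    private double label;
    private double[] features;

    public BeanLabeledPoint() {} // required: Encoders.bean needs a no-arg constructor

    public double getLabel() { return label; }
    public void setLabel(double label) { this.label = label; }
    public double[] getFeatures() { return features; }
    public void setFeatures(double[] features) { this.features = features; }
}
// Encoders.bean(BeanLabeledPoint.class) then works where
// Encoders.bean(LabeledPoint.class) throws.
{code}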
[jira] [Commented] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094314#comment-16094314 ] Aseem Bansal commented on SPARK-21483: -- Now it does not. Can you give a link to what you are referring to? And I am not using Spark SQL. I am using Dataset's transformations only. > Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094297#comment-16094297 ] Aseem Bansal commented on SPARK-21483: -- How would you encode it otherwise? > Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
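One answer that exists today, at the cost of an opaque binary column, is a Kryo-backed encoder instead of a bean encoder. A minimal sketch, assuming rows is an existing Dataset<Row> whose first column holds a double feature value:

{code:java}
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

class KryoVectorExample {
    // Kryo serializes each Vector into a single binary column: fine for
    // moving vectors through map(...), but the column is opaque to SQL.
    static Dataset<Vector> toVectors(Dataset<Row> rows) {
        Encoder<Vector> vectorEncoder = Encoders.kryo(Vector.class);
        return rows.map(
                (MapFunction<Row, Vector>) row -> Vectors.dense(row.getDouble(0)),
                vectorEncoder);
    }
}
{code}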
[jira] [Commented] (SPARK-21482) Make LabeledPoint bean-compliant so it can be used in Encoders.bean(LabeledPoint.class)
[ https://issues.apache.org/jira/browse/SPARK-21482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094295#comment-16094295 ] Aseem Bansal commented on SPARK-21482: -- I am using the Java API. I tried a simple transformation with {noformat} dataset.transform(MyCustomToLabeledPointTransformer::transformer, Encoders.bean(LabeledPoint.class)) {noformat} and it threw a bean-compliance exception. I am not sure whether the encoders are supposed to act on beans or not, but clearly they are acting on beans. > Make LabeledPoint bean-compliant so it can be used in > Encoders.bean(LabeledPoint.class) > --- > > Key: SPARK-21482 > URL: https://issues.apache.org/jira/browse/SPARK-21482 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The LabeledPoint class is currently not bean-compliant as per Spark > https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.feature.LabeledPoint > This makes it impossible to create a LabeledPoint via a dataset.transform. It > should be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
Aseem Bansal created SPARK-21483: Summary: Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class) Key: SPARK-21483 URL: https://issues.apache.org/jira/browse/SPARK-21483 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.1.0 Reporter: Aseem Bansal The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant as per Spark. This makes it impossible to create a Vector via a dataset.transform. It should be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21482) Make LabeledPoint bean-compliant so it can be used in Encoders.bean(LabeledPoint.class)
Aseem Bansal created SPARK-21482: Summary: Make LabeledPoint bean-compliant so it can be used in Encoders.bean(LabeledPoint.class) Key: SPARK-21482 URL: https://issues.apache.org/jira/browse/SPARK-21482 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.1.0 Reporter: Aseem Bansal The LabeledPoint class is currently not bean-compliant as per Spark https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.feature.LabeledPoint This makes it impossible to create a LabeledPoint via a dataset.transform. It should be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
Aseem Bansal created SPARK-21481: Summary: Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF Key: SPARK-21481 URL: https://issues.apache.org/jira/browse/SPARK-21481 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0, 2.1.0 Reporter: Aseem Bansal If we want to find the index of any input based on the hashing trick, it is possible in https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF but not in https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF. This should be allowed for feature parity. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
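For comparison, the mllib side referenced above does expose the lookup. A minimal sketch of what indexOf gives you there ("some term" is a placeholder input):

{code:java}
import org.apache.spark.mllib.feature.HashingTF;

HashingTF tf = new HashingTF();          // default: 2^20 feature buckets
int index = tf.indexOf("some term");     // hash bucket the term maps to
// The ml.feature.HashingTF transformer has no equivalent method, so there
// is no way to ask which column of the output vector a given input landed in.
{code}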
[jira] [Updated] (SPARK-21473) Running Transform on a bean which has only setters gives NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-21473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-21473: - Description: If I run the following using the Java API {code:java} dataset.map(Transformer::transform, Encoders.bean(BeanWithOnlySettersAndNoGetters.class)); {code} Then I get the below exception. I understand that it is not bean-compliant without the getters but the exception is wrong. Perhaps fixing the exception message would be a solution? {noformat} Caused by: java.lang.NullPointerException at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) at org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(JavaTypeInference.scala:55) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:89) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142) at org.apache.spark.sql.Encoders.bean(Encoders.scala) {noformat} was: If I run the following {code:java} dataset.map(Transformer::transform, Encoders.bean(BeanWithOnlySettersAndNoGetters.class)); {code} Then I get the below exception. I understand that it is not bean-compliant without the getters but the exception is wrong. Perhaps fixing the exception message would be a solution? 
{noformat} Caused by: java.lang.NullPointerException at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) at org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(JavaTypeInference.scala:55) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:89) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142) at org.apache.spark.sql.Encoders.bean(Encoders.scala) {noformat} > Running Transform on a bean which has only setters gives NullPointerException > - > > Key: SPARK-21473 > URL: https://issues.apache.org/jira/browse/SPARK-21473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > If I run the following using the Java API > {code:java} > dataset.map(Transformer::transform, > Encoders.bean(BeanWithOnlySettersAndNoGetters.class)); > {code} > Then I get the below exception. I understand that it is not bean-compliant > without the getters but the exception is wrong. Perhaps fixing the exception > message would be a solution? > {noformat} > Caused by: java.lang.NullPointerException > at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(I
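For illustration, a minimal sketch of the failing bean shape and the fix on the caller's side; the field name is hypothetical. The stack trace above points at JavaTypeInference calling TypeToken.method on a property with no read method, so adding the matching getter makes the schema inferable:

{code:java}
import java.io.Serializable;

public class BeanWithOnlySettersAndNoGetters implements Serializable {
    private String value; // hypothetical field

    public void setValue(String value) { this.value = value; }

    // Without this getter the property has no read method, which matches
    // the NullPointerException above; with it, Encoders.bean(...) can
    // infer the schema for this class.
    public String getValue() { return value; }
}
{code}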
[jira] [Created] (SPARK-21473) Running Transform on a bean which has only setters gives NullPointerException
Aseem Bansal created SPARK-21473: Summary: Running Transform on a bean which has only setters gives NullPointerException Key: SPARK-21473 URL: https://issues.apache.org/jira/browse/SPARK-21473 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Aseem Bansal If I run the following {code:java} dataset.map(Transformer::transform, Encoders.bean(BeanWithOnlySettersAndNoGetters.class)); {code} Then I get the below exception. I understand that it is not bean-compliant without the getters, but the exception is wrong. Perhaps fixing the exception message would be a solution? {noformat} Caused by: java.lang.NullPointerException at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) at org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(JavaTypeInference.scala:55) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:89) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142) at org.apache.spark.sql.Encoders.bean(Encoders.scala) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17742) Spark Launcher does not get failed state in Listener
[ https://issues.apache.org/jira/browse/SPARK-17742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954714#comment-15954714 ] Aseem Bansal commented on SPARK-17742: -- [~daanvdn] We ended up using Kafka messages to tell the web app (which was using the launcher to launch the job) whether the job had completed or failed. We dropped the Launcher's states, as they are broken. > Spark Launcher does not get failed state in Listener > - > > Key: SPARK-17742 > URL: https://issues.apache.org/jira/browse/SPARK-17742 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I tried to launch an application using the below code. This is dummy code to > reproduce the problem. I tried exiting Spark with status -1, throwing an > exception, etc., but in no case did the listener give me a failed status. But if > a Spark job returns -1 or throws an exception from the main method, it should > be considered a failure. > {code} > package com.example; > import org.apache.spark.launcher.SparkAppHandle; > import org.apache.spark.launcher.SparkLauncher; > import java.io.IOException; > public class Main2 { > public static void main(String[] args) throws IOException, > InterruptedException { > SparkLauncher launcher = new SparkLauncher() > .setSparkHome("/opt/spark2") > > .setAppResource("/home/aseem/projects/testsparkjob/build/libs/testsparkjob-1.0-SNAPSHOT.jar") > .setMainClass("com.example.Main") > .setMaster("local[2]"); > launcher.startApplication(new MyListener()); > Thread.sleep(1000 * 60); > } > } > class MyListener implements SparkAppHandle.Listener { > @Override > public void stateChanged(SparkAppHandle handle) { > System.out.println("state changed " + handle.getState()); > } > @Override > public void infoChanged(SparkAppHandle handle) { > System.out.println("info changed " + handle.getState()); > } > } > {code} > The Spark job is > {code} > package com.example; > import org.apache.spark.sql.SparkSession; > import java.io.IOException; > public class Main { > public static void main(String[] args) throws IOException { > SparkSession sparkSession = SparkSession > .builder() > .appName("" + System.currentTimeMillis()) > .getOrCreate(); > try { > for (int i = 0; i < 15; i++) { > Thread.sleep(1000); > System.out.println("sleeping 1"); > } > } catch (InterruptedException e) { > e.printStackTrace(); > } > //sparkSession.stop(); > System.exit(-1); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
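A hedged sketch of the workaround described in the comment above: the job itself reports its terminal state over Kafka instead of relying on the launcher's listener. The topic name, bootstrap servers, appId, and runJob are all placeholders, and the kafka-clients dependency is assumed:

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class JobStatusReporter {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        String appId = args[0]; // placeholder job identifier
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            try {
                runJob(); // the actual Spark work goes here
                producer.send(new ProducerRecord<>("job-status", appId, "COMPLETED"));
            } catch (Exception e) {
                producer.send(new ProducerRecord<>("job-status", appId, "FAILED"));
                throw e;
            }
        }
    }

    private static void runJob() { /* placeholder for the real job */ }
}
{code}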
[jira] [Commented] (SPARK-10413) ML models should support prediction on single instances
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857888#comment-15857888 ] Aseem Bansal commented on SPARK-10413: -- Something to look at would be https://github.com/combust/mleap which provides this on top of spark > ML models should support prediction on single instances > --- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15853629#comment-15853629 ] Aseem Bansal commented on SPARK-19449: -- [~sowen] My results are actually deterministic. No matter how many times I run it the number of true positives, true negatives, false positives, false negatives are always exactly the same. The problem is that they are always inconsistent too by exactly the same amount in the 2 implementations. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. Can run this as a simple > Java app as long as you have spark dependencies set up properly. > {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** 
DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} > }, labels, features) + "\n" + > "\n" + > "** RandomForestClassifier\n" + > "** Original **" + getInfo(model2, testData) + "\n" + > "** New **" + getInfo(new Predictor() {double > predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, > features) + "\n" + > "\n" + > ""); > } > static Dataset getDummyData(SparkSession spark, int numberRows, int > numberFeatures, int labelUpperBound) { > StructType schema = new StructType(new StructField[]{ > new StructField("label",
[jira] [Commented] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851588#comment-15851588 ] Aseem Bansal commented on SPARK-19444: -- https://github.com/apache/spark/pull/16789 > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851576#comment-15851576 ] Aseem Bansal commented on SPARK-19449: -- Doesn't the decision tree debug string print it as a series of IF-ELSE statements? I printed the debug string for the 2 random forest models and it was exactly the same. In other words, the 2 implementations should be mathematically equivalent. The random processes for selecting data should not cause any issues, as I ensured that the exact same data goes to both versions. It works for decision trees, and a random forest classifier is just a majority vote over a bunch of decision tree classifiers, so I cannot see how that could be different. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. Can run this as a simple > Java app as long as you have spark dependencies set up properly. > {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > convertDecisionTreeModel(model1, Algo.Classification()); >
RandomForestClassifier classifier = new RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} > }, labels, features) + "\n" + > "\n" + > "** RandomForestClassifier\n" + > "** Original **" + getInfo(model2, testData) + "\n" + > "** New **" + getInfo(new Predictor() {double > predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, > features) + "\n" + > "\n" + >
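The majority-vote argument in the comment above can be made concrete. For classification, mllib's RandomForestModel takes a plain majority vote over its member trees, roughly the following sketch against the public mllib API:

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;

// Sketch of the majority vote the comment attributes to a random forest
// classifier; this is what mllib's RandomForestModel does for classification.
class MajorityVote {
    static double predict(DecisionTreeModel[] trees, Vector features) {
        Map<Double, Integer> votes = new HashMap<>();
        for (DecisionTreeModel tree : trees) {
            votes.merge(tree.predict(features), 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }
}
{code}

The ml package's RandomForestClassificationModel, by contrast, averages per-class probabilities across trees (soft voting), which is one plausible source of a consistent discrepancy even when the individual trees are identical.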
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851568#comment-15851568 ] Aseem Bansal commented on SPARK-19449: -- [~srowen] I removed some extra code. The part where I did the conversion is at the end, in the convertRandomForestModel method. Basically the above code does this: - Prepare 1000 rows of data with 3 features randomly, and prepare 1000 labels randomly. I am not working on creating the model but on the conversion, so having random data is not an issue; it will just be a horrible model. - Split the data in an 80/20 ratio for training/test. - Train the ml versions of the decision tree model and random forest model using the training set. Let's call them DT1 and RF1. - Convert these to the mllib versions of the models. Let's call them DT2 and RF2. - Use the test set to predict labels using DT1, DT2, RF1, RF2. - Compare the predicted labels of DT1 with DT2. Same results. - Compare the predicted labels of RF1 with RF2. Different results. There should not be any random results here, as I have used seeds for the random number generators everywhere and then used the exact same data for doing predictions using all 4 models. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. Can run this as a simple > Java app as long as you have spark dependencies set up properly.
> {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);}
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. So the behavior between the 2 implementations is inconsistent which should not be the case. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + ""); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, 
Metadata.empty()) }); double[][] vectors = prepareData(numberRows, numberFeatures); Random random = new Random(seed); List dataTest = new ArrayList<>(); for (double[] vector : vectors) { double label = (double) random.nextInt(2); dataTest.add(RowFactory.create(label, Vectors.dense(vector))); } return spark.createDataFrame(dataTest, schema); } static double[][] prepareData(int numRows, int numFeatures) { Random random = new Random(seed); double[][] result = new double[numRows][numFeatures]; for (int row = 0; row < numRows; row++) { for (int feature = 0; feature < numFeatures; feature++) { result[row][feature] = random.nextDouble(); } } return result; } static S
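The conversion code itself is cut off above. A hedged reconstruction of its likely shape, reusing the report's own convertDecisionTreeModel helper (whose body is also truncated, stubbed out here) and mllib's public RandomForestModel constructor:

{code:java}
import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.mllib.tree.configuration.Algo;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import scala.Enumeration;

class ModelConversion {
    // Sketch only: shows how converted mllib trees would be wrapped into
    // an mllib RandomForestModel, matching the calls visible in the report.
    static RandomForestModel convertRandomForestModel(RandomForestClassificationModel model) {
        DecisionTreeModel[] trees = new DecisionTreeModel[model.trees().length];
        for (int i = 0; i < trees.length; i++) {
            trees[i] = convertDecisionTreeModel(
                    (DecisionTreeClassificationModel) model.trees()[i],
                    Algo.Classification());
        }
        return new RandomForestModel(Algo.Classification(), trees);
    }

    static DecisionTreeModel convertDecisionTreeModel(
            DecisionTreeClassificationModel model, Enumeration.Value algo) {
        // The report's helper; its body (rebuilding the mllib node tree
        // from the ml model's root node) is truncated above.
        throw new UnsupportedOperationException("see the full report");
    }
}
{code}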
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); //Dataset data = getData(spark, "libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + "** LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + 
"\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()) }); double[][] vectors = prepareData(numberRows, numberFeatures);
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. So the behavior between the 2 implementations is inconsistent which should not be the case. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); //Dataset data = getData(spark, "libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + 
"\n" + "\n" + "** LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.em
[jira] [Created] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
Aseem Bansal created SPARK-19449: Summary: Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel Key: SPARK-19449 URL: https://issues.apache.org/jira/browse/SPARK-19449 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.1.0 Reporter: Aseem Bansal I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. The below code can be used to reproduce the issue. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); //Dataset data = getData(spark, "libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + "** 
LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), n
[jira] [Created] (SPARK-19445) Please remove tylerchap...@yahoo-inc.com subscription from u...@spark.apache.org
Aseem Bansal created SPARK-19445: Summary: Please remove tylerchap...@yahoo-inc.com subscription from u...@spark.apache.org Key: SPARK-19445 URL: https://issues.apache.org/jira/browse/SPARK-19445 Project: Spark Issue Type: IT Help Components: Project Infra Affects Versions: 2.1.0 Reporter: Aseem Bansal Whenever a mail is sent to u...@spark.apache.org, I receive this email: {noformat} This is an automatically generated message. tylerchap...@yahoo-inc.com is no longer with Yahoo! Inc. Your message will not be forwarded. If you have a sales inquiry, please email yahoosa...@yahoo-inc.com and someone will follow up with you shortly. If you require assistance with a legal matter, please send a message to legal-noti...@yahoo-inc.com Thank you! {noformat} It is clear that this user is no longer available. Please remove this email address from the mailing list so we don't get so much spam. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851382#comment-15851382 ] Aseem Bansal commented on SPARK-19444: -- [~srowen] I can find the location at https://github.com/apache/spark/blob/master/docs/ml-features.md which led me to https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java#L40 but what is $example on:untyped_ops$? The imports are there, but it seems this is broken. Then this is probably a parsing issue? > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import: > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
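For reference, the missing piece named in the report is just the static import below; with it, helpers such as col(...) from the docs snippet resolve:

{code:java}
// The one line the published Tokenizer example needs to compile:
import static org.apache.spark.sql.functions.*;
{code}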
[jira] [Updated] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19444: - Priority: Minor (was: Major) > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19444) Tokenizer example does not compile without extra imports
Aseem Bansal created SPARK-19444: Summary: Tokenizer example does not compile without extra imports Key: SPARK-19444 URL: https://issues.apache.org/jira/browse/SPARK-19444 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.1.0 Reporter: Aseem Bansal The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer does not compile without the following static import import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
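For reference, a minimal self-contained sketch of the Tokenizer example with the missing static import added (the data and column names here are illustrative, not copied from the official example):
{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// This is the static import the published example needs to compile:
import static org.apache.spark.sql.functions.*;

public class TokenizerImportSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("tokenizer sketch").master("local").getOrCreate();

        StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
        });
        List<Row> data = Arrays.asList(RowFactory.create(0, "Hi I heard about Spark"));
        Dataset<Row> sentences = spark.createDataFrame(data, schema);

        Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
        Dataset<Row> tokenized = tokenizer.transform(sentences);

        // col(...) resolves only via the static import above; without it the
        // example fails to compile, which is what this ticket reports.
        tokenized.select(col("sentence"), col("words")).show(false);
    }
}
{code}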
[jira] (SPARK-19410) Links to API documentation are broken
Aseem Bansal created an issue: Spark / SPARK-19410 Links to API documentation are broken Issue Type: Documentation Affects Versions: 2.1.0 Assignee: Unassigned Components: Documentation Created: 31/Jan/17 08:55 Priority: Major Reporter: Aseem Bansal I was looking at https://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param and saw that the links to API documentation are broken
[jira] [Comment Edited] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734495#comment-15734495 ] Aseem Bansal edited comment on SPARK-10413 at 12/9/16 6:39 AM: --- Hi Is anyone working on this? And is there a JIRA ticket for having a predict method on PipelineModel? was (Author: anshbansal): Hi Is anyone working on this? > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734495#comment-15734495 ] Aseem Bansal commented on SPARK-10413: -- Hi Is anyone working on this? > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
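Until a predict method is public, the usual workaround is to wrap the single instance in a one-row DataFrame and call transform. A minimal sketch (a fragment, not the proposed API; spark and model are assumed to be an existing session and an already-fitted ml model):
{code:java}
import java.util.Collections;

import org.apache.spark.ml.linalg.SQLDataTypes;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Build a one-row DataFrame around the feature vector...
StructType schema = new StructType(new StructField[]{
        new StructField("features", SQLDataTypes.VectorType(), false, Metadata.empty())
});
Row row = RowFactory.create(Vectors.dense(1.0, 2.0, 3.0));
Dataset<Row> single = spark.createDataFrame(Collections.singletonList(row), schema);

// ...and run it through the fitted model. The overhead of this round trip
// is exactly why a public predict(Vector) would be useful.
double prediction = model.transform(single).first().<Double>getAs("prediction");
{code}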
[jira] [Commented] (SPARK-18241) If Spark Launcher fails to startApplication then handle's state does not change
[ https://issues.apache.org/jira/browse/SPARK-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631898#comment-15631898 ] Aseem Bansal commented on SPARK-18241: -- Looking at the source code after mainClass = Utils.classForName(childMainClass) at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L695 I see that the exceptions are being printed instead of being thrown/sent to the listeners. The API says that startApplication is preferred but the various failures need to be sent via the handlers otherwise the listener API is not useful. Another case where failures are not sent via the Launcher API https://issues.apache.org/jira/browse/SPARK-17742 > If Spark Launcher fails to startApplication then handle's state does not > change > --- > > Key: SPARK-18241 > URL: https://issues.apache.org/jira/browse/SPARK-18241 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I am using Spark 2.0.0. I am using > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/launcher/SparkLauncher.html > to submit my job. > If there is a failure after launcher's startapplication has been called but > before the spark job has actually started (i.e. in starting the spark process > that submits the job itself) there is > * no exception in the main thread that is submitting the job > * no exception in the job as it has not started > * no state change of the launcher > * the exception is logged in the error stream on the default logger name that > spark produces using the Job's main class. > Basically, it is not possible to catch an exception if it happens during that > time. The easiest way to reproduce it is to delete the JAR file or use an > invalid spark home while launching the job using sparkLauncher. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18241) If Spark Launcher fails to startApplication then handle's state does not change
Aseem Bansal created SPARK-18241: Summary: If Spark Launcher fails to startApplication then handle's state does not change Key: SPARK-18241 URL: https://issues.apache.org/jira/browse/SPARK-18241 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.0.0 Reporter: Aseem Bansal I am using Spark 2.0.0. I am using https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/launcher/SparkLauncher.html to submit my job. If there is a failure after the launcher's startApplication has been called but before the spark job has actually started (i.e. in starting the spark process that submits the job itself) there is
* no exception in the main thread that is submitting the job
* no exception in the job as it has not started
* no state change of the launcher
* the exception is logged in the error stream on the default logger name that spark produces using the Job's main class.
Basically, it is not possible to catch an exception if it happens during that time. The easiest way to reproduce it is to delete the JAR file or use an invalid spark home while launching the job using SparkLauncher. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
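As a partial workaround until such failures are propagated, the handle can be polled with a deadline, treating a handle that never reaches a final state as failed. A minimal sketch (paths and timeout are illustrative; the calling method must tolerate InterruptedException):
{code:java}
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/app.jar")
        .setMainClass("com.example.Main")
        .setMaster("local[2]")
        .startApplication();

// If the launcher dies before the job starts, the state never becomes
// final, so give up after a deadline instead of waiting forever.
long deadline = System.currentTimeMillis() + 60_000;
while (!handle.getState().isFinal()) {
    if (System.currentTimeMillis() > deadline) {
        handle.kill();
        throw new IllegalStateException("No final state, last seen: " + handle.getState());
    }
    Thread.sleep(500);
}
{code}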
[jira] [Comment Edited] (SPARK-17742) Spark Launcher does not get failed state in Listener
[ https://issues.apache.org/jira/browse/SPARK-17742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15535315#comment-15535315 ] Aseem Bansal edited comment on SPARK-17742 at 9/30/16 7:35 AM: --- I dug into the launcher code to see if I can figure out how it is working and see if I could find the bug. But when I reached LauncherServer's ServerConnection's handle method and found that this is socket programming I found it harder to find where the messages are coming from. Still trying to figure out but maybe someone who knows spark code better will find it easier to find the bug. was (Author: anshbansal): I dug into the launcher code to see if I can figure out how it is working and see if I could find the bug. But when I reached LauncherServer's ServerConnection's handle method and found that this is socket programming I found it harder to find where the messages are coming from. Still trying to figure out maybe someone who knows spark code better will find it easier to find the bug. > Spark Launcher does not get failed state in Listener > - > > Key: SPARK-17742 > URL: https://issues.apache.org/jira/browse/SPARK-17742 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I tried to launch an application using the below code. This is dummy code to > reproduce the problem. I tried exiting spark with status -1, throwing an > exception etc. but in no case did the listener give me failed status. But if > a spark job returns -1 or throws an exception from the main method it should > be considered as a failure. > {code} > package com.example; > import org.apache.spark.launcher.SparkAppHandle; > import org.apache.spark.launcher.SparkLauncher; > import java.io.IOException; > public class Main2 { > public static void main(String[] args) throws IOException, > InterruptedException { > SparkLauncher launcher = new SparkLauncher() > .setSparkHome("/opt/spark2") > > .setAppResource("/home/aseem/projects/testsparkjob/build/libs/testsparkjob-1.0-SNAPSHOT.jar") > .setMainClass("com.example.Main") > .setMaster("local[2]"); > launcher.startApplication(new MyListener()); > Thread.sleep(1000 * 60); > } > } > class MyListener implements SparkAppHandle.Listener { > @Override > public void stateChanged(SparkAppHandle handle) { > System.out.println("state changed " + handle.getState()); > } > @Override > public void infoChanged(SparkAppHandle handle) { > System.out.println("info changed " + handle.getState()); > } > } > {code} > The spark job is > {code} > package com.example; > import org.apache.spark.sql.SparkSession; > import java.io.IOException; > public class Main { > public static void main(String[] args) throws IOException { > SparkSession sparkSession = SparkSession > .builder() > .appName("" + System.currentTimeMillis()) > .getOrCreate(); > try { > for (int i = 0; i < 15; i++) { > Thread.sleep(1000); > System.out.println("sleeping 1"); > } > } catch (InterruptedException e) { > e.printStackTrace(); > } > //sparkSession.stop(); > System.exit(-1); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17742) Spark Launcher does not get failed state in Listener
[ https://issues.apache.org/jira/browse/SPARK-17742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15535315#comment-15535315 ] Aseem Bansal commented on SPARK-17742: -- I dug into the launcher code to see if I can figure out how it is working and see if I could find the bug. But when I reached LauncherServer's ServerConnection's handle method and found that this is socket programming I found it harder to find where the messages are coming from. Still trying to figure out maybe someone who knows spark code better will find it easier to find the bug. > Spark Launcher does not get failed state in Listener > - > > Key: SPARK-17742 > URL: https://issues.apache.org/jira/browse/SPARK-17742 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I tried to launch an application using the below code. This is dummy code to > reproduce the problem. I tried exiting spark with status -1, throwing an > exception etc. but in no case did the listener give me failed status. But if > a spark job returns -1 or throws an exception from the main method it should > be considered as a failure. > {code} > package com.example; > import org.apache.spark.launcher.SparkAppHandle; > import org.apache.spark.launcher.SparkLauncher; > import java.io.IOException; > public class Main2 { > public static void main(String[] args) throws IOException, > InterruptedException { > SparkLauncher launcher = new SparkLauncher() > .setSparkHome("/opt/spark2") > > .setAppResource("/home/aseem/projects/testsparkjob/build/libs/testsparkjob-1.0-SNAPSHOT.jar") > .setMainClass("com.example.Main") > .setMaster("local[2]"); > launcher.startApplication(new MyListener()); > Thread.sleep(1000 * 60); > } > } > class MyListener implements SparkAppHandle.Listener { > @Override > public void stateChanged(SparkAppHandle handle) { > System.out.println("state changed " + handle.getState()); > } > @Override > public void infoChanged(SparkAppHandle handle) { > System.out.println("info changed " + handle.getState()); > } > } > {code} > The spark job is > {code} > package com.example; > import org.apache.spark.sql.SparkSession; > import java.io.IOException; > public class Main { > public static void main(String[] args) throws IOException { > SparkSession sparkSession = SparkSession > .builder() > .appName("" + System.currentTimeMillis()) > .getOrCreate(); > try { > for (int i = 0; i < 15; i++) { > Thread.sleep(1000); > System.out.println("sleeping 1"); > } > } catch (InterruptedException e) { > e.printStackTrace(); > } > //sparkSession.stop(); > System.exit(-1); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17742) Spark Launcher does not get failed state in Listener
Aseem Bansal created SPARK-17742: Summary: Spark Launcher does not get failed state in Listener Key: SPARK-17742 URL: https://issues.apache.org/jira/browse/SPARK-17742 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.0.0 Reporter: Aseem Bansal I tried to launch an application using the below code. This is dummy code to reproduce the problem. I tried exiting spark with status -1, throwing an exception etc. but in no case did the listener give me a failed status. But if a spark job returns -1 or throws an exception from the main method it should be considered a failure.
{code}
package com.example;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

import java.io.IOException;

public class Main2 {
    public static void main(String[] args) throws IOException, InterruptedException {
        SparkLauncher launcher = new SparkLauncher()
                .setSparkHome("/opt/spark2")
                .setAppResource("/home/aseem/projects/testsparkjob/build/libs/testsparkjob-1.0-SNAPSHOT.jar")
                .setMainClass("com.example.Main")
                .setMaster("local[2]");
        launcher.startApplication(new MyListener());
        Thread.sleep(1000 * 60);
    }
}

class MyListener implements SparkAppHandle.Listener {
    @Override
    public void stateChanged(SparkAppHandle handle) {
        System.out.println("state changed " + handle.getState());
    }

    @Override
    public void infoChanged(SparkAppHandle handle) {
        System.out.println("info changed " + handle.getState());
    }
}
{code}
The spark job is
{code}
package com.example;

import org.apache.spark.sql.SparkSession;

import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        SparkSession sparkSession = SparkSession
                .builder()
                .appName("" + System.currentTimeMillis())
                .getOrCreate();
        try {
            for (int i = 0; i < 15; i++) {
                Thread.sleep(1000);
                System.out.println("sleeping 1");
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        //sparkSession.stop();
        System.exit(-1);
    }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
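A sketch of a variant of the reproduction above that at least makes the outcome observable to the launching thread: block on a latch until the handle reports a final state. Note this does not fix the bug reported here; if FAILED is never delivered, the wait simply times out:
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class Main3 {
    public static void main(String[] args) throws Exception {
        final CountDownLatch done = new CountDownLatch(1);

        new SparkLauncher()
                .setSparkHome("/opt/spark2")
                .setAppResource("/path/to/testsparkjob-1.0-SNAPSHOT.jar")
                .setMainClass("com.example.Main")
                .setMaster("local[2]")
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle handle) {
                        System.out.println("state changed " + handle.getState());
                        if (handle.getState().isFinal()) {
                            done.countDown();
                        }
                    }

                    @Override
                    public void infoChanged(SparkAppHandle handle) {
                        System.out.println("info changed " + handle.getState());
                    }
                });

        // Replaces the fixed Thread.sleep(1000 * 60) with a bounded wait.
        if (!done.await(2, TimeUnit.MINUTES)) {
            System.out.println("no final state within timeout");
        }
    }
}
{code}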
[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only
[ https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495960#comment-15495960 ] Aseem Bansal commented on SPARK-17560: -- Can you share where this option needs to be set? Maybe I can try and add a pull request unless it is easier for you to just add a PR yourself instead of explaining. > SQLContext tables returns table names in lower case only > > > Key: SPARK-17560 > URL: https://issues.apache.org/jira/browse/SPARK-17560 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I registered a table using > dataSet.createOrReplaceTempView("TestTable"); > Then I tried to get the list of tables using > sparkSession.sqlContext().tableNames() > but the name that I got was testtable. It used to give table names in proper > case in Spark 1.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
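Assuming the option under discussion is {{spark.sql.caseSensitive}} (the thread never names it, so this is a guess), a sketch of the two places it can be set:
{code:java}
SparkSession spark = SparkSession.builder()
        .appName("my app")
        .master("local")
        // either when building the session...
        .config("spark.sql.caseSensitive", "true")
        .getOrCreate();

// ...or later, on the session's runtime configuration
spark.conf().set("spark.sql.caseSensitive", "true");
{code}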
[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only
[ https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495906#comment-15495906 ] Aseem Bansal commented on SPARK-17560: -- Looked through
https://spark.apache.org/docs/2.0.0/sql-programming-guide.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/SparkSession.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/SparkConf.html
and none of them say anything about this parameter. > SQLContext tables returns table names in lower case only > > > Key: SPARK-17560 > URL: https://issues.apache.org/jira/browse/SPARK-17560 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I registered a table using > dataSet.createOrReplaceTempView("TestTable"); > Then I tried to get the list of tables using > sparkSession.sqlContext().tableNames() > but the name that I got was testtable. It used to give table names in proper > case in Spark 1.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17561) DataFrameWriter documentation formatting problems
[ https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-17561: - Description: I visited this page https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html and saw that the docs have formatting problems !screenshot-1.png! was: I visited this page https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html and saw that the docs have formatting problems > DataFrameWriter documentation formatting problems > - > > Key: SPARK-17561 > URL: https://issues.apache.org/jira/browse/SPARK-17561 > Project: Spark > Issue Type: Documentation >Reporter: Aseem Bansal > Attachments: screenshot-1.png > > > I visited this page > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html > and saw that the docs have formatting problems > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17561) DataFrameWriter documentation formatting problems
[ https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-17561: - Description: I visited this page https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html and saw that the docs have formatting problems !screenshot-1.png! Tried with browser cache disabled. Same issue was: I visited this page https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html and saw that the docs have formatting problems !screenshot-1.png! > DataFrameWriter documentation formatting problems > - > > Key: SPARK-17561 > URL: https://issues.apache.org/jira/browse/SPARK-17561 > Project: Spark > Issue Type: Documentation >Reporter: Aseem Bansal > Attachments: screenshot-1.png > > > I visited this page > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html > and saw that the docs have formatting problems > !screenshot-1.png! > Tried with browser cache disabled. Same issue -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17561) DataFrameWriter documentation formatting problems
[ https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-17561: - Attachment: screenshot-1.png > DataFrameWriter documentation formatting problems > - > > Key: SPARK-17561 > URL: https://issues.apache.org/jira/browse/SPARK-17561 > Project: Spark > Issue Type: Documentation >Reporter: Aseem Bansal > Attachments: screenshot-1.png > > > I visited this page > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html > and saw that the docs have formatting problems -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17561) DataFrameWriter documentation formatting problems
Aseem Bansal created SPARK-17561: Summary: DataFrameWriter documentation formatting problems Key: SPARK-17561 URL: https://issues.apache.org/jira/browse/SPARK-17561 Project: Spark Issue Type: Documentation Reporter: Aseem Bansal Attachments: screenshot-1.png I visited this page https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html and saw that the docs have formatting problems -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only
[ https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495862#comment-15495862 ] Aseem Bansal commented on SPARK-17560: -- No I did not. Where? > SQLContext tables returns table names in lower case only > > > Key: SPARK-17560 > URL: https://issues.apache.org/jira/browse/SPARK-17560 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I registered a table using > dataSet.createOrReplaceTempView("TestTable"); > Then I tried to get the list of tables using > sparkSession.sqlContext().tableNames() > but the name that I got was testtable. It used to give table names in proper > case in Spark 1.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17560) SQLContext tables returns table names in lower case only
[ https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495862#comment-15495862 ] Aseem Bansal edited comment on SPARK-17560 at 9/16/16 9:38 AM: --- No I did not. Where? Had not set that in Spark 1.4 either was (Author: anshbansal): No I did not. Where? > SQLContext tables returns table names in lower case only > > > Key: SPARK-17560 > URL: https://issues.apache.org/jira/browse/SPARK-17560 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I registered a table using > dataSet.createOrReplaceTempView("TestTable"); > Then I tried to get the list of tables using > sparkSession.sqlContext().tableNames() > but the name that I got was testtable. It used to give table names in proper > case in Spark 1.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17560) SQLContext tables returns table names in lower case only
Aseem Bansal created SPARK-17560: Summary: SQLContext tables returns table names in lower case only Key: SPARK-17560 URL: https://issues.apache.org/jira/browse/SPARK-17560 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Aseem Bansal I registered a table using dataSet.createOrReplaceTempView("TestTable"); Then I tried to get the list of tables using sparkSession.sqlContext().tableNames() but the name that I got was testtable. It used to give table names in proper case in Spark 1.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17307) Document what all access is needed on S3 bucket when trying to save a model
[ https://issues.apache.org/jira/browse/SPARK-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1540#comment-1540 ] Aseem Bansal commented on SPARK-17307: -- Not adding it there would be fine, but there needs to be something. Also, to contribute I tried searching for the file but could not find it. In which branch are you working? > Document what all access is needed on S3 bucket when trying to save a model > --- > > Key: SPARK-17307 > URL: https://issues.apache.org/jira/browse/SPARK-17307 > Project: Spark > Issue Type: Documentation >Reporter: Aseem Bansal >Priority: Minor > > I faced this lack of documentation when I was trying to save a model to S3. > Initially I thought it should be only write. Then I found it also needs > delete to delete temporary files. Now I requested access for delete and tried > again and I am get the error > Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: > org.jets3t.service.S3ServiceException: S3 PUT failed for > '/dev-qa_%24folder%24' XML Error Message > To reproduce this error the below can be used > {code} > SparkSession sparkSession = SparkSession > .builder() > .appName("my app") > .master("local") > .getOrCreate(); > JavaSparkContext jsc = new > JavaSparkContext(sparkSession.sparkContext()); > jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ); > jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", ACCESS KEY>); > //Create a Pipelinemode > > pipelineModel.write().overwrite().save("s3n:///dev-qa/modelTest"); > {code} > This back and forth could be avoided if it was clearly mentioned what all > access spark needs to write to S3. Also would be great if why all of the > access is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17307) Document what all access is needed on S3 bucket when trying to save a model
[ https://issues.apache.org/jira/browse/SPARK-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454791#comment-15454791 ] Aseem Bansal commented on SPARK-17307: -- I would add that bit of information at http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/util/MLWritable.html#save(java.lang.String) Something like "it needs complete read/write access when used with S3" should be enough. > Document what all access is needed on S3 bucket when trying to save a model > --- > > Key: SPARK-17307 > URL: https://issues.apache.org/jira/browse/SPARK-17307 > Project: Spark > Issue Type: Documentation >Reporter: Aseem Bansal >Priority: Minor > > I faced this lack of documentation when I was trying to save a model to S3. > Initially I thought it should be only write. Then I found it also needs > delete to delete temporary files. Now I requested access for delete and tried > again and I am get the error > Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: > org.jets3t.service.S3ServiceException: S3 PUT failed for > '/dev-qa_%24folder%24' XML Error Message > To reproduce this error the below can be used > {code} > SparkSession sparkSession = SparkSession > .builder() > .appName("my app") > .master("local") > .getOrCreate(); > JavaSparkContext jsc = new > JavaSparkContext(sparkSession.sparkContext()); > jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ); > jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", ACCESS KEY>); > //Create a Pipelinemode > > pipelineModel.write().overwrite().save("s3n:///dev-qa/modelTest"); > {code} > This back and forth could be avoided if it was clearly mentioned what all > access spark needs to write to S3. Also would be great if why all of the > access is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17307) Document what all access is needed on S3 bucket when trying to save a model
Aseem Bansal created SPARK-17307: Summary: Document what all access is needed on S3 bucket when trying to save a model Key: SPARK-17307 URL: https://issues.apache.org/jira/browse/SPARK-17307 Project: Spark Issue Type: Documentation Reporter: Aseem Bansal I faced this lack of documentation when I was trying to save a model to S3. Initially I thought it should be only write. Then I found it also needs delete to delete temporary files. Now I requested access for delete and tried again and I got the error
Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/dev-qa_%24folder%24' XML Error Message
To reproduce this error the below can be used
{code}
SparkSession sparkSession = SparkSession
        .builder()
        .appName("my app")
        .master("local")
        .getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(sparkSession.sparkContext());
jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", );
jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", );
//Create a PipelineModel
pipelineModel.write().overwrite().save("s3n:///dev-qa/modelTest");
{code}
This back and forth could be avoided if it was clearly mentioned what all access spark needs to write to S3. It would also be great to explain why all of the access is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17012) Reading data frames via CSV - Allow to specify default value for integers
[ https://issues.apache.org/jira/browse/SPARK-17012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-17012: - Description: Currently the option that we have in DataFrameReader is nullValue, which allows us one default. But say in our data frame we have strings and integers and we want to specify the default for strings and integers differently; that is currently not possible. If it is done for different data types then it should be possible to allow specifying the schema to be nullable false when inferring the schema (as a new option). was:Currently the option that we have in DataFrameReader is nullValue which allows us one default. But say in our data frame we have string and integers and we want to specify the default for strings and integers differently that is currently not possible. > Reading data frames via CSV - Allow to specify default value for integers > - > > Key: SPARK-17012 > URL: https://issues.apache.org/jira/browse/SPARK-17012 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > Currently the option that we have in DataFrameReader is nullValue, which > allows us one default. But say in our data frame we have strings and integers > and we want to specify the default for strings and integers differently; that > is currently not possible. > If it is done for different data types then it should be possible to allow > specifying the schema to be nullable false when inferring the schema (as a new > option). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17012) Reading data frames via CSV - Allow to specify default value for integers
Aseem Bansal created SPARK-17012: Summary: Reading data frames via CSV - Allow to specify default value for integers Key: SPARK-17012 URL: https://issues.apache.org/jira/browse/SPARK-17012 Project: Spark Issue Type: Improvement Affects Versions: 2.0.0 Reporter: Aseem Bansal Currently the option that we have in DataFrameReader is nullValue, which allows us one default. But say in our data frame we have strings and integers and we want to specify the default for strings and integers differently; that is currently not possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
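Until per-type defaults exist, one workaround is to read with the single nullValue and apply type-specific defaults afterwards via the na() functions. A minimal sketch (column names are illustrative; spark is an existing session):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read()
        .option("header", "true")
        .option("nullValue", "")      // the single shared null marker
        .option("inferSchema", "true")
        .csv("/path/to/data.csv");

// Apply a different default per column type after the read:
Dataset<Row> withDefaults = df
        .na().fill(0L, new String[]{"someIntColumn"})
        .na().fill("unknown", new String[]{"someStringColumn"});
{code}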
[jira] [Commented] (SPARK-16893) Spark CSV Provider option is not documented
[ https://issues.apache.org/jira/browse/SPARK-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409260#comment-15409260 ] Aseem Bansal commented on SPARK-16893: -- Yes. I would expect it to work without the use of the format function, as Spark's documentation does not say anything about needing to use format when using the csv function. > Spark CSV Provider option is not documented > --- > > Key: SPARK-16893 > URL: https://issues.apache.org/jira/browse/SPARK-16893 > Project: Spark > Issue Type: Documentation >Affects Versions: 2.0.0 >Reporter: Aseem Bansal >Priority: Minor > > I was working with databricks spark csv library and came across an error. I > have logged the issue in their github but it would be good to document that > in Apache Spark's documentation also > I faced it with CSV. Someone else faced that with JSON > http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file > Complete Issue details here > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16893) Spark CSV Provider option is not documented
[ https://issues.apache.org/jira/browse/SPARK-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409183#comment-15409183 ] Aseem Bansal commented on SPARK-16893: -- Reading a CSV causes an exception. Code used and exception are below. Also present in the github issue that I have referenced here.
{code}
public static void main(String[] args) {
    SparkSession spark = SparkSession
            .builder()
            .appName("my app")
            .getOrCreate();

    Dataset df = spark.read()
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .option("nullValue", "")
            .csv("/home/aseem/data.csv");
    df.show();
}
{code}
bq. Exception in thread "main" java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
People need to use format("csv"). I think that is counterintuitive seeing that I am using the CSV method. > Spark CSV Provider option is not documented > --- > > Key: SPARK-16893 > URL: https://issues.apache.org/jira/browse/SPARK-16893 > Project: Spark > Issue Type: Documentation >Affects Versions: 2.0.0 >Reporter: Aseem Bansal >Priority: Minor > > I was working with databricks spark csv library and came across an error. I > have logged the issue in their github but it would be good to document that > in Apache Spark's documentation also > I faced it with CSV. Someone else faced that with JSON > http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file > Complete Issue details here > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
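For reference, a sketch of the workaround the error message itself suggests: with spark-csv on the classpath both providers register the short name "csv", so the provider has to be named by its fully qualified class (whether the external DefaultSource15 also resolves this way is not verified here):
{code:java}
// Name the built-in Spark 2.x source explicitly to break the tie:
Dataset<Row> df = spark.read()
        .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
        .option("header", "true")
        .option("nullValue", "")
        .load("/home/aseem/data.csv");

// Alternatively, drop the com.databricks:spark-csv dependency entirely;
// Spark 2.0 ships a CSV source built in, and plain .csv(path) then works.
{code}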
[jira] [Comment Edited] (SPARK-16895) Reading empty string from csv has changed behaviour
[ https://issues.apache.org/jira/browse/SPARK-16895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408892#comment-15408892 ] Aseem Bansal edited comment on SPARK-16895 at 8/5/16 5:19 AM: -- I see that this is a duplicate. Regarding it being a bug or not, I heard someone say this about frameworks: > If a feature is not documented it does not exist. If a change is not > documented then it is a bug. was (Author: anshbansal): I understand that it is duplicate. Regarding it being a bug or not I heard someone say this. > If a feature is not documented it does not exist. If a change is not > documented then it is a bug. > Reading empty string from csv has changed behaviour > --- > > Key: SPARK-16895 > URL: https://issues.apache.org/jira/browse/SPARK-16895 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I have a file called test.csv > "a" > "" > When I read it in Spark 1.4 I get an empty string as value. When I read it in > 2.0 I get "null" as the String. > The testing code is same as mentioned at > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16895) Reading empty string from csv has changed behaviour
[ https://issues.apache.org/jira/browse/SPARK-16895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408892#comment-15408892 ] Aseem Bansal commented on SPARK-16895: -- I understand that it is duplicate. Regarding it being a bug or not I heard someone say this. > If a feature is not documented it does not exist. If a change is not > documented then it is a bug. > Reading empty string from csv has changed behaviour > --- > > Key: SPARK-16895 > URL: https://issues.apache.org/jira/browse/SPARK-16895 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I have a file called test.csv > "a" > "" > When I read it in Spark 1.4 I get an empty string as value. When I read it in > 2.0 I get "null" as the String. > The testing code is same as mentioned at > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16896) Loading csv with duplicate column names
Aseem Bansal created SPARK-16896: Summary: Loading csv with duplicate column names Key: SPARK-16896 URL: https://issues.apache.org/jira/browse/SPARK-16896 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Aseem Bansal It would be great if the library allowed us to load a csv with duplicate column names. I understand that having duplicate columns in the data is odd, but sometimes we get data that has duplicate columns. Getting upstream data like that can happen. We may choose to ignore them, but currently there is no way to drop them as we are not able to load the file at all. Currently, as a pre-processing step, I loaded the data into R, changed the column names and then made a fixed version with which the Spark Java API can work. But to talk about other options: R, for example, has read.csv, which automatically takes care of such a situation by appending a number to the column name. Case sensitivity in column names can also cause problems. I mean if we have columns like ColumnName, columnName I may want to keep them separate. But the option to do this is not documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
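A sketch of an R-free workaround, assuming the file's first row is the header: read with the header disabled so the duplicate names never become column names, then assign unique names by hand (the names here are illustrative):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Read without the header so duplicate names are never used as columns...
Dataset<Row> raw = spark.read()
        .option("header", "false")
        .csv("/path/to/duplicate-columns.csv");

// ...assign de-duplicated names manually (R's read.csv does this by
// appending a number), then drop the old header row.
Dataset<Row> named = raw.toDF("id", "name", "name_1");
Dataset<Row> data = named.filter(col("id").notEqual("id"));
{code}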
[jira] [Commented] (SPARK-16896) Loading csv with duplicate column names
[ https://issues.apache.org/jira/browse/SPARK-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407604#comment-15407604 ] Aseem Bansal commented on SPARK-16896: -- [~hyukjin.kwon] cc > Loading csv with duplicate column names > --- > > Key: SPARK-16896 > URL: https://issues.apache.org/jira/browse/SPARK-16896 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > It would be great if the library allows us to load csv with duplicate column > names. I understand that having duplicate columns in the data is odd but > sometimes we get data that has duplicate columns. Getting upstream data like > that can happen. We may choose to ignore them but currently there is no way > to drop those as we are not able to load them at all. Currently as a > pre-processing I loaded the data into R, changed the column names and then > make a fixed version with which Spark Java API can work. > But if talk about other options, e.g. R has read.csv which automatically > takes care of such situation by appending a number to the column name. > Also case sensitivity in column names can also cause problems. I mean if we > have columns like > ColumnName, columnName > I may want to have them as separate. But the option to do this is not > documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16893) Spark CSV Provider option is not documented
[ https://issues.apache.org/jira/browse/SPARK-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407601#comment-15407601 ] Aseem Bansal commented on SPARK-16893: -- [~hyukjin.kwon] cc > Spark CSV Provider option is not documented > --- > > Key: SPARK-16893 > URL: https://issues.apache.org/jira/browse/SPARK-16893 > Project: Spark > Issue Type: Documentation >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I was working with databricks spark csv library and came across an error. I > have logged the issue in their github but it would be good to document that > in Apache Spark's documentation also > I faced it with CSV. Someone else faced that with JSON > http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file > Complete Issue details here > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16895) Reading empty string from csv has changed behaviour
[ https://issues.apache.org/jira/browse/SPARK-16895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407602#comment-15407602 ] Aseem Bansal commented on SPARK-16895: -- [~hyukjin.kwon] cc > Reading empty string from csv has changed behaviour > --- > > Key: SPARK-16895 > URL: https://issues.apache.org/jira/browse/SPARK-16895 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I have a file called test.csv > "a" > "" > When I read it in Spark 1.4 I get an empty string as value. When I read it in > 2.0 I get "null" as the String. > The testing code is same as mentioned at > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16895) Reading empty string from csv has changed behaviour
Aseem Bansal created SPARK-16895: Summary: Reading empty string from csv has changed behaviour Key: SPARK-16895 URL: https://issues.apache.org/jira/browse/SPARK-16895 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Aseem Bansal I have a file called test.csv "a" "" When I read it in Spark 1.4 I get an empty string as value. When I read it in 2.0 I get "null" as the String. The testing code is same as mentioned at https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
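A sketch of a workaround for code that depends on the Spark 1.4 behaviour: read the file as-is and turn the string nulls back into empty strings afterwards (spark is an existing session):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read().csv("/home/aseem/test.csv");

// Spark 2.0 reads "" as null; fill string columns with "" to get the
// 1.4-era empty strings back.
Dataset<Row> restored = df.na().fill("");
{code}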
[jira] [Updated] (SPARK-16893) Spark CSV Provider option is not documented
[ https://issues.apache.org/jira/browse/SPARK-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-16893: - Description: I was working with databricks spark csv library and came across an error. I have logged the issue in their github but it would be good to document that in Apache Spark's documentation also I faced it with CSV. Someone else faced that with JSON http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file Complete Issue details here https://github.com/databricks/spark-csv/issues/367 was: I was working with databricks spark csv library and came across an error. I have logged the issue in their github but it would be good to document that in Apache Spark's documentation also Details here https://github.com/databricks/spark-csv/issues/367 > Spark CSV Provider option is not documented > --- > > Key: SPARK-16893 > URL: https://issues.apache.org/jira/browse/SPARK-16893 > Project: Spark > Issue Type: Documentation >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I was working with databricks spark csv library and came across an error. I > have logged the issue in their github but it would be good to document that > in Apache Spark's documentation also > I faced it with CSV. Someone else faced that with JSON > http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file > Complete Issue details here > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16893) Spark CSV Provider option is not documented
Aseem Bansal created SPARK-16893: Summary: Spark CSV Provider option is not documented Key: SPARK-16893 URL: https://issues.apache.org/jira/browse/SPARK-16893 Project: Spark Issue Type: Documentation Affects Versions: 2.0.0 Reporter: Aseem Bansal I was working with databricks spark csv library and came across an error. I have logged the issue in their github but it would be good to document that in Apache Spark's documentation also Details here https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9678) HTTP request to BlockManager port yields exception
[ https://issues.apache.org/jira/browse/SPARK-9678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662321#comment-14662321 ] Aseem Bansal commented on SPARK-9678: - I understand. Just thought to mention that. > HTTP request to BlockManager port yields exception > -- > > Key: SPARK-9678 > URL: https://issues.apache.org/jira/browse/SPARK-9678 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.1 > Environment: Ubuntu 14.0.4 >Reporter: Aseem Bansal >Priority: Minor > > I was going through the quick start for spark 1.4.1 at > http://spark.apache.org/docs/latest/quick-start.html. I am using pySpark. > Also the exact version that I am using is spark-1.4.1-bin-hadoop2.4 > The quick start has textFile = sc.textFile("README.md"). I ran that and then > the following text appeared in the command line > {noformat} > 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with > curMem=0, maxMem=278302556 > 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 140.5 KB, free 265.3 MB) > 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with > curMem=143840, maxMem=278302556 > 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 12.3 KB, free 265.3 MB) > 15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on localhost:53311 (size: 12.3 KB, free: 265.4 MB) > 15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at > NativeMethodAccessorImpl.java:-2 > {noformat} > I saw that there was an IP in these logs i.e. localhost:53311 > I tried connecting to it via Google Chrome and got an exception. > {noformat} > >>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection > >>> from /127.0.0.1:54056 > io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds > 2147483647: 5135603447292250196 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9678) Exception while going through quick start
[ https://issues.apache.org/jira/browse/SPARK-9678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-9678: Description: I was going through the quick start for spark 1.4.1 at http://spark.apache.org/docs/latest/quick-start.html. I am using pySpark. Also the exact version that I am using is spark-1.4.1-bin-hadoop2.4 The quick start has textFile = sc.textFile("README.md"). I ran that and then the following text appeared in the command line {noformat} 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with curMem=0, maxMem=278302556 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 140.5 KB, free 265.3 MB) 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=143840, maxMem=278302556 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 265.3 MB) 15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53311 (size: 12.3 KB, free: 265.4 MB) 15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2 {noformat} I saw that there was an IP in these logs i.e. localhost:53311 I tried connecting to it via Google Chrome and got an exception. {noformat} >>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection >>> from /127.0.0.1:54056 io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 5135603447292250196 - discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) {noformat} was: I was going through the quick start for spark 1.4.1 at http://spark.apache.org/docs/latest/quick-start.html. I am using pySpark. Also the exact version that I am using is spark-1.4.1-bin-hadoop2.4 The quick start has textFile = sc.textFile("README.md"). 
I ran that and then the following text appeared in the command line 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with curMem=0, maxMem=278302556 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 140.5 KB, free 265.3 MB) 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=143840, maxMem=278302556 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 265.3 MB) 15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53311 (size: 12.3 KB, free: 265.4 MB) 15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2 I saw that there was an IP in these logs i.e. localhost:53311 I tried connecting to it via Google Chrome and got an exception. >>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection >>> from /127.0.0.1:54056 io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 5135603447292250196 - discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.de
[jira] [Updated] (SPARK-9678) Exception while going through quick start
[ https://issues.apache.org/jira/browse/SPARK-9678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-9678:
Description:
I was going through the quick start for Spark 1.4.1 at http://spark.apache.org/docs/latest/quick-start.html. I am using PySpark. The exact version that I am using is spark-1.4.1-bin-hadoop2.4.

The quick start has textFile = sc.textFile("README.md"). I ran that, and the following text appeared in the command line:
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with curMem=0, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 140.5 KB, free 265.3 MB)
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=143840, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 265.3 MB)
15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53311 (size: 12.3 KB, free: 265.4 MB)
15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
I saw a host and port in these logs, i.e. localhost:53311. I tried connecting to it via Google Chrome and got an exception:
>>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection from /127.0.0.1:54056
io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 5135603447292250196 - discarded
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at java.lang.Thread.run(Thread.java:745)

was:
I was going through the quick start for Spark 1.4.1 at http://spark.apache.org/docs/latest/quick-start.html. I am using PySpark.

The quick start has textFile = sc.textFile("README.md"). I ran that, and the following text appeared in the command line:
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with curMem=0, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 140.5 KB, free 265.3 MB)
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=143840, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 265.3 MB)
15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53311 (size: 12.3 KB, free: 265.4 MB)
15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
I saw a host and port in these logs, i.e. localhost:53311. I tried connecting to it via Google Chrome and got an exception:
>>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection from /127.0.0.1:54056
io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 5135603447292250196 - discarded
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at java.lang.Thread.run(Thread.java:745)
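The huge frame length in the warning is not random. localhost:53311 is the block manager's internal data-transfer port, which speaks Spark's length-prefixed Netty protocol rather than HTTP, and the LengthFieldBasedFrameDecoder in the stack trace reads the first 8 bytes of each incoming message as a big-endian frame length. The first 8 bytes Chrome sends are the start of its HTTP request line, "GET / HT"; read as a 64-bit integer, those bytes are exactly the 5135603447292250196 in the log, which exceeds the 2147483647 (Integer.MAX_VALUE) cap, so the frame is discarded and the connection dropped. A quick sanity check (plain Python, no Spark needed):
{code:python}
import struct

# The first 8 bytes a browser sends on a fresh HTTP connection.
header = b"GET / HT"

# Interpret them the way an 8-byte, big-endian length field would be read.
(frame_length,) = struct.unpack(">q", header)

print(frame_length)               # 5135603447292250196 -- exactly the value in the warning
print(frame_length > 2147483647)  # True: exceeds the decoder's max frame size
{code}
So the exception is expected behavior when a non-Spark client talks to this port; the browsable Web UI lives on a separate port (4040 by default in Spark 1.4.1).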
[jira] [Created] (SPARK-9678) Exception while going through quick start
Aseem Bansal created SPARK-9678:
---
Summary: Exception while going through quick start
Key: SPARK-9678
URL: https://issues.apache.org/jira/browse/SPARK-9678
Project: Spark
Issue Type: Bug
Affects Versions: 1.4.1
Environment: Ubuntu 14.04
Reporter: Aseem Bansal

I was going through the quick start for Spark 1.4.1 at http://spark.apache.org/docs/latest/quick-start.html. I am using PySpark.

The quick start has textFile = sc.textFile("README.md"). I ran that, and the following text appeared in the command line:
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with curMem=0, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 140.5 KB, free 265.3 MB)
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=143840, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 265.3 MB)
15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53311 (size: 12.3 KB, free: 265.4 MB)
15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
I saw a host and port in these logs, i.e. localhost:53311. I tried connecting to it via Google Chrome and got an exception:
>>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection from /127.0.0.1:54056
io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 5135603447292250196 - discarded
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at java.lang.Thread.run(Thread.java:745)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org