[jira] [Comment Edited] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094607#comment-16094607 ] Aseem Bansal edited comment on SPARK-21483 at 7/20/17 12:29 PM: Some pseudocode to show what I am trying to achieve {code:java} class MyTransformer implements Serializable { public FeaturesAndLabel transform(RawData rawData) { //Some logic which creates Features and Labels from raw data. Raw data is just a Java bean //FeaturesAndLabel is a bean which contains a SparseVector as features, and a double as label } } {code} {code:java} Dataset<RawData> dataset = //read from somewhere and create Dataset of RawData bean Dataset<FeaturesAndLabel> featuresAndLabels = dataset.transform(new MyTransformer()::transform) //use features and labels for machine learning {code} was (Author: anshbansal): Some pseudocode to show what I am trying to achieve {code:java} class MyTransformer implements Serializable { public FeaturesAndLabel transform(RawData rawData) { //Some logic which creates Features and Labels from raw data //FeaturesAndLabel is a bean which contains a SparseVector as features, and a double as label } } {code} {code:java} Dataset<RawData> dataset = //read from somewhere and create Dataset of RawData bean Dataset<FeaturesAndLabel> featuresAndLabels = dataset.transform(new MyTransformer()::transform) //use features and labels for machine learning {code} > Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
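For concreteness, one way to realize the pseudocode above today is to make the result bean fully bean-compliant by hand, using a double[] where the non-compliant Vector would go, and to run the row-wise step through Dataset.map with an explicit bean encoder. This is a sketch under those assumptions, not a recommended Spark API; FeaturesAndLabel, RawData, and MyTransformer are the hypothetical classes from the comment above.

{code:java}
import java.io.Serializable;

// Bean-compliant stand-in: double[] replaces ml.linalg.Vector,
// which Encoders.bean cannot currently handle.
public class FeaturesAndLabel implements Serializable {
    private double[] features;
    private double label;

    public FeaturesAndLabel() {} // Encoders.bean needs a no-arg constructor

    public double[] getFeatures() { return features; }
    public void setFeatures(double[] features) { this.features = features; }
    public double getLabel() { return label; }
    public void setLabel(double label) { this.label = label; }
}
{code}

The row-wise step would then go through map rather than transform, since map is the per-record operation that takes an encoder:

{code:java}
// RawData and MyTransformer are the hypothetical beans from the thread.
Dataset<FeaturesAndLabel> featuresAndLabels = dataset.map(
        (MapFunction<RawData, FeaturesAndLabel>) new MyTransformer()::transform,
        Encoders.bean(FeaturesAndLabel.class));
{code}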
[jira] [Commented] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094607#comment-16094607 ] Aseem Bansal commented on SPARK-21483: -- Some pseudocode to show what I am trying to achieve {code:java} class MyTransformer implements Serializable { public FeaturesAndLabel transform(RawData rawData) { //Some logic which creates Features and Labels from raw data //FeaturesAndLabel is a bean which contains a SparseVector as features, and a double as label } } {code} {code:java} Dataset<RawData> dataset = //read from somewhere and create Dataset of RawData bean Dataset<FeaturesAndLabel> featuresAndLabels = dataset.transform(new MyTransformer()::transform) //use features and labels for machine learning {code} > Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094314#comment-16094314 ] Aseem Bansal edited comment on SPARK-21483 at 7/20/17 9:11 AM: --- No, it does not. Can you give a link to what you are referring to? And I am not using Spark SQL. I am using Dataset's transformations only. was (Author: anshbansal): Now it does not. Can you give a link to what you are referring to? And I am not using Spark SQL. I am using Dataset's transformations only. > Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21482) Make LabeledPoint bean-compliant so it can be used in Encoders.bean(LabeledPoint.class)
[ https://issues.apache.org/jira/browse/SPARK-21482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094315#comment-16094315 ] Aseem Bansal commented on SPARK-21482: -- There is a LabeledPoint in the new ml API too: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.feature.LabeledPoint I am able to work around this by using my own class. But I thought the ML package was supposed to be used with the Dataset API. That's why I am saying it should support this. > Make LabeledPoint bean-compliant so it can be used in > Encoders.bean(LabeledPoint.class) > --- > > Key: SPARK-21482 > URL: https://issues.apache.org/jira/browse/SPARK-21482 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The LabeledPoint class is currently not bean-compliant as per Spark > https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.feature.LabeledPoint > This makes it impossible to create a LabeledPoint via a dataset.transform. It > should be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
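For reference, ml.feature.LabeledPoint is a Scala case class (label, features) with no no-arg constructor and no getter/setter pairs, which is why Encoders.bean rejects it. A minimal sketch of the kind of hand-rolled workaround class mentioned above, assuming a double[] stands in for the Vector features:

{code:java}
import java.io.Serializable;

// Hypothetical bean-compliant substitute for ml.feature.LabeledPoint;
// the class name and the double[] features field are assumptions.
public class BeanLabeledPoint implements Serializable {
    private double label;
    private double[] features;

    public BeanLabeledPoint() {} // required: Encoders.bean needs a no-arg constructor

    public double getLabel() { return label; }
    public void setLabel(double label) { this.label = label; }
    public double[] getFeatures() { return features; }
    public void setFeatures(double[] features) { this.features = features; }
}
// Encoders.bean(BeanLabeledPoint.class) then works where
// Encoders.bean(LabeledPoint.class) throws.
{code}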
[jira] [Commented] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094314#comment-16094314 ] Aseem Bansal commented on SPARK-21483: -- Now it does not. Can you give a link to what you are referring to? And I am not using Spark SQL. I am using Dataset's transformations only. > Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
[ https://issues.apache.org/jira/browse/SPARK-21483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094297#comment-16094297 ] Aseem Bansal commented on SPARK-21483: -- How would you encode it otherwise? > Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in > Encoders.bean(Vector.class) > -- > > Key: SPARK-21483 > URL: https://issues.apache.org/jira/browse/SPARK-21483 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant > as per Spark. > This makes it impossible to create a Vector via a dataset.transform. It should > be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
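One answer that exists today, at the cost of an opaque binary column, is a Kryo-backed encoder instead of a bean encoder. A minimal sketch, assuming rows is an existing Dataset<Row> whose first column holds a double feature value:

{code:java}
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

class KryoVectorExample {
    // Kryo serializes each Vector into a single binary column: fine for
    // moving vectors through map(...), but the column is opaque to SQL.
    static Dataset<Vector> toVectors(Dataset<Row> rows) {
        Encoder<Vector> vectorEncoder = Encoders.kryo(Vector.class);
        return rows.map(
                (MapFunction<Row, Vector>) row -> Vectors.dense(row.getDouble(0)),
                vectorEncoder);
    }
}
{code}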
[jira] [Commented] (SPARK-21482) Make LabeledPoint bean-compliant so it can be used in Encoders.bean(LabeledPoint.class)
[ https://issues.apache.org/jira/browse/SPARK-21482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094295#comment-16094295 ] Aseem Bansal commented on SPARK-21482: -- I am using the Java API. I tried a simple transformation with {noformat} dataset.transform(MyCustomToLabeledPointTransformer::transformer, Encoders.bean(LabeledPoint.class)) {noformat} and it threw a bean-compliance exception. I am not sure whether the encoders are supposed to act on beans or not, but clearly they are acting on beans. > Make LabeledPoint bean-compliant so it can be used in > Encoders.bean(LabeledPoint.class) > --- > > Key: SPARK-21482 > URL: https://issues.apache.org/jira/browse/SPARK-21482 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The LabeledPoint class is currently not bean-compliant as per Spark > https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.feature.LabeledPoint > This makes it impossible to create a LabeledPoint via a dataset.transform. It > should be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21483) Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class)
Aseem Bansal created SPARK-21483: Summary: Make org.apache.spark.ml.linalg.Vector bean-compliant so it can be used in Encoders.bean(Vector.class) Key: SPARK-21483 URL: https://issues.apache.org/jira/browse/SPARK-21483 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.1.0 Reporter: Aseem Bansal The class org.apache.spark.ml.linalg.Vector is currently not bean-compliant as per Spark. This makes it impossible to create a Vector via a dataset.transform. It should be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21482) Make LabeledPoint bean-compliant so it can be used in Encoders.bean(LabeledPoint.class)
Aseem Bansal created SPARK-21482: Summary: Make LabeledPoint bean-compliant so it can be used in Encoders.bean(LabeledPoint.class) Key: SPARK-21482 URL: https://issues.apache.org/jira/browse/SPARK-21482 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.1.0 Reporter: Aseem Bansal The LabeledPoint class is currently not bean-compliant as per Spark https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.feature.LabeledPoint This makes it impossible to create a LabeledPoint via a dataset.transform. It should be made bean-compliant so it can be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
Aseem Bansal created SPARK-21481: Summary: Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF Key: SPARK-21481 URL: https://issues.apache.org/jira/browse/SPARK-21481 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0, 2.1.0 Reporter: Aseem Bansal If we want to find the index of any input based on the hashing trick, it is possible in https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF but not in https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF. This should be allowed for feature parity. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
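For comparison, the mllib side referenced above does expose the lookup. A minimal sketch of what indexOf gives you there ("some term" is a placeholder input):

{code:java}
import org.apache.spark.mllib.feature.HashingTF;

HashingTF tf = new HashingTF();          // default: 2^20 feature buckets
int index = tf.indexOf("some term");     // hash bucket the term maps to
// The ml.feature.HashingTF transformer has no equivalent method, so there
// is no way to ask which column of the output vector a given input landed in.
{code}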
[jira] [Updated] (SPARK-21473) Running Transform on a bean which has only setters gives NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-21473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-21473: - Description: If I run the following using the Java API {code:java} dataset.map(Transformer::transform, Encoders.bean(BeanWithOnlySettersAndNoGetters.class)); {code} Then I get the below exception. I understand that it is not bean-compliant without the getters but the exception is wrong. Perhaps fixing the exception message would be a solution? {noformat} Caused by: java.lang.NullPointerException at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) at org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(JavaTypeInference.scala:55) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:89) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142) at org.apache.spark.sql.Encoders.bean(Encoders.scala) {noformat} was: If I run the following {code:java} dataset.map(Transformer::transform, Encoders.bean(BeanWithOnlySettersAndNoGetters.class)); {code} Then I get the below exception. I understand that it is not bean-compliant without the getters but the exception is wrong. Perhaps fixing the exception message would be a solution? 
{noformat} Caused by: java.lang.NullPointerException at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) at org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(JavaTypeInference.scala:55) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:89) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142) at org.apache.spark.sql.Encoders.bean(Encoders.scala) {noformat} > Running Transform on a bean which has only setters gives NullPointerException > - > > Key: SPARK-21473 > URL: https://issues.apache.org/jira/browse/SPARK-21473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > If I run the following using the Java API > {code:java} > dataset.map(Transformer::transform, > Encoders.bean(BeanWithOnlySettersAndNoGetters.class)); > {code} > Then I get the below exception. I understand that it is not bean-compliant > without the getters but the exception is wrong. Perhaps fixing the exception > message would be a solution? > {noformat} > Caused by: java.lang.NullPointerException > at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(I
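For illustration, a minimal sketch of the failing bean shape and the fix on the caller's side; the field name is hypothetical. The stack trace above points at JavaTypeInference calling TypeToken.method on a property with no read method, so adding the matching getter makes the schema inferable:

{code:java}
import java.io.Serializable;

public class BeanWithOnlySettersAndNoGetters implements Serializable {
    private String value; // hypothetical field

    public void setValue(String value) { this.value = value; }

    // Without this getter the property has no read method, which matches
    // the NullPointerException above; with it, Encoders.bean(...) can
    // infer the schema for this class.
    public String getValue() { return value; }
}
{code}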
[jira] [Created] (SPARK-21473) Running Transform on a bean which has only setters gives NullPointerException
Aseem Bansal created SPARK-21473: Summary: Running Transform on a bean which has only setters gives NullPointerException Key: SPARK-21473 URL: https://issues.apache.org/jira/browse/SPARK-21473 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Aseem Bansal If I run the following {code:java} dataset.map(Transformer::transform, Encoders.bean(BeanWithOnlySettersAndNoGetters.class)); {code} Then I get the below exception. I understand that it is not bean-compliant without the getters, but the exception is wrong. Perhaps fixing the exception message would be a solution? {noformat} Caused by: java.lang.NullPointerException at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) at org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) at org.apache.spark.sql.catalyst.JavaTypeInference$.inferDataType(JavaTypeInference.scala:55) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:89) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142) at org.apache.spark.sql.Encoders.bean(Encoders.scala) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17742) Spark Launcher does not get failed state in Listener
[ https://issues.apache.org/jira/browse/SPARK-17742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954714#comment-15954714 ] Aseem Bansal commented on SPARK-17742: -- [~daanvdn] We ended up using Kafka messages to tell the web app (which was using the launcher to launch the job) whether the job had completed or failed. We dropped the Launcher's states, as they are broken. > Spark Launcher does not get failed state in Listener > - > > Key: SPARK-17742 > URL: https://issues.apache.org/jira/browse/SPARK-17742 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I tried to launch an application using the below code. This is dummy code to > reproduce the problem. I tried exiting Spark with status -1, throwing an > exception, etc., but in no case did the listener give me a failed status. But if > a Spark job returns -1 or throws an exception from the main method, it should > be considered a failure. > {code} > package com.example; > import org.apache.spark.launcher.SparkAppHandle; > import org.apache.spark.launcher.SparkLauncher; > import java.io.IOException; > public class Main2 { > public static void main(String[] args) throws IOException, > InterruptedException { > SparkLauncher launcher = new SparkLauncher() > .setSparkHome("/opt/spark2") > > .setAppResource("/home/aseem/projects/testsparkjob/build/libs/testsparkjob-1.0-SNAPSHOT.jar") > .setMainClass("com.example.Main") > .setMaster("local[2]"); > launcher.startApplication(new MyListener()); > Thread.sleep(1000 * 60); > } > } > class MyListener implements SparkAppHandle.Listener { > @Override > public void stateChanged(SparkAppHandle handle) { > System.out.println("state changed " + handle.getState()); > } > @Override > public void infoChanged(SparkAppHandle handle) { > System.out.println("info changed " + handle.getState()); > } > } > {code} > The Spark job is > {code} > package com.example; > import org.apache.spark.sql.SparkSession; > import java.io.IOException; > public class Main { > public static void main(String[] args) throws IOException { > SparkSession sparkSession = SparkSession > .builder() > .appName("" + System.currentTimeMillis()) > .getOrCreate(); > try { > for (int i = 0; i < 15; i++) { > Thread.sleep(1000); > System.out.println("sleeping 1"); > } > } catch (InterruptedException e) { > e.printStackTrace(); > } > //sparkSession.stop(); > System.exit(-1); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
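A hedged sketch of the workaround described in the comment above: the job itself reports its terminal state over Kafka instead of relying on the launcher's listener. The topic name, bootstrap servers, appId, and runJob are all placeholders, and the kafka-clients dependency is assumed:

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class JobStatusReporter {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        String appId = args[0]; // placeholder job identifier
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            try {
                runJob(); // the actual Spark work goes here
                producer.send(new ProducerRecord<>("job-status", appId, "COMPLETED"));
            } catch (Exception e) {
                producer.send(new ProducerRecord<>("job-status", appId, "FAILED"));
                throw e;
            }
        }
    }

    private static void runJob() { /* placeholder for the real job */ }
}
{code}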
[jira] [Commented] (SPARK-10413) ML models should support prediction on single instances
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857888#comment-15857888 ] Aseem Bansal commented on SPARK-10413: -- Something to look at would be https://github.com/combust/mleap which provides this on top of spark > ML models should support prediction on single instances > --- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15853629#comment-15853629 ] Aseem Bansal commented on SPARK-19449: -- [~sowen] My results are actually deterministic. No matter how many times I run it the number of true positives, true negatives, false positives, false negatives are always exactly the same. The problem is that they are always inconsistent too by exactly the same amount in the 2 implementations. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. Can run this as a simple > Java app as long as you have spark dependencies set up properly. > {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** 
DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} > }, labels, features) + "\n" + > "\n" + > "** RandomForestClassifier\n" + > "** Original **" + getInfo(model2, testData) + "\n" + > "** New **" + getInfo(new Predictor() {double > predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, > features) + "\n" + > "\n" + > ""); > } > static Dataset getDummyData(SparkSession spark, int numberRows, int > numberFeatures, int labelUpperBound) { > StructType schema = new StructType(new StructField[]{ > new StructField("label",
[jira] [Commented] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851588#comment-15851588 ] Aseem Bansal commented on SPARK-19444: -- https://github.com/apache/spark/pull/16789 > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851576#comment-15851576 ] Aseem Bansal commented on SPARK-19449: -- Doesn't the decision tree debug string print it as a series of IF-ELSE statements? I printed the debug string for the 2 random forest models and it was exactly the same. In other words, the 2 implementations should be mathematically equivalent. The random processes for selecting data should not cause any issues, as I ensured that the exact same data goes to both versions. It works for decision trees, and a random forest classifier is just a majority vote over a bunch of decision tree classifiers, so I cannot see how that could be different. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. Can run this as a simple > Java app as long as you have spark dependencies set up properly. > {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > convertDecisionTreeModel(model1, Algo.Classification()); >
RandomForestClassifier classifier = new RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);} > }, labels, features) + "\n" + > "\n" + > "** RandomForestClassifier\n" + > "** Original **" + getInfo(model2, testData) + "\n" + > "** New **" + getInfo(new Predictor() {double > predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, > features) + "\n" + > "\n" + >
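The majority-vote argument in the comment above can be made concrete. For classification, mllib's RandomForestModel takes a plain majority vote over its member trees, roughly the following sketch against the public mllib API:

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;

// Sketch of the majority vote the comment attributes to a random forest
// classifier; this is what mllib's RandomForestModel does for classification.
class MajorityVote {
    static double predict(DecisionTreeModel[] trees, Vector features) {
        Map<Double, Integer> votes = new HashMap<>();
        for (DecisionTreeModel tree : trees) {
            votes.merge(tree.predict(features), 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }
}
{code}

The ml package's RandomForestClassificationModel, by contrast, averages per-class probabilities across trees (soft voting), which is one plausible source of a consistent discrepancy even when the individual trees are identical.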
[jira] [Commented] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851568#comment-15851568 ] Aseem Bansal commented on SPARK-19449: -- [~srowen] I removed some extra code. The part where I did the conversion is at the end, in the convertRandomForestModel method. Basically the above code does this: - Prepare 1000 rows of data with 3 features randomly, and prepare 1000 labels randomly. I am not working on creating the model but on the conversion, so having random data is not an issue; it will just be a horrible model. - Split the data in an 80/20 ratio for training/test. - Train the ml versions of the decision tree model and random forest model using the training set. Let's call them DT1 and RF1. - Convert these to the mllib versions of the models. Let's call them DT2 and RF2. - Use the test set to predict labels using DT1, DT2, RF1, RF2. - Compare the predicted labels of DT1 with DT2. Same results. - Compare the predicted labels of RF1 with RF2. Different results. There should not be any random results here, as I have used seeds for the random number generators everywhere and then used the exact same data for doing predictions using all 4 models. > Inconsistent results between ml package RandomForestClassificationModel and > mllib package RandomForestModel > --- > > Key: SPARK-19449 > URL: https://issues.apache.org/jira/browse/SPARK-19449 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Aseem Bansal > > I worked on some code to convert ml package RandomForestClassificationModel > to mllib package RandomForestModel. It was needed because we need to make > predictions on the order of ms. I found that the results are inconsistent > although the underlying DecisionTreeModel are exactly the same. So the > behavior between the 2 implementations is inconsistent which should not be > the case. > The below code can be used to reproduce the issue. Can run this as a simple > Java app as long as you have spark dependencies set up properly.
> {noformat} > import org.apache.spark.ml.Transformer; > import org.apache.spark.ml.classification.*; > import org.apache.spark.ml.linalg.*; > import org.apache.spark.ml.regression.RandomForestRegressionModel; > import org.apache.spark.mllib.linalg.DenseVector; > import org.apache.spark.mllib.linalg.Vector; > import org.apache.spark.mllib.tree.configuration.Algo; > import org.apache.spark.mllib.tree.model.DecisionTreeModel; > import org.apache.spark.mllib.tree.model.RandomForestModel; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import scala.Enumeration; > import java.util.ArrayList; > import java.util.List; > import java.util.Random; > abstract class Predictor { > abstract double predict(Vector vector); > } > public class MainConvertModels { > public static final int seed = 42; > public static void main(String[] args) { > int numRows = 1000; > int numFeatures = 3; > int numClasses = 2; > double trainFraction = 0.8; > double testFraction = 0.2; > SparkSession spark = SparkSession.builder() > .appName("conversion app") > .master("local") > .getOrCreate(); > Dataset data = getDummyData(spark, numRows, numFeatures, > numClasses); > Dataset[] splits = data.randomSplit(new double[]{trainFraction, > testFraction}, seed); > Dataset trainingData = splits[0]; > Dataset testData = splits[1]; > testData.cache(); > List labels = getLabels(testData); > List features = getFeatures(testData); > DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); > DecisionTreeClassificationModel model1 = > classifier1.fit(trainingData); > final DecisionTreeModel convertedModel1 = > convertDecisionTreeModel(model1, Algo.Classification()); > RandomForestClassifier classifier = new RandomForestClassifier(); > RandomForestClassificationModel model2 = classifier.fit(trainingData); > final RandomForestModel convertedModel2 = > convertRandomForestModel(model2); > System.out.println( > "** DecisionTreeClassifier\n" + > "** Original **" + getInfo(model1, testData) + "\n" + > "** New **" + getInfo(new Predictor() { > double predict(Vector vector) {return > convertedModel1.predict(vector);}
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. So the behavior between the 2 implementations is inconsistent which should not be the case. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + ""); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, 
Metadata.empty()) }); double[][] vectors = prepareData(numberRows, numberFeatures); Random random = new Random(seed); List dataTest = new ArrayList<>(); for (double[] vector : vectors) { double label = (double) random.nextInt(2); dataTest.add(RowFactory.create(label, Vectors.dense(vector))); } return spark.createDataFrame(dataTest, schema); } static double[][] prepareData(int numRows, int numFeatures) { Random random = new Random(seed); double[][] result = new double[numRows][numFeatures]; for (int row = 0; row < numRows; row++) { for (int feature = 0; feature < numFeatures; feature++) { result[row][feature] = random.nextDouble(); } } return result; } static S
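The conversion code itself is cut off above. A hedged reconstruction of its likely shape, reusing the report's own convertDecisionTreeModel helper (whose body is also truncated, stubbed out here) and mllib's public RandomForestModel constructor:

{code:java}
import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.mllib.tree.configuration.Algo;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import scala.Enumeration;

class ModelConversion {
    // Sketch only: shows how converted mllib trees would be wrapped into
    // an mllib RandomForestModel, matching the calls visible in the report.
    static RandomForestModel convertRandomForestModel(RandomForestClassificationModel model) {
        DecisionTreeModel[] trees = new DecisionTreeModel[model.trees().length];
        for (int i = 0; i < trees.length; i++) {
            trees[i] = convertDecisionTreeModel(
                    (DecisionTreeClassificationModel) model.trees()[i],
                    Algo.Classification());
        }
        return new RandomForestModel(Algo.Classification(), trees);
    }

    static DecisionTreeModel convertDecisionTreeModel(
            DecisionTreeClassificationModel model, Enumeration.Value algo) {
        // The report's helper; its body (rebuilding the mllib node tree
        // from the ml model's root node) is truncated above.
        throw new UnsupportedOperationException("see the full report");
    }
}
{code}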
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); //Dataset data = getData(spark, "libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + "** LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + 
"\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()) }); double[][] vectors = prepareData(numberRows, numberFeatures);
[jira] [Updated] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
[ https://issues.apache.org/jira/browse/SPARK-19449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19449: - Description: I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. So the behavior between the 2 implementations is inconsistent which should not be the case. The below code can be used to reproduce the issue. Can run this as a simple Java app as long as you have spark dependencies set up properly. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); //Dataset data = getData(spark, "libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + 
"\n" + "\n" + "** LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.em
[jira] [Created] (SPARK-19449) Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel
Aseem Bansal created SPARK-19449: Summary: Inconsistent results between ml package RandomForestClassificationModel and mllib package RandomForestModel Key: SPARK-19449 URL: https://issues.apache.org/jira/browse/SPARK-19449 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.1.0 Reporter: Aseem Bansal I worked on some code to convert ml package RandomForestClassificationModel to mllib package RandomForestModel. It was needed because we need to make predictions on the order of ms. I found that the results are inconsistent although the underlying DecisionTreeModel are exactly the same. The below code can be used to reproduce the issue. {noformat} import org.apache.spark.ml.Transformer; import org.apache.spark.ml.classification.*; import org.apache.spark.ml.linalg.*; import org.apache.spark.ml.regression.RandomForestRegressionModel; import org.apache.spark.mllib.linalg.DenseVector; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.tree.configuration.Algo; import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.tree.model.RandomForestModel; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.SparkSession; import org.apache.spark.sql.types.DataTypes; import org.apache.spark.sql.types.Metadata; import org.apache.spark.sql.types.StructField; import org.apache.spark.sql.types.StructType; import scala.Enumeration; import java.util.ArrayList; import java.util.List; import java.util.Random; abstract class Predictor { abstract double predict(Vector vector); } public class MainConvertModels { public static final int seed = 42; public static void main(String[] args) { int numRows = 1000; int numFeatures = 3; int numClasses = 2; double trainFraction = 0.8; double testFraction = 0.2; SparkSession spark = SparkSession.builder() .appName("conversion app") .master("local") .getOrCreate(); //Dataset data = getData(spark, "libsvm", "/opt/spark2/data/mllib/sample_libsvm_data.txt"); Dataset data = getDummyData(spark, numRows, numFeatures, numClasses); Dataset[] splits = data.randomSplit(new double[]{trainFraction, testFraction}, seed); Dataset trainingData = splits[0]; Dataset testData = splits[1]; testData.cache(); List labels = getLabels(testData); List features = getFeatures(testData); DecisionTreeClassifier classifier1 = new DecisionTreeClassifier(); DecisionTreeClassificationModel model1 = classifier1.fit(trainingData); final DecisionTreeModel convertedModel1 = convertDecisionTreeModel(model1, Algo.Classification()); RandomForestClassifier classifier = new RandomForestClassifier(); RandomForestClassificationModel model2 = classifier.fit(trainingData); final RandomForestModel convertedModel2 = convertRandomForestModel(model2); LogisticRegression lr = new LogisticRegression(); LogisticRegressionModel model3 = lr.fit(trainingData); final org.apache.spark.mllib.classification.LogisticRegressionModel convertedModel3 = convertLogisticRegressionModel(model3); System.out.println( "** DecisionTreeClassifier\n" + "** Original **" + getInfo(model1, testData) + "\n" + "** New **" + getInfo(new Predictor() { double predict(Vector vector) {return convertedModel1.predict(vector);} }, labels, features) + "\n" + "\n" + "** RandomForestClassifier\n" + "** Original **" + getInfo(model2, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) {return convertedModel2.predict(vector);}}, labels, features) + "\n" + "\n" + "** 
LogisticRegression\n" + "** Original **" + getInfo(model3, testData) + "\n" + "** New **" + getInfo(new Predictor() {double predict(Vector vector) { return convertedModel3.predict(vector);}}, labels, features) + "\n" + ""); } static Dataset getData(SparkSession spark, String format, String location) { return spark.read() .format(format) .load(location); } static Dataset getDummyData(SparkSession spark, int numberRows, int numberFeatures, int labelUpperBound) { StructType schema = new StructType(new StructField[]{ new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), n
[jira] [Created] (SPARK-19445) Please remove tylerchap...@yahoo-inc.com subscription from u...@spark.apache.org
Aseem Bansal created SPARK-19445: Summary: Please remove tylerchap...@yahoo-inc.com subscription from u...@spark.apache.org Key: SPARK-19445 URL: https://issues.apache.org/jira/browse/SPARK-19445 Project: Spark Issue Type: IT Help Components: Project Infra Affects Versions: 2.1.0 Reporter: Aseem Bansal Whenever a mail is sent to u...@spark.apache.org, I receive this email: {noformat} This is an automatically generated message. tylerchap...@yahoo-inc.com is no longer with Yahoo! Inc. Your message will not be forwarded. If you have a sales inquiry, please email yahoosa...@yahoo-inc.com and someone will follow up with you shortly. If you require assistance with a legal matter, please send a message to legal-noti...@yahoo-inc.com Thank you! {noformat} It is clear that this user is no longer available. Please remove this email address from the mailing list so we don't get so much spam. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851382#comment-15851382 ] Aseem Bansal commented on SPARK-19444: -- [~srowen] I can find the location at https://github.com/apache/spark/blob/master/docs/ml-features.md which led me to https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java#L40 but what is $example on:untyped_ops$? The imports are there, but it seems this is broken. Then this is probably a parsing issue? > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import: > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
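For reference, the missing piece named in the report is just the static import below; with it, helpers such as col(...) from the docs snippet resolve:

{code:java}
// The one line the published Tokenizer example needs to compile:
import static org.apache.spark.sql.functions.*;
{code}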
[jira] [Updated] (SPARK-19444) Tokenizer example does not compile without extra imports
[ https://issues.apache.org/jira/browse/SPARK-19444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-19444: - Priority: Minor (was: Major) > Tokenizer example does not compile without extra imports > > > Key: SPARK-19444 > URL: https://issues.apache.org/jira/browse/SPARK-19444 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Aseem Bansal >Priority: Minor > > The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer > does not compile without the following static import > import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19444) Tokenizer example does not compile without extra imports
Aseem Bansal created SPARK-19444: Summary: Tokenizer example does not compile without extra imports Key: SPARK-19444 URL: https://issues.apache.org/jira/browse/SPARK-19444 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.1.0 Reporter: Aseem Bansal The example at http://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer does not compile without the following static import import static org.apache.spark.sql.functions.*; -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
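For reference, a minimal self-contained sketch of the Tokenizer example with the missing static import added (the data and column names here are illustrative, not copied from the official example):
{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// This is the static import the published example needs to compile:
import static org.apache.spark.sql.functions.*;

public class TokenizerImportSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("tokenizer sketch").master("local").getOrCreate();

        StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
        });
        List<Row> data = Arrays.asList(RowFactory.create(0, "Hi I heard about Spark"));
        Dataset<Row> sentences = spark.createDataFrame(data, schema);

        Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
        Dataset<Row> tokenized = tokenizer.transform(sentences);

        // col(...) resolves only via the static import above; without it the
        // example fails to compile, which is what this ticket reports.
        tokenized.select(col("sentence"), col("words")).show(false);
    }
}
{code}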
[jira] (SPARK-19410) Links to API documentation are broken
Aseem Bansal created an issue: Spark / SPARK-19410 Links to API documentation are broken Issue Type: Documentation Affects Versions: 2.1.0 Assignee: Unassigned Components: Documentation Created: 31/Jan/17 08:55 Priority: Major Reporter: Aseem Bansal I was looking at https://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param and saw that the links to API documentation are broken
[jira] [Comment Edited] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734495#comment-15734495 ] Aseem Bansal edited comment on SPARK-10413 at 12/9/16 6:39 AM: --- Hi Is anyone working on this? And is there a JIRA ticket for having a predict method on PipelineModel? was (Author: anshbansal): Hi Is anyone working on this? > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10413) Model should support prediction on single instance
[ https://issues.apache.org/jira/browse/SPARK-10413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15734495#comment-15734495 ] Aseem Bansal commented on SPARK-10413: -- Hi Is anyone working on this? > Model should support prediction on single instance > -- > > Key: SPARK-10413 > URL: https://issues.apache.org/jira/browse/SPARK-10413 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > Currently models in the pipeline API only implement transform(DataFrame). It > would be quite useful to support prediction on single instance. > UPDATE: This issue is for making predictions with single models. We can make > methods like {{def predict(features: Vector): Double}} public. > * This issue is *not* for single-instance prediction for full Pipelines, > which would require making predictions on {{Row}}s. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
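Until a predict method is public, the usual workaround is to wrap the single instance in a one-row DataFrame and call transform. A minimal sketch (a fragment, not the proposed API; spark and model are assumed to be an existing session and an already-fitted ml model):
{code:java}
import java.util.Collections;

import org.apache.spark.ml.linalg.SQLDataTypes;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Build a one-row DataFrame around the feature vector...
StructType schema = new StructType(new StructField[]{
        new StructField("features", SQLDataTypes.VectorType(), false, Metadata.empty())
});
Row row = RowFactory.create(Vectors.dense(1.0, 2.0, 3.0));
Dataset<Row> single = spark.createDataFrame(Collections.singletonList(row), schema);

// ...and run it through the fitted model. The overhead of this round trip
// is exactly why a public predict(Vector) would be useful.
double prediction = model.transform(single).first().<Double>getAs("prediction");
{code}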
[jira] [Commented] (SPARK-18241) If Spark Launcher fails to startApplication then handle's state does not change
[ https://issues.apache.org/jira/browse/SPARK-18241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631898#comment-15631898 ] Aseem Bansal commented on SPARK-18241: -- Looking at the source code after mainClass = Utils.classForName(childMainClass) at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L695 I see that the exceptions are being printed instead of being thrown/sent to the listeners. The API says that startApplication is preferred but the various failures need to be sent via the handlers otherwise the listener API is not useful. Another case where failures are not sent via the Launcher API https://issues.apache.org/jira/browse/SPARK-17742 > If Spark Launcher fails to startApplication then handle's state does not > change > --- > > Key: SPARK-18241 > URL: https://issues.apache.org/jira/browse/SPARK-18241 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I am using Spark 2.0.0. I am using > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/launcher/SparkLauncher.html > to submit my job. > If there is a failure after launcher's startapplication has been called but > before the spark job has actually started (i.e. in starting the spark process > that submits the job itself) there is > * no exception in the main thread that is submitting the job > * no exception in the job as it has not started > * no state change of the launcher > * the exception is logged in the error stream on the default logger name that > spark produces using the Job's main class. > Basically, it is not possible to catch an exception if it happens during that > time. The easiest way to reproduce it is to delete the JAR file or use an > invalid spark home while launching the job using sparkLauncher. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18241) If Spark Launcher fails to startApplication then handle's state does not change
Aseem Bansal created SPARK-18241: Summary: If Spark Launcher fails to startApplication then handle's state does not change Key: SPARK-18241 URL: https://issues.apache.org/jira/browse/SPARK-18241 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.0.0 Reporter: Aseem Bansal I am using Spark 2.0.0. I am using https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/launcher/SparkLauncher.html to submit my job. If there is a failure after the launcher's startApplication has been called but before the spark job has actually started (i.e. in starting the spark process that submits the job itself) there is
* no exception in the main thread that is submitting the job
* no exception in the job as it has not started
* no state change of the launcher
* the exception is logged in the error stream on the default logger name that spark produces using the Job's main class.
Basically, it is not possible to catch an exception if it happens during that time. The easiest way to reproduce it is to delete the JAR file or use an invalid spark home while launching the job using SparkLauncher. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
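As a partial workaround until such failures are propagated, the handle can be polled with a deadline, treating a handle that never reaches a final state as failed. A minimal sketch (paths and timeout are illustrative; the calling method must tolerate InterruptedException):
{code:java}
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

SparkAppHandle handle = new SparkLauncher()
        .setAppResource("/path/to/app.jar")
        .setMainClass("com.example.Main")
        .setMaster("local[2]")
        .startApplication();

// If the launcher dies before the job starts, the state never becomes
// final, so give up after a deadline instead of waiting forever.
long deadline = System.currentTimeMillis() + 60_000;
while (!handle.getState().isFinal()) {
    if (System.currentTimeMillis() > deadline) {
        handle.kill();
        throw new IllegalStateException("No final state, last seen: " + handle.getState());
    }
    Thread.sleep(500);
}
{code}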
[jira] [Comment Edited] (SPARK-17742) Spark Launcher does not get failed state in Listener
[ https://issues.apache.org/jira/browse/SPARK-17742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15535315#comment-15535315 ] Aseem Bansal edited comment on SPARK-17742 at 9/30/16 7:35 AM: --- I dug into the launcher code to see if I can figure out how it is working and see if I could find the bug. But when I reached LauncherServer's ServerConnection's handle method and found that this is socket programming I found it harder to find where the messages are coming from. Still trying to figure out but maybe someone who knows spark code better will find it easier to find the bug. was (Author: anshbansal): I dug into the launcher code to see if I can figure out how it is working and see if I could find the bug. But when I reached LauncherServer's ServerConnection's handle method and found that this is socket programming I found it harder to find where the messages are coming from. Still trying to figure out maybe someone who knows spark code better will find it easier to find the bug. > Spark Launcher does not get failed state in Listener > - > > Key: SPARK-17742 > URL: https://issues.apache.org/jira/browse/SPARK-17742 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I tried to launch an application using the below code. This is dummy code to > reproduce the problem. I tried exiting spark with status -1, throwing an > exception etc. but in no case did the listener give me failed status. But if > a spark job returns -1 or throws an exception from the main method it should > be considered as a failure. > {code} > package com.example; > import org.apache.spark.launcher.SparkAppHandle; > import org.apache.spark.launcher.SparkLauncher; > import java.io.IOException; > public class Main2 { > public static void main(String[] args) throws IOException, > InterruptedException { > SparkLauncher launcher = new SparkLauncher() > .setSparkHome("/opt/spark2") > > .setAppResource("/home/aseem/projects/testsparkjob/build/libs/testsparkjob-1.0-SNAPSHOT.jar") > .setMainClass("com.example.Main") > .setMaster("local[2]"); > launcher.startApplication(new MyListener()); > Thread.sleep(1000 * 60); > } > } > class MyListener implements SparkAppHandle.Listener { > @Override > public void stateChanged(SparkAppHandle handle) { > System.out.println("state changed " + handle.getState()); > } > @Override > public void infoChanged(SparkAppHandle handle) { > System.out.println("info changed " + handle.getState()); > } > } > {code} > The spark job is > {code} > package com.example; > import org.apache.spark.sql.SparkSession; > import java.io.IOException; > public class Main { > public static void main(String[] args) throws IOException { > SparkSession sparkSession = SparkSession > .builder() > .appName("" + System.currentTimeMillis()) > .getOrCreate(); > try { > for (int i = 0; i < 15; i++) { > Thread.sleep(1000); > System.out.println("sleeping 1"); > } > } catch (InterruptedException e) { > e.printStackTrace(); > } > //sparkSession.stop(); > System.exit(-1); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17742) Spark Launcher does not get failed state in Listener
[ https://issues.apache.org/jira/browse/SPARK-17742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15535315#comment-15535315 ] Aseem Bansal commented on SPARK-17742: -- I dug into the launcher code to see if I can figure out how it is working and see if I could find the bug. But when I reached LauncherServer's ServerConnection's handle method and found that this is socket programming I found it harder to find where the messages are coming from. Still trying to figure out maybe someone who knows spark code better will find it easier to find the bug. > Spark Launcher does not get failed state in Listener > - > > Key: SPARK-17742 > URL: https://issues.apache.org/jira/browse/SPARK-17742 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I tried to launch an application using the below code. This is dummy code to > reproduce the problem. I tried exiting spark with status -1, throwing an > exception etc. but in no case did the listener give me failed status. But if > a spark job returns -1 or throws an exception from the main method it should > be considered as a failure. > {code} > package com.example; > import org.apache.spark.launcher.SparkAppHandle; > import org.apache.spark.launcher.SparkLauncher; > import java.io.IOException; > public class Main2 { > public static void main(String[] args) throws IOException, > InterruptedException { > SparkLauncher launcher = new SparkLauncher() > .setSparkHome("/opt/spark2") > > .setAppResource("/home/aseem/projects/testsparkjob/build/libs/testsparkjob-1.0-SNAPSHOT.jar") > .setMainClass("com.example.Main") > .setMaster("local[2]"); > launcher.startApplication(new MyListener()); > Thread.sleep(1000 * 60); > } > } > class MyListener implements SparkAppHandle.Listener { > @Override > public void stateChanged(SparkAppHandle handle) { > System.out.println("state changed " + handle.getState()); > } > @Override > public void infoChanged(SparkAppHandle handle) { > System.out.println("info changed " + handle.getState()); > } > } > {code} > The spark job is > {code} > package com.example; > import org.apache.spark.sql.SparkSession; > import java.io.IOException; > public class Main { > public static void main(String[] args) throws IOException { > SparkSession sparkSession = SparkSession > .builder() > .appName("" + System.currentTimeMillis()) > .getOrCreate(); > try { > for (int i = 0; i < 15; i++) { > Thread.sleep(1000); > System.out.println("sleeping 1"); > } > } catch (InterruptedException e) { > e.printStackTrace(); > } > //sparkSession.stop(); > System.exit(-1); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17742) Spark Launcher does not get failed state in Listener
Aseem Bansal created SPARK-17742: Summary: Spark Launcher does not get failed state in Listener Key: SPARK-17742 URL: https://issues.apache.org/jira/browse/SPARK-17742 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.0.0 Reporter: Aseem Bansal I tried to launch an application using the below code. This is dummy code to reproduce the problem. I tried exiting spark with status -1, throwing an exception etc. but in no case did the listener give me a failed status. But if a spark job returns -1 or throws an exception from the main method it should be considered a failure.
{code}
package com.example;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

import java.io.IOException;

public class Main2 {
    public static void main(String[] args) throws IOException, InterruptedException {
        SparkLauncher launcher = new SparkLauncher()
                .setSparkHome("/opt/spark2")
                .setAppResource("/home/aseem/projects/testsparkjob/build/libs/testsparkjob-1.0-SNAPSHOT.jar")
                .setMainClass("com.example.Main")
                .setMaster("local[2]");
        launcher.startApplication(new MyListener());
        Thread.sleep(1000 * 60);
    }
}

class MyListener implements SparkAppHandle.Listener {
    @Override
    public void stateChanged(SparkAppHandle handle) {
        System.out.println("state changed " + handle.getState());
    }

    @Override
    public void infoChanged(SparkAppHandle handle) {
        System.out.println("info changed " + handle.getState());
    }
}
{code}
The spark job is
{code}
package com.example;

import org.apache.spark.sql.SparkSession;

import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        SparkSession sparkSession = SparkSession
                .builder()
                .appName("" + System.currentTimeMillis())
                .getOrCreate();
        try {
            for (int i = 0; i < 15; i++) {
                Thread.sleep(1000);
                System.out.println("sleeping 1");
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        //sparkSession.stop();
        System.exit(-1);
    }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
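A sketch of a variant of the reproduction above that at least makes the outcome observable to the launching thread: block on a latch until the handle reports a final state. Note this does not fix the bug reported here; if FAILED is never delivered, the wait simply times out:
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class Main3 {
    public static void main(String[] args) throws Exception {
        final CountDownLatch done = new CountDownLatch(1);

        new SparkLauncher()
                .setSparkHome("/opt/spark2")
                .setAppResource("/path/to/testsparkjob-1.0-SNAPSHOT.jar")
                .setMainClass("com.example.Main")
                .setMaster("local[2]")
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle handle) {
                        System.out.println("state changed " + handle.getState());
                        if (handle.getState().isFinal()) {
                            done.countDown();
                        }
                    }

                    @Override
                    public void infoChanged(SparkAppHandle handle) {
                        System.out.println("info changed " + handle.getState());
                    }
                });

        // Replaces the fixed Thread.sleep(1000 * 60) with a bounded wait.
        if (!done.await(2, TimeUnit.MINUTES)) {
            System.out.println("no final state within timeout");
        }
    }
}
{code}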
[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only
[ https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495960#comment-15495960 ] Aseem Bansal commented on SPARK-17560: -- Can you share where this option needs to be set? Maybe I can try and add a pull request unless it is easier for you to just add a PR yourself instead of explaining. > SQLContext tables returns table names in lower case only > > > Key: SPARK-17560 > URL: https://issues.apache.org/jira/browse/SPARK-17560 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I registered a table using > dataSet.createOrReplaceTempView("TestTable"); > Then I tried to get the list of tables using > sparkSession.sqlContext().tableNames() > but the name that I got was testtable. It used to give table names in proper > case in Spark 1.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
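Assuming the option under discussion is {{spark.sql.caseSensitive}} (the thread never names it, so this is a guess), a sketch of the two places it can be set:
{code:java}
SparkSession spark = SparkSession.builder()
        .appName("my app")
        .master("local")
        // either when building the session...
        .config("spark.sql.caseSensitive", "true")
        .getOrCreate();

// ...or later, on the session's runtime configuration
spark.conf().set("spark.sql.caseSensitive", "true");
{code}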
[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only
[ https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495906#comment-15495906 ] Aseem Bansal commented on SPARK-17560: -- Looked through
https://spark.apache.org/docs/2.0.0/sql-programming-guide.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/SparkSession.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/SparkConf.html
and none of them say anything about this parameter. > SQLContext tables returns table names in lower case only > > > Key: SPARK-17560 > URL: https://issues.apache.org/jira/browse/SPARK-17560 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I registered a table using > dataSet.createOrReplaceTempView("TestTable"); > Then I tried to get the list of tables using > sparkSession.sqlContext().tableNames() > but the name that I got was testtable. It used to give table names in proper > case in Spark 1.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17561) DataFrameWriter documentation formatting problems
[ https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-17561: - Description: I visited this page https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html and saw that the docs have formatting problems !screenshot-1.png! was: I visited this page https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html and saw that the docs have formatting problems > DataFrameWriter documentation formatting problems > - > > Key: SPARK-17561 > URL: https://issues.apache.org/jira/browse/SPARK-17561 > Project: Spark > Issue Type: Documentation >Reporter: Aseem Bansal > Attachments: screenshot-1.png > > > I visited this page > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html > and saw that the docs have formatting problems > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17561) DataFrameWriter documentation formatting problems
[ https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-17561: - Description: I visited this page https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html and saw that the docs have formatting problems !screenshot-1.png! Tried with browser cache disabled. Same issue was: I visited this page https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html and saw that the docs have formatting problems !screenshot-1.png! > DataFrameWriter documentation formatting problems > - > > Key: SPARK-17561 > URL: https://issues.apache.org/jira/browse/SPARK-17561 > Project: Spark > Issue Type: Documentation >Reporter: Aseem Bansal > Attachments: screenshot-1.png > > > I visited this page > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html > and saw that the docs have formatting problems > !screenshot-1.png! > Tried with browser cache disabled. Same issue -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17561) DataFrameWriter documentation formatting problems
[ https://issues.apache.org/jira/browse/SPARK-17561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-17561: - Attachment: screenshot-1.png > DataFrameWriter documentation formatting problems > - > > Key: SPARK-17561 > URL: https://issues.apache.org/jira/browse/SPARK-17561 > Project: Spark > Issue Type: Documentation >Reporter: Aseem Bansal > Attachments: screenshot-1.png > > > I visited this page > https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html > and saw that the docs have formatting problems -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17561) DataFrameWriter documentation formatting problems
Aseem Bansal created SPARK-17561: Summary: DataFrameWriter documentation formatting problems Key: SPARK-17561 URL: https://issues.apache.org/jira/browse/SPARK-17561 Project: Spark Issue Type: Documentation Reporter: Aseem Bansal Attachments: screenshot-1.png I visited this page https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html and saw that the docs have formatting problems -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17560) SQLContext tables returns table names in lower case only
[ https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495862#comment-15495862 ] Aseem Bansal commented on SPARK-17560: -- No I did not. Where? > SQLContext tables returns table names in lower case only > > > Key: SPARK-17560 > URL: https://issues.apache.org/jira/browse/SPARK-17560 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I registered a table using > dataSet.createOrReplaceTempView("TestTable"); > Then I tried to get the list of tables using > sparkSession.sqlContext().tableNames() > but the name that I got was testtable. It used to give table names in proper > case in Spark 1.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17560) SQLContext tables returns table names in lower case only
[ https://issues.apache.org/jira/browse/SPARK-17560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15495862#comment-15495862 ] Aseem Bansal edited comment on SPARK-17560 at 9/16/16 9:38 AM: --- No I did not. Where? Had not set that in Spark 1.4 either was (Author: anshbansal): No I did not. Where? > SQLContext tables returns table names in lower case only > > > Key: SPARK-17560 > URL: https://issues.apache.org/jira/browse/SPARK-17560 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I registered a table using > dataSet.createOrReplaceTempView("TestTable"); > Then I tried to get the list of tables using > sparkSession.sqlContext().tableNames() > but the name that I got was testtable. It used to give table names in proper > case in Spark 1.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17560) SQLContext tables returns table names in lower case only
Aseem Bansal created SPARK-17560: Summary: SQLContext tables returns table names in lower case only Key: SPARK-17560 URL: https://issues.apache.org/jira/browse/SPARK-17560 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Aseem Bansal I registered a table using dataSet.createOrReplaceTempView("TestTable"); Then I tried to get the list of tables using sparkSession.sqlContext().tableNames() but the name that I got was testtable. It used to give table names in proper case in Spark 1.4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17307) Document what all access is needed on S3 bucket when trying to save a model
[ https://issues.apache.org/jira/browse/SPARK-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1540#comment-1540 ] Aseem Bansal commented on SPARK-17307: -- Not adding it there would be fine, but there needs to be something. Also, to contribute I tried searching for the file but could not find it. In which branch are you working? > Document what all access is needed on S3 bucket when trying to save a model > --- > > Key: SPARK-17307 > URL: https://issues.apache.org/jira/browse/SPARK-17307 > Project: Spark > Issue Type: Documentation >Reporter: Aseem Bansal >Priority: Minor > > I faced this lack of documentation when I was trying to save a model to S3. > Initially I thought it should be only write. Then I found it also needs > delete to delete temporary files. Now I requested access for delete and tried > again and I am get the error > Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: > org.jets3t.service.S3ServiceException: S3 PUT failed for > '/dev-qa_%24folder%24' XML Error Message > To reproduce this error the below can be used > {code} > SparkSession sparkSession = SparkSession > .builder() > .appName("my app") > .master("local") > .getOrCreate(); > JavaSparkContext jsc = new > JavaSparkContext(sparkSession.sparkContext()); > jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ); > jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", ACCESS KEY>); > //Create a Pipelinemode > > pipelineModel.write().overwrite().save("s3n:///dev-qa/modelTest"); > {code} > This back and forth could be avoided if it was clearly mentioned what all > access spark needs to write to S3. Also would be great if why all of the > access is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17307) Document what all access is needed on S3 bucket when trying to save a model
[ https://issues.apache.org/jira/browse/SPARK-17307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15454791#comment-15454791 ] Aseem Bansal commented on SPARK-17307: -- I would add that bit of information at http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/util/MLWritable.html#save(java.lang.String) Something like "it needs complete read/write access when used with S3" should be enough. > Document what all access is needed on S3 bucket when trying to save a model > --- > > Key: SPARK-17307 > URL: https://issues.apache.org/jira/browse/SPARK-17307 > Project: Spark > Issue Type: Documentation >Reporter: Aseem Bansal >Priority: Minor > > I faced this lack of documentation when I was trying to save a model to S3. > Initially I thought it should be only write. Then I found it also needs > delete to delete temporary files. Now I requested access for delete and tried > again and I am get the error > Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: > org.jets3t.service.S3ServiceException: S3 PUT failed for > '/dev-qa_%24folder%24' XML Error Message > To reproduce this error the below can be used > {code} > SparkSession sparkSession = SparkSession > .builder() > .appName("my app") > .master("local") > .getOrCreate(); > JavaSparkContext jsc = new > JavaSparkContext(sparkSession.sparkContext()); > jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ); > jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", ACCESS KEY>); > //Create a Pipelinemode > > pipelineModel.write().overwrite().save("s3n:///dev-qa/modelTest"); > {code} > This back and forth could be avoided if it was clearly mentioned what all > access spark needs to write to S3. Also would be great if why all of the > access is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17307) Document what all access is needed on S3 bucket when trying to save a model
Aseem Bansal created SPARK-17307: Summary: Document what all access is needed on S3 bucket when trying to save a model Key: SPARK-17307 URL: https://issues.apache.org/jira/browse/SPARK-17307 Project: Spark Issue Type: Documentation Reporter: Aseem Bansal I faced this lack of documentation when I was trying to save a model to S3. Initially I thought it should be only write. Then I found it also needs delete to delete temporary files. Now I requested access for delete and tried again and I got the error
Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for '/dev-qa_%24folder%24' XML Error Message
To reproduce this error the below can be used
{code}
SparkSession sparkSession = SparkSession
        .builder()
        .appName("my app")
        .master("local")
        .getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(sparkSession.sparkContext());
jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", );
jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", );
//Create a PipelineModel
pipelineModel.write().overwrite().save("s3n:///dev-qa/modelTest");
{code}
This back and forth could be avoided if it was clearly mentioned what all access spark needs to write to S3. It would also be great to explain why all of the access is needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17012) Reading data frames via CSV - Allow to specify default value for integers
[ https://issues.apache.org/jira/browse/SPARK-17012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-17012: - Description: Currently the option that we have in DataFrameReader is nullValue, which allows us one default. But say in our data frame we have strings and integers and we want to specify the default for strings and integers differently; that is currently not possible. If it is done for different data types then it should be possible to allow specifying the schema to be nullable false when inferring the schema (as a new option). was:Currently the option that we have in DataFrameReader is nullValue which allows us one default. But say in our data frame we have string and integers and we want to specify the default for strings and integers differently that is currently not possible. > Reading data frames via CSV - Allow to specify default value for integers > - > > Key: SPARK-17012 > URL: https://issues.apache.org/jira/browse/SPARK-17012 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > Currently the option that we have in DataFrameReader is nullValue, which > allows us one default. But say in our data frame we have strings and integers > and we want to specify the default for strings and integers differently; that > is currently not possible. > If it is done for different data types then it should be possible to allow > specifying the schema to be nullable false when inferring the schema (as a new > option). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17012) Reading data frames via CSV - Allow to specify default value for integers
Aseem Bansal created SPARK-17012: Summary: Reading data frames via CSV - Allow to specify default value for integers Key: SPARK-17012 URL: https://issues.apache.org/jira/browse/SPARK-17012 Project: Spark Issue Type: Improvement Affects Versions: 2.0.0 Reporter: Aseem Bansal Currently the option that we have in DataFrameReader is nullValue, which allows us one default. But say in our data frame we have strings and integers and we want to specify the default for strings and integers differently; that is currently not possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
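Until per-type defaults exist, one workaround is to read with the single nullValue and apply type-specific defaults afterwards via the na() functions. A minimal sketch (column names are illustrative; spark is an existing session):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read()
        .option("header", "true")
        .option("nullValue", "")      // the single shared null marker
        .option("inferSchema", "true")
        .csv("/path/to/data.csv");

// Apply a different default per column type after the read:
Dataset<Row> withDefaults = df
        .na().fill(0L, new String[]{"someIntColumn"})
        .na().fill("unknown", new String[]{"someStringColumn"});
{code}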
[jira] [Commented] (SPARK-16893) Spark CSV Provider option is not documented
[ https://issues.apache.org/jira/browse/SPARK-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409260#comment-15409260 ] Aseem Bansal commented on SPARK-16893: -- Yes. I would expect it to work without the use of the format function, as Spark's documentation does not say anything about needing to use format when using the csv function. > Spark CSV Provider option is not documented > --- > > Key: SPARK-16893 > URL: https://issues.apache.org/jira/browse/SPARK-16893 > Project: Spark > Issue Type: Documentation >Affects Versions: 2.0.0 >Reporter: Aseem Bansal >Priority: Minor > > I was working with databricks spark csv library and came across an error. I > have logged the issue in their github but it would be good to document that > in Apache Spark's documentation also > I faced it with CSV. Someone else faced that with JSON > http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file > Complete Issue details here > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16893) Spark CSV Provider option is not documented
[ https://issues.apache.org/jira/browse/SPARK-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409183#comment-15409183 ] Aseem Bansal commented on SPARK-16893: -- Reading a CSV causes an exception. Code used and exception are below. Also present in the github issue that I have referenced here.
{code}
public static void main(String[] args) {
    SparkSession spark = SparkSession
            .builder()
            .appName("my app")
            .getOrCreate();

    Dataset df = spark.read()
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .option("nullValue", "")
            .csv("/home/aseem/data.csv");
    df.show();
}
{code}
bq. Exception in thread "main" java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
People need to use format("csv"). I think that is counterintuitive seeing that I am using the CSV method. > Spark CSV Provider option is not documented > --- > > Key: SPARK-16893 > URL: https://issues.apache.org/jira/browse/SPARK-16893 > Project: Spark > Issue Type: Documentation >Affects Versions: 2.0.0 >Reporter: Aseem Bansal >Priority: Minor > > I was working with databricks spark csv library and came across an error. I > have logged the issue in their github but it would be good to document that > in Apache Spark's documentation also > I faced it with CSV. Someone else faced that with JSON > http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file > Complete Issue details here > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
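For reference, a sketch of the workaround the error message itself suggests: with spark-csv on the classpath both providers register the short name "csv", so the provider has to be named by its fully qualified class (whether the external DefaultSource15 also resolves this way is not verified here):
{code:java}
// Name the built-in Spark 2.x source explicitly to break the tie:
Dataset<Row> df = spark.read()
        .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
        .option("header", "true")
        .option("nullValue", "")
        .load("/home/aseem/data.csv");

// Alternatively, drop the com.databricks:spark-csv dependency entirely;
// Spark 2.0 ships a CSV source built in, and plain .csv(path) then works.
{code}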
[jira] [Comment Edited] (SPARK-16895) Reading empty string from csv has changed behaviour
[ https://issues.apache.org/jira/browse/SPARK-16895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408892#comment-15408892 ] Aseem Bansal edited comment on SPARK-16895 at 8/5/16 5:19 AM: -- I see that this is a duplicate. Regarding it being a bug or not, I heard someone say this about frameworks: > If a feature is not documented it does not exist. If a change is not > documented then it is a bug. was (Author: anshbansal): I understand that it is duplicate. Regarding it being a bug or not I heard someone say this. > If a feature is not documented it does not exist. If a change is not > documented then it is a bug. > Reading empty string from csv has changed behaviour > --- > > Key: SPARK-16895 > URL: https://issues.apache.org/jira/browse/SPARK-16895 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I have a file called test.csv > "a" > "" > When I read it in Spark 1.4 I get an empty string as value. When I read it in > 2.0 I get "null" as the String. > The testing code is same as mentioned at > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16895) Reading empty string from csv has changed behaviour
[ https://issues.apache.org/jira/browse/SPARK-16895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408892#comment-15408892 ] Aseem Bansal commented on SPARK-16895: -- I understand that it is duplicate. Regarding it being a bug or not I heard someone say this. > If a feature is not documented it does not exist. If a change is not > documented then it is a bug. > Reading empty string from csv has changed behaviour > --- > > Key: SPARK-16895 > URL: https://issues.apache.org/jira/browse/SPARK-16895 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I have a file called test.csv > "a" > "" > When I read it in Spark 1.4 I get an empty string as value. When I read it in > 2.0 I get "null" as the String. > The testing code is same as mentioned at > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16896) Loading csv with duplicate column names
Aseem Bansal created SPARK-16896: Summary: Loading csv with duplicate column names Key: SPARK-16896 URL: https://issues.apache.org/jira/browse/SPARK-16896 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Aseem Bansal It would be great if the library allowed us to load a csv with duplicate column names. I understand that having duplicate columns in the data is odd, but sometimes we get data that has duplicate columns. Getting upstream data like that can happen. We may choose to ignore them, but currently there is no way to drop them as we are not able to load the file at all. Currently, as a pre-processing step, I loaded the data into R, changed the column names and then made a fixed version with which the Spark Java API can work. But to talk about other options: R, for example, has read.csv, which automatically takes care of such a situation by appending a number to the column name. Case sensitivity in column names can also cause problems. I mean if we have columns like ColumnName, columnName I may want to keep them separate. But the option to do this is not documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
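A sketch of an R-free workaround, assuming the file's first row is the header: read with the header disabled so the duplicate names never become column names, then assign unique names by hand (the names here are illustrative):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Read without the header so duplicate names are never used as columns...
Dataset<Row> raw = spark.read()
        .option("header", "false")
        .csv("/path/to/duplicate-columns.csv");

// ...assign de-duplicated names manually (R's read.csv does this by
// appending a number), then drop the old header row.
Dataset<Row> named = raw.toDF("id", "name", "name_1");
Dataset<Row> data = named.filter(col("id").notEqual("id"));
{code}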
[jira] [Commented] (SPARK-16896) Loading csv with duplicate column names
[ https://issues.apache.org/jira/browse/SPARK-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407604#comment-15407604 ] Aseem Bansal commented on SPARK-16896: -- [~hyukjin.kwon] cc > Loading csv with duplicate column names > --- > > Key: SPARK-16896 > URL: https://issues.apache.org/jira/browse/SPARK-16896 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > It would be great if the library allows us to load csv with duplicate column > names. I understand that having duplicate columns in the data is odd but > sometimes we get data that has duplicate columns. Getting upstream data like > that can happen. We may choose to ignore them but currently there is no way > to drop those as we are not able to load them at all. Currently as a > pre-processing I loaded the data into R, changed the column names and then > make a fixed version with which Spark Java API can work. > But if talk about other options, e.g. R has read.csv which automatically > takes care of such situation by appending a number to the column name. > Also case sensitivity in column names can also cause problems. I mean if we > have columns like > ColumnName, columnName > I may want to have them as separate. But the option to do this is not > documented. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16893) Spark CSV Provider option is not documented
[ https://issues.apache.org/jira/browse/SPARK-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407601#comment-15407601 ] Aseem Bansal commented on SPARK-16893: -- [~hyukjin.kwon] cc > Spark CSV Provider option is not documented > --- > > Key: SPARK-16893 > URL: https://issues.apache.org/jira/browse/SPARK-16893 > Project: Spark > Issue Type: Documentation >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I was working with databricks spark csv library and came across an error. I > have logged the issue in their github but it would be good to document that > in Apache Spark's documentation also > I faced it with CSV. Someone else faced that with JSON > http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file > Complete Issue details here > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16895) Reading empty string from csv has changed behaviour
[ https://issues.apache.org/jira/browse/SPARK-16895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407602#comment-15407602 ] Aseem Bansal commented on SPARK-16895: -- [~hyukjin.kwon] cc > Reading empty string from csv has changed behaviour > --- > > Key: SPARK-16895 > URL: https://issues.apache.org/jira/browse/SPARK-16895 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I have a file called test.csv > "a" > "" > When I read it in Spark 1.4 I get an empty string as value. When I read it in > 2.0 I get "null" as the String. > The testing code is same as mentioned at > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16895) Reading empty string from csv has changed behaviour
Aseem Bansal created SPARK-16895: Summary: Reading empty string from csv has changed behaviour Key: SPARK-16895 URL: https://issues.apache.org/jira/browse/SPARK-16895 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Aseem Bansal I have a file called test.csv "a" "" When I read it in Spark 1.4 I get an empty string as value. When I read it in 2.0 I get "null" as the String. The testing code is same as mentioned at https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
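A sketch of a workaround for code that depends on the Spark 1.4 behaviour: read the file as-is and turn the string nulls back into empty strings afterwards (spark is an existing session):
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read().csv("/home/aseem/test.csv");

// Spark 2.0 reads "" as null; fill string columns with "" to get the
// 1.4-era empty strings back.
Dataset<Row> restored = df.na().fill("");
{code}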
[jira] [Updated] (SPARK-16893) Spark CSV Provider option is not documented
[ https://issues.apache.org/jira/browse/SPARK-16893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-16893: - Description: I was working with databricks spark csv library and came across an error. I have logged the issue in their github but it would be good to document that in Apache Spark's documentation also I faced it with CSV. Someone else faced that with JSON http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file Complete Issue details here https://github.com/databricks/spark-csv/issues/367 was: I was working with databricks spark csv library and came across an error. I have logged the issue in their github but it would be good to document that in Apache Spark's documentation also Details here https://github.com/databricks/spark-csv/issues/367 > Spark CSV Provider option is not documented > --- > > Key: SPARK-16893 > URL: https://issues.apache.org/jira/browse/SPARK-16893 > Project: Spark > Issue Type: Documentation >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > > I was working with databricks spark csv library and came across an error. I > have logged the issue in their github but it would be good to document that > in Apache Spark's documentation also > I faced it with CSV. Someone else faced that with JSON > http://stackoverflow.com/questions/38761920/spark2-0-error-multiple-sources-found-for-json-when-read-json-file > Complete Issue details here > https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16893) Spark CSV Provider option is not documented
Aseem Bansal created SPARK-16893: Summary: Spark CSV Provider option is not documented Key: SPARK-16893 URL: https://issues.apache.org/jira/browse/SPARK-16893 Project: Spark Issue Type: Documentation Affects Versions: 2.0.0 Reporter: Aseem Bansal I was working with databricks spark csv library and came across an error. I have logged the issue in their github but it would be good to document that in Apache Spark's documentation also Details here https://github.com/databricks/spark-csv/issues/367 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9678) HTTP request to BlockManager port yields exception
[ https://issues.apache.org/jira/browse/SPARK-9678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662321#comment-14662321 ] Aseem Bansal commented on SPARK-9678: - I understand. Just thought to mention that. > HTTP request to BlockManager port yields exception > -- > > Key: SPARK-9678 > URL: https://issues.apache.org/jira/browse/SPARK-9678 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.4.1 > Environment: Ubuntu 14.0.4 >Reporter: Aseem Bansal >Priority: Minor > > I was going through the quick start for spark 1.4.1 at > http://spark.apache.org/docs/latest/quick-start.html. I am using pySpark. > Also the exact version that I am using is spark-1.4.1-bin-hadoop2.4 > The quick start has textFile = sc.textFile("README.md"). I ran that and then > the following text appeared in the command line > {noformat} > 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with > curMem=0, maxMem=278302556 > 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 140.5 KB, free 265.3 MB) > 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with > curMem=143840, maxMem=278302556 > 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes > in memory (estimated size 12.3 KB, free 265.3 MB) > 15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on localhost:53311 (size: 12.3 KB, free: 265.4 MB) > 15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at > NativeMethodAccessorImpl.java:-2 > {noformat} > I saw that there was an IP in these logs i.e. localhost:53311 > I tried connecting to it via Google Chrome and got an exception. > {noformat} > >>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection > >>> from /127.0.0.1:54056 > io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds > 2147483647: 5135603447292250196 - discarded > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) > at > io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA 
(v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9678) Exception while going through quick start
[ https://issues.apache.org/jira/browse/SPARK-9678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-9678: Description: I was going through the quick start for spark 1.4.1 at http://spark.apache.org/docs/latest/quick-start.html. I am using pySpark. Also the exact version that I am using is spark-1.4.1-bin-hadoop2.4 The quick start has textFile = sc.textFile("README.md"). I ran that and then the following text appeared in the command line {noformat} 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with curMem=0, maxMem=278302556 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 140.5 KB, free 265.3 MB) 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=143840, maxMem=278302556 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 265.3 MB) 15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53311 (size: 12.3 KB, free: 265.4 MB) 15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2 {noformat} I saw that there was an IP in these logs i.e. localhost:53311 I tried connecting to it via Google Chrome and got an exception. {noformat} >>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection >>> from /127.0.0.1:54056 io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 5135603447292250196 - discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) {noformat} was: I was going through the quick start for spark 1.4.1 at http://spark.apache.org/docs/latest/quick-start.html. I am using pySpark. Also the exact version that I am using is spark-1.4.1-bin-hadoop2.4 The quick start has textFile = sc.textFile("README.md"). 
I ran that and then the following text appeared in the command line 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with curMem=0, maxMem=278302556 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 140.5 KB, free 265.3 MB) 15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=143840, maxMem=278302556 15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 265.3 MB) 15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53311 (size: 12.3 KB, free: 265.4 MB) 15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2 I saw that there was an IP in these logs i.e. localhost:53311 I tried connecting to it via Google Chrome and got an exception. >>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection >>> from /127.0.0.1:54056 io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 5135603447292250196 - discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.de
[jira] [Updated] (SPARK-9678) Exception while going through quick start
[ https://issues.apache.org/jira/browse/SPARK-9678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aseem Bansal updated SPARK-9678:
Description:
I was going through the quick start for Spark 1.4.1 at http://spark.apache.org/docs/latest/quick-start.html. I am using PySpark. The exact version that I am using is spark-1.4.1-bin-hadoop2.4.

The quick start has textFile = sc.textFile("README.md"). I ran that, and the following text appeared in the command line:
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with curMem=0, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 140.5 KB, free 265.3 MB)
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=143840, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 265.3 MB)
15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53311 (size: 12.3 KB, free: 265.4 MB)
15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
I saw a host and port in these logs, i.e. localhost:53311. I tried connecting to it via Google Chrome and got an exception:
>>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection from /127.0.0.1:54056
io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 5135603447292250196 - discarded
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at java.lang.Thread.run(Thread.java:745)

was:
I was going through the quick start for Spark 1.4.1 at http://spark.apache.org/docs/latest/quick-start.html. I am using PySpark.

The quick start has textFile = sc.textFile("README.md"). I ran that, and the following text appeared in the command line:
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with curMem=0, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 140.5 KB, free 265.3 MB)
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=143840, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 265.3 MB)
15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53311 (size: 12.3 KB, free: 265.4 MB)
15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
I saw a host and port in these logs, i.e. localhost:53311. I tried connecting to it via Google Chrome and got an exception:
>>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection from /127.0.0.1:54056
io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 5135603447292250196 - discarded
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at java.lang.Thread.run(Thread.java:745)
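The huge frame length in the warning is not random. localhost:53311 is the block manager's internal data-transfer port, which speaks Spark's length-prefixed Netty protocol rather than HTTP, and the LengthFieldBasedFrameDecoder in the stack trace reads the first 8 bytes of each incoming message as a big-endian frame length. The first 8 bytes Chrome sends are the start of its HTTP request line, "GET / HT"; read as a 64-bit integer, those bytes are exactly the 5135603447292250196 in the log, which exceeds the 2147483647 (Integer.MAX_VALUE) cap, so the frame is discarded and the connection dropped. A quick sanity check (plain Python, no Spark needed):
{code:python}
import struct

# The first 8 bytes a browser sends on a fresh HTTP connection.
header = b"GET / HT"

# Interpret them the way an 8-byte, big-endian length field would be read.
(frame_length,) = struct.unpack(">q", header)

print(frame_length)               # 5135603447292250196 -- exactly the value in the warning
print(frame_length > 2147483647)  # True: exceeds the decoder's max frame size
{code}
So the exception is expected behavior when a non-Spark client talks to this port; the browsable Web UI lives on a separate port (4040 by default in Spark 1.4.1).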
[jira] [Created] (SPARK-9678) Exception while going through quick start
Aseem Bansal created SPARK-9678:
---
Summary: Exception while going through quick start
Key: SPARK-9678
URL: https://issues.apache.org/jira/browse/SPARK-9678
Project: Spark
Issue Type: Bug
Affects Versions: 1.4.1
Environment: Ubuntu 14.04
Reporter: Aseem Bansal

I was going through the quick start for Spark 1.4.1 at http://spark.apache.org/docs/latest/quick-start.html. I am using PySpark.

The quick start has textFile = sc.textFile("README.md"). I ran that, and the following text appeared in the command line:
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(143840) called with curMem=0, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 140.5 KB, free 265.3 MB)
15/08/06 10:37:03 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=143840, maxMem=278302556
15/08/06 10:37:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 265.3 MB)
15/08/06 10:37:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53311 (size: 12.3 KB, free: 265.4 MB)
15/08/06 10:37:03 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
I saw a host and port in these logs, i.e. localhost:53311. I tried connecting to it via Google Chrome and got an exception:
>>> 15/08/06 10:37:30 WARN TransportChannelHandler: Exception in connection from /127.0.0.1:54056
io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 5135603447292250196 - discarded
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
	at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at java.lang.Thread.run(Thread.java:745)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org