[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms

2019-04-05 Thread Saikat Kanjilal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811163#comment-16811163
 ] 

Saikat Kanjilal commented on SPARK-25994:
-

[~mju] I added some initial comments, will review PR next, would like to help 
out on this going forward, please advise on how best to help other then 
reviewing Design Doc and PR

> SPIP: Property Graphs, Cypher Queries, and Algorithms
> -
>
> Key: SPARK-25994
> URL: https://issues.apache.org/jira/browse/SPARK-25994
> Project: Spark
>  Issue Type: Epic
>  Components: Graph
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Martin Junghanns
>Priority: Major
>  Labels: SPIP
>
> Copied from the SPIP doc:
> {quote}
> GraphX was one of the foundational pillars of the Spark project, and is the 
> current graph component. This reflects the importance of the graphs data 
> model, which naturally pairs with an important class of analytic function, 
> the network or graph algorithm. 
> However, GraphX is not actively maintained. It is based on RDDs, and cannot 
> exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala 
> users.
> GraphFrames is a Spark package, which implements DataFrame-based graph 
> algorithms, and also incorporates simple graph pattern matching with fixed 
> length patterns (called “motifs”). GraphFrames is based on DataFrames, but 
> has a semantically weak graph data model (based on untyped edges and 
> vertices). The motif pattern matching facility is very limited by comparison 
> with the well-established Cypher language. 
> The Property Graph data model has become quite widespread in recent years, 
> and is the primary focus of commercial graph data management and of graph 
> data research, both for on-premises and cloud data management. Many users of 
> transactional graph databases also wish to work with immutable graphs in 
> Spark.
> The idea is to define a Cypher-compatible Property Graph type based on 
> DataFrames; to replace GraphFrames querying with Cypher; to reimplement 
> GraphX/GraphFrames algos on the PropertyGraph type. 
> To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), 
> reusing existing proven designs and code, will be employed in Spark 3.0. This 
> graph query processor, like CAPS, will overlay and drive the SparkSQL 
> Catalyst query engine, using the CAPS graph query planner.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms

2019-01-29 Thread Saikat Kanjilal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755272#comment-16755272
 ] 

Saikat Kanjilal commented on SPARK-25994:
-

[~mju] I would like to help out on this issue on the design/implementation, 
thoughts on the best steps to get involved, I will review the doc above as a 
first step

> SPIP: Property Graphs, Cypher Queries, and Algorithms
> -
>
> Key: SPARK-25994
> URL: https://issues.apache.org/jira/browse/SPARK-25994
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Martin Junghanns
>Priority: Major
>  Labels: SPIP
>
> Copied from the SPIP doc:
> {quote}
> GraphX was one of the foundational pillars of the Spark project, and is the 
> current graph component. This reflects the importance of the graphs data 
> model, which naturally pairs with an important class of analytic function, 
> the network or graph algorithm. 
> However, GraphX is not actively maintained. It is based on RDDs, and cannot 
> exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala 
> users.
> GraphFrames is a Spark package, which implements DataFrame-based graph 
> algorithms, and also incorporates simple graph pattern matching with fixed 
> length patterns (called “motifs”). GraphFrames is based on DataFrames, but 
> has a semantically weak graph data model (based on untyped edges and 
> vertices). The motif pattern matching facility is very limited by comparison 
> with the well-established Cypher language. 
> The Property Graph data model has become quite widespread in recent years, 
> and is the primary focus of commercial graph data management and of graph 
> data research, both for on-premises and cloud data management. Many users of 
> transactional graph databases also wish to work with immutable graphs in 
> Spark.
> The idea is to define a Cypher-compatible Property Graph type based on 
> DataFrames; to replace GraphFrames querying with Cypher; to reimplement 
> GraphX/GraphFrames algos on the PropertyGraph type. 
> To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), 
> reusing existing proven designs and code, will be employed in Spark 3.0. This 
> graph query processor, like CAPS, will overlay and drive the SparkSQL 
> Catalyst query engine, using the CAPS graph query planner.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-07-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084361#comment-16084361
 ] 

Saikat Kanjilal commented on SPARK-18085:
-

[~vanzin] I would be interested in helping with this, I will first read the 
proposal and go through this thread and add my feedback, barring that are there 
areas where you need immediate help?

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19224) [PYSPARK] Python tests organization

2017-01-14 Thread Saikat Kanjilal (JIRA)
Saikat Kanjilal created SPARK-19224:
---

 Summary: [PYSPARK] Python tests organization
 Key: SPARK-19224
 URL: https://issues.apache.org/jira/browse/SPARK-19224
 Project: Spark
  Issue Type: Test
  Components: PySpark
Affects Versions: 2.1.0
 Environment: Specific to pyspark
Reporter: Saikat Kanjilal
 Fix For: 2.2.0


Move all pyspark tests to packages and separating into modules reflecting 
project structure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2017-01-07 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15808216#comment-15808216
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~srowen]  build/tests has passed in jenkins, what do you think?  Should we 
commit a little at a time to keep this moving forward or should I make the next 
change, I prefer to create new pull requests for each unit test area that I fix 
so my preference would be to commit this little change for ContextCleaner.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-12-10 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15738477#comment-15738477
 ] 

Saikat Kanjilal commented on SPARK-9487:


Then I would suggest keeping it open and focus on a particular module and make 
the unit tests robust in that module, is there a specific module that's in dire 
need of robustness of unit tests, I was thinking of picking the sql module and 
moving forward to make the unit tests under that be more robust as a first 
goal, thoughts?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-12-10 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15738419#comment-15738419
 ] 

Saikat Kanjilal commented on SPARK-9487:


I'm ok closing it actually but it does outline issues with robustness around 
the unit tests, should we open up another jira or reframe this effort to make 
the unit tests more robust, that may require some more thought/redesign to 
produce identical results locally as well as in jenkins, my vote would be to 
close this out and recreate another jira that I can take on to make the unit 
tests more robust for 1 specific component with very narrowly defined goals, 
what do you think?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-12-10 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15738309#comment-15738309
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~srowen] I think the above plan is great minus one fundamental flaw, I already 
have tests passing uniformly across multiple components locally, the issue I am 
running into is trying to get the tests working in jenkins, currently every 
change I've made locally passes unit tests.Until the issue with my local 
environment and jenkins gets resolved I don't see a clever way to get tests to 
pass , let me know your thoughts on a good way to get past this.  After we 
figure this out I can pick a set of components to work with a uniform number of 
threads.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-12-08 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15733342#comment-15733342
 ] 

Saikat Kanjilal commented on SPARK-9487:


Given the latest thread on the devlist thoughts [~srowen][~rxin] on next steps 
for this?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-22 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687208#comment-15687208
 ] 

Saikat Kanjilal commented on SPARK-9487:


I really want to finish the effort that I started for helping the community and 
will do my best to debug all the issues moving forward, for now I will skip 
ahead to the python tests to get those working and then come back and 
troubleshoot and fix all the test failures, sound reasonable?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-21 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684758#comment-15684758
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~srowen] following up, thoughts on how to proceed on these, I looked through , 
for example I looked at LogisticRegresionSuite and I dont see anything about 
even specifying local[2] versus local[4]?  Thoughts on how to proceed on these, 
they all succeed locally as I mentioned

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-19 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680080#comment-15680080
 ] 

Saikat Kanjilal commented on SPARK-9487:


Ok guess I spoke too soon :), onto the next set of challenges, jenkins build 
report is here:  
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68897/


I ran each of these tests individually as well as together as a suite and they 
all passed, any ideas on how to address these?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-19 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680080#comment-15680080
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 11/19/16 11:59 PM:
---

Ok guess I spoke too soon :), onto the next set of challenges, jenkins build 
report is here:  
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68897/


I ran each of these tests individually as well as together as a suite locally 
and they all passed, any ideas on how to address these?


was (Author: kanjilal):
Ok guess I spoke too soon :), onto the next set of challenges, jenkins build 
report is here:  
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68897/


I ran each of these tests individually as well as together as a suite and they 
all passed, any ideas on how to address these?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-19 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679880#comment-15679880
 ] 

Saikat Kanjilal commented on SPARK-9487:


Ok fixed the unit test, didnt have to resort to using Sets, was able to compare 
the contents of each of the lists to certify the tests, pull request is here:  
https://github.com/apache/spark/pull/15848

Once pull request passes I will start working on fixing all the examples and 
the python code.  Let me know next steps

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-17 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674543#comment-15674543
 ] 

Saikat Kanjilal commented on SPARK-9487:


Sean,
I took a look at the code and here it is:

List inputData = Arrays.asList(
  Arrays.asList("hello", "world"),
  Arrays.asList("hello", "moon"),
  Arrays.asList("hello"));

List>> expected = Arrays.asList(
Arrays.asList(
new Tuple2<>("hello", 1L),
new Tuple2<>("world", 1L)),
Arrays.asList(
new Tuple2<>("hello", 1L),
new Tuple2<>("moon", 1L)),
Arrays.asList(
new Tuple2<>("hello", 1L)));

JavaDStream stream = JavaTestUtils.attachTestInputStream(ssc, 
inputData, 1);
JavaPairDStream counted = stream.countByValue();
JavaTestUtils.attachTestOutputStream(counted);
List>> result = JavaTestUtils.runStreams(ssc, 3, 
3);

Assert.assertEquals(expected, result);


As you can see the expected is assuming that the contents of the stream get 
counted accurately for every word, the output that gets generated through the 
flakiness just has hello,1 moon,1 reversed which I dont think matters, unless 
the goal of the test ist o identify words in order of how they enter the stream 
the expected and the actual answer are correct.  Therefore net net the test is 
flaky, should I refactor the test to actually look at the word count and not 
the order, thoughts on next steps?


> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-14 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15665370#comment-15665370
 ] 

Saikat Kanjilal commented on SPARK-9487:


I am running the tests by the following command: ./build/mvn test -P... 
-DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite , is that 
not the way Jenkins runs the tests , I noticed that locally I am getting the 
same error as Jenkins shown here 
(https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68529/consoleFull)
 regardless of whether I set the configuration to local[2] or local[4]:  
expected:<[[(hello,1), (world,1)], [(hello,1), (moon,1)], [(hello,1)]]> but 
was:<[[(hello,1), (world,1)], [(moon,1), (hello,1)], [(hello,1)]]>

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660237#comment-15660237
 ] 

Saikat Kanjilal commented on SPARK-9487:


Understood, so I wanted a fresh look at this from a different dev environment, 
so on my macbook pro I tried changing the setting to local[2] and local[4] for 
JavaAPISuite, it seems that they both fail so yes mimicing the real Jeankins 
failure will be hard, should I close this pull request till this is fixed and 
resubmit a new one, I have no idea at this point how long debugging this or 
even replicating this will take, thoughts on a suitable set of next steps?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-11 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658479#comment-15658479
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 11/11/16 11:20 PM:
---

ok so I've spent the last hour or so doing deeper investigations into the 
failures, I used this 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68531/ as a 
point of reference, listed below is what I found



LogisticRegressionSuite  (passed successfully on both my master and feature 
branches)   
   
OneVsRestSuite(passed successfully on both my master and feature branches) 
DataFrameStatSuite   (passed successfully on both my master and feature 
branches)   
   
DataFrameSuite (passed successfully on both my master and feature branches) 

 
SQLQueryTestSuite  (passed successfully on both my master and feature branches) 

  
ForeachSinkSuite  (passed successfully on both my master and feature branches)  


JavaAPISuite   (failed on both my master and feature branches)


The master branch does not have any code changes from me and the feature branch 
of course does 

I am running individual tests by issuing commands like the following from the 
root directory based on the documentation:  ./build/mvn test -P... 
-DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite


Therefore my conclusion so far based on the above jenkins report is that my 
changes have not introduced any new failures that were not already there, 
[~srowen] please let me know if my methodology is off anywhere




was (Author: kanjilal):
ok so I've spent the last hour or so doing deeper investigations into the 
failures, I used this 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68531/ as a 
point of reference, listed below is what I found


java/scala test   my master 
branch  
my feature branch 
LogisticRegressionSuite  success

success
OneVsRestSuite success  

  success
DataFrameStatSuite   success

 success
DataFrameSuite  success 

success
SQLQueryTestSuitesuccess

 success
ForeachSinkSuite   success  

   success
JavaAPISuite  
failure 
   failure


The master branch does not have any code changes from me and the feature branch 
of course does 

I am running individual tests by issuing commands like the following from the 
root directory based on the documentation:  ./build/mvn test -P... 
-DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite


Therefore my conclusion so far based on the above jenkins report is that my 
changes have not introduced any new failures that were not already there, 
[~srowen] please let me know if my methodology is off anywhere



> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will 

[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-11 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658479#comment-15658479
 ] 

Saikat Kanjilal commented on SPARK-9487:


ok so I've spent the last hour or so doing deeper investigations into the 
failures, I used this 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68531/ as a 
point of reference, listed below is what I found


java/scala test   my master 
branch  
my feature branch 
LogisticRegressionSuite  success

success
OneVsRestSuite success  

  success
DataFrameStatSuite   success

 success
DataFrameSuite  success 

success
SQLQueryTestSuitesuccess

 success
ForeachSinkSuite   success  

   success
JavaAPISuite  
failure 
   failure


The master branch does not have any code changes from me and the feature branch 
of course does 

I am running individual tests by issuing commands like the following from the 
root directory based on the documentation:  ./build/mvn test -P... 
-DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite


Therefore my conclusion so far based on the above jenkins report is that my 
changes have not introduced any new failures that were not already there, 
[~srowen] please let me know if my methodology is off anywhere



> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-11 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658144#comment-15658144
 ] 

Saikat Kanjilal commented on SPARK-9487:


No they don't which is why I asked, I will dig into these and resubmit.  Point 
taken around making another PR

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-11 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658055#comment-15658055
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~srowen] given this is my first patch, I wanted to understand a few things, I 
was looking at this:  
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68529/consoleFull
 and its not clear that these unit tests are in any way related to my changes, 
any insight on this, are these tests that happen to fail due to other 
dependencies missing, if so is someone else working to fix these?

I will move onto working on python unit tests under the same PR next.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-11 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15657646#comment-15657646
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~srowen] if there's no further issues I will : 1) start working on another 
pull request to fix all the python unit test issues 2) in this other pull 
request I will include the fixes for the example to work as well as the 
TestSQLContext.   Any objections to merging the pull request above?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-09 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651806#comment-15651806
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 11/9/16 7:22 PM:
-

Ok , for some odd reason my local branch had the changes but weren't committed, 
PR is here:  
https://github.com/skanjila/spark/commit/ec0b2a81dc8362e84e70457873560d997a7cb244

I added the change to local[4] to both streaming as well as repl, based on what 
I'm seeing locally all Java/Scala changes should be accounted for and unit 
tests pass, the only exception is the code inside spark examples in 
PageViewStream.scala, should I change this, seems like it doesn't belong as 
part of the unit tests.


My next TODOs:
1) Change the example code if it makes sense in PageViewStream
2) Start the code changes to fix the python unit tests

Let me know thoughts or concerns.


was (Author: kanjilal):
Ok , for some odd reason my local branch had the changes but weren't committed, 
PR is here:  
https://github.com/skanjila/spark/commit/ec0b2a81dc8362e84e70457873560d997a7cb244

I added the change to local[4] to both streaming as well as repl, based on what 
I'm seeing locally all Java/Scala changes should be accounted for except for 
spark examples with the code inside PageViewStream.scala, should I change this, 
seems like it doesn't belong as part of the unit tests.


My next TODOs:
1) Change the example code if it makes sense in PageViewStream
2) Start the code changes to fix the python unit tests

Let me know thoughts or concerns.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-09 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651806#comment-15651806
 ] 

Saikat Kanjilal commented on SPARK-9487:


Ok , for some odd reason my local branch had the changes but weren't committed, 
PR is here:  
https://github.com/skanjila/spark/commit/ec0b2a81dc8362e84e70457873560d997a7cb244

I added the change to local[4] to both streaming as well as repl, based on what 
I'm seeing locally all Java/Scala changes should be accounted for except for 
spark examples with the code inside PageViewStream.scala, should I change this, 
seems like it doesn't belong as part of the unit tests.


My next TODOs:
1) Change the example code if it makes sense in PageViewStream
2) Start the code changes to fix the python unit tests

Let me know thoughts or concerns.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-09 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651559#comment-15651559
 ] 

Saikat Kanjilal commented on SPARK-9487:


Sorry forgot to reply to your other question, from my checks I believe I had 
made all the java and scala changes as doing a simple find in IntelliIdea only 
shows the python changes being outstanding. 

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-09 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651553#comment-15651553
 ] 

Saikat Kanjilal commented on SPARK-9487:


I definitely want to figure out these test failures as a next step, however I 
think I'd like for folks to have the benefit of the changes to the Scala and 
Java changes independent of the python work.  If that makes sense what are the 
next steps to commit this pull request with only the scala/java changes?  To 
that end I will create a sub-branch focused on the python stuff and merge in 
those changes into my current branch once the tests are fixed.


[~srowen], let me know your thoughts and if the above makes sense

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-09 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651496#comment-15651496
 ] 

Saikat Kanjilal commented on SPARK-9487:


ok I have moved onto python, I am attaching a log that contains test errors 
upon changing local[2] to local[4] on the ml module in python



Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).

[Stage 49:> (0 + 3) / 
3]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.


**
File "/Users/skanjila/code/opensource/spark/python/pyspark/ml/clustering.py", 
line 98, in __main__.GaussianMixture
Failed example:
model.gaussiansDF.show()
Expected:
+++
|mean| cov|
+++
|[0.8250140229...|0.0056256...|
|[-0.4777098016092...|0.167969502720916...|
|[-0.4472625243352...|0.167304119758233...|
+++
...
Got:
+++
|mean| cov|
+++
|[-0.6158006194417...|0.132188091748508...|
|[0.54523101952701...|0.159129291449328...|
|[0.54042985246699...|0.161430620150745...|
+++

**
File "/Users/skanjila/code/opensource/spark/python/pyspark/ml/clustering.py", 
line 123, in __main__.GaussianMixture
Failed example:
model2.gaussiansDF.show()
Expected:
+++
|mean| cov|
+++
|[0.8250140229...|0.0056256...|
|[-0.4777098016092...|0.167969502720916...|
|[-0.4472625243352...|0.167304119758233...|
+++
...
Got:
+++
|mean| cov|
+++
|[-0.6158006194417...|0.132188091748508...|
|[0.54523101952701...|0.159129291449328...|
|[0.54042985246699...|0.161430620150745...|
+++

**
File "/Users/skanjila/code/opensource/spark/python/pyspark/ml/clustering.py", 
line 656, in __main__.LDA
Failed example:
model.describeTopics().show()
Expected:
+-+---++
|topic|termIndices| termWeights|
+-+---++
|0| [1, 0]|[0.50401530077160...|
|1| [0, 1]|[0.50401530077160...|
+-+---++
...
Got:
+-+---++
|topic|termIndices| termWeights|
+-+---++
|0| [1, 0]|[0.50010191915681...|
|1| [0, 1]|[0.50010191915681...|
+-+---++

**
File "/Users/skanjila/code/opensource/spark/python/pyspark/ml/clustering.py", 
line 664, in __main__.LDA
Failed example:
model.topicsMatrix()
Expected:
DenseMatrix(2, 2, [0.496, 0.504, 0.504, 0.496], 0)
Got:
DenseMatrix(2, 2, [0.4999, 0.5001, 0.5001, 0.4999], 0)
**
2 items had failures:
   2 of  21 in __main__.GaussianMixture
   2 of  20 in __main__.LDA
***Test Failed*** 4 failures.



[~srowen][~holdenk]  thoughts on next steps, should this pull request also 
contain code fixes to fix the errors that occur when changing local[2] to 
local[4] or should we break up the pull request into subcomponents, one focused 
on the scala pieces already submitted and the next focused on fixing the python 
code to work with local[4], thoughts on next steps?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In 

[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-03 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634342#comment-15634342
 ] 

Saikat Kanjilal commented on SPARK-9487:


added local[4] to repl, sparksql, streaming, all tests pass, pull request is 
here: https://github.com/apache/spark/compare/master...skanjila:spark-9487


> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-03 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633695#comment-15633695
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~srowen], [~holdenk]  what are the next steps to drive this to the finish 
line, should I continue adding to this pull request and keep making the 
local[2]->local[4] changes through all of the codebase, would love some insight.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15620657#comment-15620657
 ] 

Saikat Kanjilal commented on SPARK-9487:


Added org.apache.spark.mllib unitTest changes to pull request

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15620613#comment-15620613
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 10/30/16 9:46 PM:
--

[~srowen] Yes I read through that link and adjusted the PR title, however 
please do let me know if I can proceed adding more to this PR including python 
and other parts of the codebase.


was (Author: kanjilal):
[~srowen] Yes I read through that link and adjusted the PR title, I will 
Jenkins test this next, however please do let me know if I can proceed adding 
more to this PR including python and other parts of the codebase.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15620613#comment-15620613
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~srowen] Yes, I read through that and adjusted the PR title; I will Jenkins-test this next. However, please do let me know if I can proceed with adding more to this PR, including Python and other parts of the codebase.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15620613#comment-15620613
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 10/30/16 9:40 PM:
--

[~srowen] Yes, I read through that link and adjusted the PR title; I will Jenkins-test this next. However, please do let me know if I can proceed with adding more to this PR, including Python and other parts of the codebase.


was (Author: kanjilal):
[~srowen] Yes, I read through that and adjusted the PR title; I will Jenkins-test this next. However, please do let me know if I can proceed with adding more to this PR, including Python and other parts of the codebase.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-30 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15620392#comment-15620392
 ] 

Saikat Kanjilal commented on SPARK-9487:


PR attached here: https://github.com/apache/spark/pull/15689
I only changed everything to local[4] in core and ran the unit tests; all of them passed successfully.

This is a WIP, so once folks have reviewed this initial request and signed off, I will start changing the Python pieces.

[~holdenk] [~sowen] let me know the next steps.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-23 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15599904#comment-15599904
 ] 

Saikat Kanjilal commented on SPARK-9487:


Ping on this. [~holdenk], can you let me know if I can move ahead with the above approach?
Thanks

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-18 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587627#comment-15587627
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 10/19/16 4:24 AM:
--

[~holdenk] Finally getting time to look at this, so I am starting small: I changed ContextCleanerSuite and HeartbeatReceiverSuite from local[2] to local[4]. Per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version), I ran mvn -P hadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test; looks like everything worked.

I then ran mvn -P hadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test; looks like everything worked as well.

See the attachments and let me know if this is the right process to run single unit tests; I'll then start making changes to the other suites. How would you like to see the output: should I just attach it, or do a pull request from the new branch that I created?
Thanks

PS
Another question: running single unit tests like this takes forever. Are there flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with SSD, the builds shouldn't take this long :(.

Let me know the next steps to get this into a PR.






was (Author: kanjilal):
[~holdenk] Finally getting time to look at this, so I am starting small: I changed ContextCleanerSuite and HeartbeatReceiverSuite from local[2] to local[4]. Per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version), I ran mvn -P hadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test; looks like everything worked.

I then ran mvn -P hadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test; looks like everything worked as well.

See the attachments and let me know whether this is the right process to run single unit tests; if it is, I'll start making changes to the other suites. How would you like to see the output: should I just attach it, or do a pull request from the new branch that I created?
Thanks

PS
Another question: running single unit tests like this takes forever. Are there flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with SSD, the builds shouldn't take this long :(.

Let me know the next steps to get this into a PR.





> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-18 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587627#comment-15587627
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 10/19/16 4:24 AM:
--

[~holdenk] Finally getting time to look at this, so I am starting small: I changed ContextCleanerSuite and HeartbeatReceiverSuite from local[2] to local[4]. Per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version), I ran mvn -P hadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test; looks like everything worked.

I then ran mvn -P hadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test; looks like everything worked as well.

See the attachments and let me know whether this is the right process to run single unit tests; if it is, I'll start making changes to the other suites. How would you like to see the output: should I just attach it, or do a pull request from the new branch that I created?
Thanks

PS
Another question: running single unit tests like this takes forever. Are there flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with SSD, the builds shouldn't take this long :(.

Let me know the next steps to get this into a PR.






was (Author: kanjilal):
[~holdenk] Finally getting time to look at this, so I am starting small: I changed ContextCleanerSuite and HeartbeatReceiverSuite from local[2] to local[4]. Per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version), I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test; looks like everything worked.

I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test; looks like everything worked as well.

See the attachments and let me know if this is not the right process to run single unit tests; otherwise, I'll start making changes to the other suites. How would you like to see the output: should I just attach it, or do a pull request from the new branch that I created?
Thanks

PS
Another question: running single unit tests like this takes forever. Are there flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with SSD, the builds shouldn't take this long :(.

Let me know the next steps to get this into a PR.





> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-18 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587627#comment-15587627
 ] 

Saikat Kanjilal commented on SPARK-9487:


[~holdenk] Finally getting time to look at this, so I am starting small: I changed ContextCleanerSuite and HeartbeatReceiverSuite from local[2] to local[4]. Per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version), I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test; looks like everything worked.

I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test; looks like everything worked as well.

See the attachments and let me know if this is not the right process to run single unit tests; otherwise, I'll start making changes to the other suites. How would you like to see the output: should I just attach it, or do a pull request from the new branch that I created?
Thanks

PS
Another question: running single unit tests like this takes forever. Are there flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with SSD, the builds shouldn't take this long :(.

Let me know the next steps to get this into a PR.
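
As a general note on the speed question above: the Spark build can run a single ScalaTest suite in one module, which is much faster than a full build. A sketch, assuming the -Dtest/-DwildcardSuites properties wired into Spark's Maven poms and the build/mvn and build/sbt launchers:

{code:none}
# Skip the Java (surefire) tests and run one ScalaTest suite in core:
build/mvn -pl core -Dtest=none -DwildcardSuites=org.apache.spark.ContextCleanerSuite test

# Or keep an sbt session open and rerun the suite without a full rebuild:
build/sbt "core/testOnly org.apache.spark.ContextCleanerSuite"
{code}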





> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-18 Thread Saikat Kanjilal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saikat Kanjilal updated SPARK-9487:
---
Attachment: ContextCleanerSuiteResults

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-18 Thread Saikat Kanjilal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saikat Kanjilal updated SPARK-9487:
---
Attachment: HeartbeatReceiverSuiteResults

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-15 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15578723#comment-15578723
 ] 

Saikat Kanjilal commented on SPARK-9487:


Synced the code; I am first familiarizing myself with how to run unit tests and work in the code, [~srowen] [~holdenk]. Next steps will be to run the unit tests and report the results here; stay tuned.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14516) Clustering evaluator

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570185#comment-15570185
 ] 

Saikat Kanjilal commented on SPARK-14516:
-

Hello,
I am new to Spark and interested in helping build a general-purpose clustering evaluator. Is the goal here to use the metrics to evaluate overall clustering quality? [~akamal] [~josephkb], let me know how I can help.
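
For reference, a minimal sketch of the shape the description below proposes; the class body is my assumption, not an agreed design:

{code:none}
from pyspark.ml.evaluation import Evaluator

class ClusteringEvaluator(Evaluator):
    # Hypothetical skeleton: extends Evaluator as the ticket suggests.

    def _evaluate(self, dataset):
        # Would compute e.g. a silhouette-style score from the
        # "features" and "prediction" columns; omitted in this sketch.
        raise NotImplementedError

    def isLargerBetter(self):
        # Higher scores would mean better-separated clusters for most metrics.
        return True
{code}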



> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12372) Document limitations of MLlib local linear algebra

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570152#comment-15570152
 ] 

Saikat Kanjilal commented on SPARK-12372:
-

@josephkb, I am new to contributing to Spark; is this something I can help with?

> Document limitations of MLlib local linear algebra
> --
>
> Key: SPARK-12372
> URL: https://issues.apache.org/jira/browse/SPARK-12372
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Christos Iraklis Tsatsoulis
>
> This JIRA is now for documenting limitations of MLlib's local linear algebra 
> types.  Basically, we should make it clear in the user guide that they 
> provide simple functionality but are not a full-fledged local linear library. 
>  We should also recommend libraries for users to use in the meantime: 
> probably Breeze for Scala (and Java?) and numpy/scipy for Python.
> *Original JIRA title*: Unary operator "-" fails for MLlib vectors
> *Original JIRA text, as an example of the need for better docs*:
> Consider the following snippet in pyspark 1.5.2:
> {code:none}
> >>> from pyspark.mllib.linalg import Vectors
> >>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0])
> >>> x
> DenseVector([0.0, 1.0, 0.0, 7.0, 0.0])
> >>> -x
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: func() takes exactly 2 arguments (1 given)
> >>> y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0])
> >>> y
> DenseVector([2.0, 0.0, 3.0, 4.0, 5.0])
> >>> x-y
> DenseVector([-2.0, 1.0, -3.0, 3.0, -5.0])
> >>> -y+x
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: func() takes exactly 2 arguments (1 given)
> >>> -1*x
> DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0])
> {code}
> Clearly, the unary operator {{-}} (minus) for vectors fails, giving errors 
> for expressions like {{-x}} and {{-y+x}}, despite the fact that {{x-y}} 
> behaves as expected.
> The last operation, {{-1*x}}, although mathematically "correct", includes 
> minus signs for the zero entries, which again is normally not expected.
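
An interim workaround in the spirit of the recommendation above (assuming {{DenseVector.toArray}}, which returns a numpy array) is to negate in numpy and re-wrap:

{code:none}
>>> from pyspark.mllib.linalg import Vectors
>>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0])
>>> Vectors.dense(-x.toArray())  # negate via numpy, then wrap again
DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0])
{code}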



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570122#comment-15570122
 ] 

Saikat Kanjilal commented on SPARK-9487:


Hello all,
Can I help with this in any way?
Thanks

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14212) Add configuration element for --packages option

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570125#comment-15570125
 ] 

Saikat Kanjilal edited comment on SPARK-14212 at 10/12/16 10:53 PM:


holdenk@ can I help out with this?


was (Author: kanjilal):
heldenk@ can I help out with this?

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>Priority: Trivial
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf, would be good to have.
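
As a note on the request: later Spark releases expose a configuration property that mirrors the flag, so an entry along these lines in conf/spark-defaults.conf should work (assuming the {{spark.jars.packages}} property; check the docs for the version in use):

{code:none}
# Equivalent of `pyspark --packages com.databricks:spark-csv_2.10:1.4.0`
spark.jars.packages  com.databricks:spark-csv_2.10:1.4.0
{code}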



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14212) Add configuration element for --packages option

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570125#comment-15570125
 ] 

Saikat Kanjilal commented on SPARK-14212:
-

heldenk@ can I help out with this?

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>Priority: Trivial
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf, would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14212) Add configuration element for --packages option

2016-10-12 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570125#comment-15570125
 ] 

Saikat Kanjilal edited comment on SPARK-14212 at 10/12/16 10:54 PM:


@holdenk can I help out with this?


was (Author: kanjilal):
holdenk@ can I help out with this?

> Add configuration element for --packages option
> ---
>
> Key: SPARK-14212
> URL: https://issues.apache.org/jira/browse/SPARK-14212
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Affects Versions: 1.6.1
>Reporter: Russell Jurney
>Priority: Trivial
>  Labels: config, starter
>
> I use PySpark with the --packages option, for instance to load support for 
> CSV: 
> pyspark --packages com.databricks:spark-csv_2.10:1.4.0
> I would like to not have to set this every time at the command line, so a 
> corresponding element for --packages in the configuration file 
> spark-defaults.conf, would be good to have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-05-01 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266029#comment-15266029
 ] 

Saikat Kanjilal commented on SPARK-14302:
-

Works for me, so what else can I help with?

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-04-29 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265118#comment-15265118
 ] 

Saikat Kanjilal commented on SPARK-14302:
-

And here's the duplication inside the mllib directory:

Tasks in common:
* initialize SparkContext
* MLUtils.loadLibSVMFile

Duplicated pairs:
* correlations && correlations_example: both use Statistics.corr; should be generalized like correlations_example
* gaussian_mixture_model and its example
* kmeans and kmeans_example
* word2vec and word2vec_example

We really should think about combining the example Python files and the actual algorithms (for example, kmeans and kmeans_example); what do you think?

Let me know your thoughts on the duplication-removal ideas above before I make any code changes.

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-04-29 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264207#comment-15264207
 ] 

Saikat Kanjilal commented on SPARK-14302:
-

OK, I finished my initial assessment of the ml directory; here are the things that I see duplicated:
1) The initialization of the SparkContext
2) The initialization of the SQLContext
3) Creation of data frames

Xusen, should we move the functionality above into a common Python file and have it be referenced by all the other files? I can create a Common.py, remove all of the above code from the individual files, and have them reference it as needed; see the sketch after this comment.

I will next begin my assessment of the mllib directory.
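
A minimal sketch of what such a shared helper might look like; the file name Common.py comes from the comment above, while the function name and layout are my assumption:

{code:none}
# Common.py: hypothetical shared setup for the Python examples.
from pyspark import SparkContext
from pyspark.sql import SQLContext

def get_contexts(app_name):
    # Create the SparkContext/SQLContext pair that most examples repeat.
    sc = SparkContext(appName=app_name)
    return sc, SQLContext(sc)
{code}

Each example would then start with something like {{sc, sqlContext = get_contexts("KMeansExample")}} instead of repeating the boilerplate.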

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-04-28 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262825#comment-15262825
 ] 

Saikat Kanjilal commented on SPARK-14302:
-

OK, thanks for the clarifications. So that means removing duplicated code only inside each of those directories individually; my bad, I thought it was between the two. Give me a few days to get a pull request together, as I will need to revamp my approach; my apologies.

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-04-28 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262796#comment-15262796
 ] 

Saikat Kanjilal commented on SPARK-14302:
-

Did you see my question above? Which of the directories is the one to keep? I was waiting for some clarification; in the meantime, I have a branch that I've been using to do this work.

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-04-07 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230434#comment-15230434
 ] 

Saikat Kanjilal commented on SPARK-14302:
-

Xusen,
I didn't hear back on the above question, so I will merge the code the way I think it should work and send a patch; if you have any other thoughts, let me know.
Thanks

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14302) Python examples code merge and clean up

2016-04-01 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15222649#comment-15222649
 ] 

Saikat Kanjilal edited comment on SPARK-14302 at 4/2/16 2:09 AM:
-

Next question: which of the directories should contain the merged code? For example, I am looking at bisecting_k_means; the code in the two directories is very similar but uses slightly different APIs. My recommendation would be to merge this code into one directory (either ml or mllib). So, in general, when I am merging code in my patch, which of the directories should I put the result in? Also, does one directory traditionally hold the older code, so that I should always merge into the other?
Thanks


was (Author: kanjilal):
Next question: which of the directories should contain the merged code? For example, I am looking at bisecting_k_means; the code in the two directories is very similar, but one trains the model before throwing test data at the model. My recommendation would be to merge this code into one directory (either ml or mllib). So, in general, when I am merging code in my patch, which of the directories should I put the result in?

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14302) Python examples code merge and clean up

2016-04-01 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15222649#comment-15222649
 ] 

Saikat Kanjilal edited comment on SPARK-14302 at 4/2/16 2:04 AM:
-

Next question: which of the directories should contain the merged code? For example, I am looking at bisecting_k_means; the code in the two directories is very similar, but one trains the model before throwing test data at the model. My recommendation would be to merge this code into one directory (either ml or mllib). So, in general, when I am merging code in my patch, which of the directories should I put the result in?


was (Author: kanjilal):
Next question: which of the directories should contain the merged code? For example, I am looking at bisecting_k_means; the code in the two directories is very similar, but one trains the model before throwing test data. My recommendation would be to merge this code into one directory (either ml or mllib). So, in general, when I am merging code in my patch, which of the directories should I put the result in?

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-04-01 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15222649#comment-15222649
 ] 

Saikat Kanjilal commented on SPARK-14302:
-

Next question: which of the directories should contain the merged code? For example, I am looking at bisecting_k_means; the code in the two directories is very similar, but one trains the model before throwing test data. My recommendation would be to merge this code into one directory (either ml or mllib). So, in general, when I am merging code in my patch, which of the directories should I put the result in?

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-03-31 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220330#comment-15220330
 ] 

Saikat Kanjilal commented on SPARK-14302:
-

OK, thanks. So which should I take to merge/delete example code: the Python or the Java piece?

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14302) Python examples code merge and clean up

2016-03-31 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220312#comment-15220312
 ] 

Saikat Kanjilal edited comment on SPARK-14302 at 3/31/16 5:51 PM:
--

If I understand this correctly, the goal is just to compare the code in python/examples/mllib and python/examples/ml and contribute a patch that dedupes code. One question: which is the correct directory where Spark needs to keep its Python examples, ml or mllib? Actually, we also need to do the Java directories that you've listed: java/examples/ml and java/examples/mllib.


was (Author: kanjilal):
If I understand this correctly, the goal is just to compare the code in python/examples/mllib and python/examples/ml and contribute a patch that dedupes code. One question: which is the correct directory where Spark needs to keep its Python examples, ml or mllib?

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-03-31 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220312#comment-15220312
 ] 

Saikat Kanjilal commented on SPARK-14302:
-

If I understand this correctly, the goal is just to compare the code in python/examples/mllib and python/examples/ml and contribute a patch that dedupes code. One question: which is the correct directory where Spark needs to keep its Python examples, ml or mllib?

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-03-31 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220303#comment-15220303
 ] 

Saikat Kanjilal commented on SPARK-14302:
-

Hi,
Can I help with this issue? I've been wanting to contribute to Spark and finally have some time.
Regards

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning those code, be sure not disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org