[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms
[ https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811163#comment-16811163 ] Saikat Kanjilal commented on SPARK-25994:
-
[~mju] I added some initial comments and will review the PR next. I would like to help out on this going forward; please advise on how best to help, other than reviewing the design doc and the PR.

> SPIP: Property Graphs, Cypher Queries, and Algorithms
> -
>
> Key: SPARK-25994
> URL: https://issues.apache.org/jira/browse/SPARK-25994
> Project: Spark
> Issue Type: Epic
> Components: Graph
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Martin Junghanns
> Priority: Major
> Labels: SPIP
>
> Copied from the SPIP doc:
> {quote}
> GraphX was one of the foundational pillars of the Spark project, and is the current graph component. This reflects the importance of the graph data model, which naturally pairs with an important class of analytic function, the network or graph algorithm.
> However, GraphX is not actively maintained. It is based on RDDs, and cannot exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala users.
> GraphFrames is a Spark package which implements DataFrame-based graph algorithms, and also incorporates simple graph pattern matching with fixed-length patterns (called “motifs”). GraphFrames is based on DataFrames, but has a semantically weak graph data model (based on untyped edges and vertices). The motif pattern-matching facility is very limited by comparison with the well-established Cypher language.
> The Property Graph data model has become quite widespread in recent years, and is the primary focus of commercial graph data management and of graph data research, both for on-premises and cloud data management. Many users of transactional graph databases also wish to work with immutable graphs in Spark.
> The idea is to define a Cypher-compatible Property Graph type based on DataFrames; to replace GraphFrames querying with Cypher; and to reimplement the GraphX/GraphFrames algorithms on the PropertyGraph type.
> To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), reusing existing proven designs and code, will be employed in Spark 3.0. This graph query processor, like CAPS, will overlay and drive the Spark SQL Catalyst query engine, using the CAPS graph query planner.
> {quote}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
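To make the quoted data model concrete, here is a tiny self-contained sketch (plain Python lists of dicts standing in for the vertex and edge DataFrames; the function name and schema are illustrative assumptions, not part of CAPS or any proposed Spark API) of how a labeled property graph reduces a fixed-length Cypher-style pattern such as (a:Person)-[:KNOWS]->(b:Person) to joins over two tables:

```python
# Toy illustration only: a property graph as two tables, vertices with
# labels/properties and edges with relationship types, mirroring the
# DataFrame-backed PropertyGraph idea described above.
vertices = [
    {"id": 0, "label": "Person", "name": "Alice"},
    {"id": 1, "label": "Person", "name": "Bob"},
    {"id": 2, "label": "City",   "name": "Oslo"},
]
edges = [
    {"src": 0, "dst": 1, "rel": "KNOWS"},
    {"src": 0, "dst": 2, "rel": "LIVES_IN"},
]

def match_pattern(vertices, edges, rel, src_label, dst_label):
    """Fixed-length pattern (a:src_label)-[:rel]->(b:dst_label): a one-hop
    Cypher-style MATCH compiled to joins/filters over the two tables."""
    by_id = {v["id"]: v for v in vertices}
    return [
        (by_id[e["src"]]["name"], by_id[e["dst"]]["name"])
        for e in edges
        if e["rel"] == rel
        and by_id[e["src"]]["label"] == src_label
        and by_id[e["dst"]]["label"] == dst_label
    ]

print(match_pattern(vertices, edges, "KNOWS", "Person", "Person"))  # [('Alice', 'Bob')]
```

The point of the typed labels on vertices and relationship types on edges is exactly what the quote says GraphFrames lacks: the pattern can be checked against a schema rather than matching untyped rows.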
[jira] [Commented] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms
[ https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755272#comment-16755272 ] Saikat Kanjilal commented on SPARK-25994:
-
[~mju] I would like to help out on this issue on the design/implementation side. Any thoughts on the best steps to get involved? I will review the doc above as a first step.
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084361#comment-16084361 ] Saikat Kanjilal commented on SPARK-18085:
-
[~vanzin] I would be interested in helping with this. I will first read the proposal, go through this thread, and add my feedback; barring that, are there areas where you need immediate help?

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, Web UI
> Affects Versions: 2.0.0
> Reporter: Marcelo Vanzin
> Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
> It's a known fact that the History Server currently has some annoying issues when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. I'll be attaching a document shortly describing the issues and suggesting a path to how to solve them.
[jira] [Created] (SPARK-19224) [PYSPARK] Python tests organization
Saikat Kanjilal created SPARK-19224:
---
Summary: [PYSPARK] Python tests organization
Key: SPARK-19224
URL: https://issues.apache.org/jira/browse/SPARK-19224
Project: Spark
Issue Type: Test
Components: PySpark
Affects Versions: 2.1.0
Environment: Specific to pyspark
Reporter: Saikat Kanjilal
Fix For: 2.2.0

Move all pyspark tests into packages, separating them into modules that reflect the project structure.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808216#comment-15808216 ] Saikat Kanjilal commented on SPARK-9487:
-
[~srowen] The build/tests have passed in Jenkins; what do you think? Should we commit a little at a time to keep this moving forward, or should I make the next change? I prefer to create a new pull request for each unit test area that I fix, so my preference would be to commit this little change for ContextCleaner.

> Use the same num. worker threads in Scala/Python unit tests
> -
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Core, SQL, Tests
> Affects Versions: 1.5.0
> Reporter: Xiangrui Meng
> Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use `local[2]` and `local` for some unit tests in SQL, MLLib, and other components. If the operation depends on partition IDs, e.g., a random number generator, this will lead to different results in Python and Scala/Java. It would be nice to use the same number in all unit tests.
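A minimal self-contained sketch of the failure mode the issue description alludes to (pure Python, no Spark; the hash-partitioner and the per-partition "seed" below are toy stand-ins for Spark's HashPartitioner and a partition-ID-seeded RNG, so the names and numbers are illustrative only):

```python
def partition(items, num_partitions):
    """Toy hash-partitioner: place each item into hash(item) % num_partitions,
    roughly how Spark assigns keys to partitions."""
    parts = [[] for _ in range(num_partitions)]
    for x in items:
        parts[hash(x) % num_partitions].append(x)
    return parts

def partition_seeded_sums(items, num_partitions):
    """Toy 'operation that depends on partition IDs': each element's
    contribution is offset by its partition index, the way a per-partition
    RNG seed makes results depend on the partitioning."""
    return [sum(x + pid for x in part)
            for pid, part in enumerate(partition(items, num_partitions))]

data = list(range(8))
print(partition_seeded_sums(data, 2))  # [12, 20]      -> overall total 32
print(partition_seeded_sums(data, 4))  # [4, 8, 12, 16] -> overall total 40
```

Because the per-partition results differ between 2 and 4 partitions even though the input is identical, a Python suite running `local[4]` and a Scala suite running `local[2]` can legitimately disagree, which is the motivation for standardizing the thread count.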
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15738477#comment-15738477 ] Saikat Kanjilal commented on SPARK-9487:
-
Then I would suggest keeping it open, focusing on a particular module, and making the unit tests robust in that module. Is there a specific module that's in dire need of more robust unit tests? I was thinking of picking the sql module and making the unit tests under it more robust as a first goal. Thoughts?
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15738419#comment-15738419 ] Saikat Kanjilal commented on SPARK-9487:
-
I'm actually OK with closing it, but it does outline issues with the robustness of the unit tests. Should we open another JIRA, or reframe this effort toward making the unit tests more robust? That may require some more thought/redesign to produce identical results locally as well as in Jenkins. My vote would be to close this out and create another JIRA that I can take on, to make the unit tests more robust for one specific component with very narrowly defined goals. What do you think?
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15738309#comment-15738309 ] Saikat Kanjilal commented on SPARK-9487:
-
[~srowen] I think the above plan is great, minus one fundamental flaw: I already have tests passing uniformly across multiple components locally. The issue I am running into is getting the tests working in Jenkins; currently every change I've made passes the unit tests locally. Until the discrepancy between my local environment and Jenkins gets resolved, I don't see a clever way to get the tests to pass. Let me know your thoughts on a good way to get past this; after we figure it out, I can pick a set of components to work with a uniform number of threads.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15733342#comment-15733342 ] Saikat Kanjilal commented on SPARK-9487:
-
Given the latest thread on the dev list, any thoughts [~srowen] [~rxin] on next steps for this?
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15687208#comment-15687208 ] Saikat Kanjilal commented on SPARK-9487:
-
I really want to finish the effort that I started for the community, and I will do my best to debug all the issues moving forward. For now I will skip ahead to the python tests to get those working, and then come back to troubleshoot and fix all the test failures. Sound reasonable?
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684758#comment-15684758 ] Saikat Kanjilal commented on SPARK-9487:
-
[~srowen] Following up: thoughts on how to proceed on these? For example, I looked at LogisticRegressionSuite and I don't see anything that even specifies local[2] versus local[4]. They all succeed locally, as I mentioned.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680080#comment-15680080 ] Saikat Kanjilal commented on SPARK-9487:
-
Ok, I guess I spoke too soon :). Onto the next set of challenges; the Jenkins build report is here: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68897/ I ran each of these tests individually, as well as together as a suite, and they all passed. Any ideas on how to address these?
[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680080#comment-15680080 ] Saikat Kanjilal edited comment on SPARK-9487 at 11/19/16 11:59 PM:
---
Ok, I guess I spoke too soon :). Onto the next set of challenges; the Jenkins build report is here: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68897/ I ran each of these tests individually, as well as together as a suite locally, and they all passed. Any ideas on how to address these?

was (Author: kanjilal):
Ok, I guess I spoke too soon :). Onto the next set of challenges; the Jenkins build report is here: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68897/ I ran each of these tests individually, as well as together as a suite, and they all passed. Any ideas on how to address these?
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679880#comment-15679880 ] Saikat Kanjilal commented on SPARK-9487:
-
Ok, I fixed the unit test. I didn't have to resort to using Sets; I was able to compare the contents of each of the lists to certify the tests. The pull request is here: https://github.com/apache/spark/pull/15848 Once the pull request passes I will start working on fixing all the examples and the python code. Let me know the next steps.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674543#comment-15674543 ] Saikat Kanjilal commented on SPARK-9487:
-
Sean, I took a look at the code and here it is:

    List<List<String>> inputData = Arrays.asList(
        Arrays.asList("hello", "world"),
        Arrays.asList("hello", "moon"),
        Arrays.asList("hello"));
    List<List<Tuple2<String, Long>>> expected = Arrays.asList(
        Arrays.asList(
            new Tuple2<>("hello", 1L),
            new Tuple2<>("world", 1L)),
        Arrays.asList(
            new Tuple2<>("hello", 1L),
            new Tuple2<>("moon", 1L)),
        Arrays.asList(
            new Tuple2<>("hello", 1L)));
    JavaDStream<String> stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1);
    JavaPairDStream<String, Long> counted = stream.countByValue();
    JavaTestUtils.attachTestOutputStream(counted);
    List<List<Tuple2<String, Long>>> result = JavaTestUtils.runStreams(ssc, 3, 3);
    Assert.assertEquals(expected, result);

As you can see, the expected value assumes that the contents of the stream get counted accurately for every word. The output generated through the flakiness just has (hello,1) and (moon,1) reversed, which I don't think matters: unless the goal of the test is to identify words in the order in which they enter the stream, the expected and the actual answers are both correct. Therefore, net net, the test is flaky. Should I refactor the test to look at the word counts and not the order? Thoughts on next steps?
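The order-insensitive refactor proposed above can be sketched as follows (pure Python, no Spark; the function name is illustrative and not from the Spark test utilities): treat each batch's (word, count) output as a multiset, so batches that differ only in ordering compare equal.

```python
from collections import Counter

def batches_equal_unordered(expected, actual):
    """Compare per-batch streaming outputs, ignoring ordering within each
    batch: each batch is compared as a multiset of (word, count) pairs."""
    if len(expected) != len(actual):
        return False
    return all(Counter(e) == Counter(a) for e, a in zip(expected, actual))

expected = [[("hello", 1), ("world", 1)], [("hello", 1), ("moon", 1)], [("hello", 1)]]
actual   = [[("hello", 1), ("world", 1)], [("moon", 1), ("hello", 1)], [("hello", 1)]]

assert batches_equal_unordered(expected, actual)  # flaky ordering no longer fails
assert expected != actual                         # strict list equality would have failed
```

The same idea in the Java test would be to compare each batch as a set or multiset of Tuple2 values instead of asserting list equality.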
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15665370#comment-15665370 ] Saikat Kanjilal commented on SPARK-9487:
-
I am running the tests with the following command: ./build/mvn test -P... -DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite Is that not the way Jenkins runs the tests? I noticed that locally I am getting the same error as Jenkins, shown here (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68529/consoleFull), regardless of whether I set the configuration to local[2] or local[4]:

expected:<[[(hello,1), (world,1)], [(hello,1), (moon,1)], [(hello,1)]]> but was:<[[(hello,1), (world,1)], [(moon,1), (hello,1)], [(hello,1)]]>
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660237#comment-15660237 ] Saikat Kanjilal commented on SPARK-9487:
-
Understood. I wanted a fresh look at this from a different dev environment, so on my MacBook Pro I tried changing the setting to local[2] and local[4] for JavaAPISuite; it seems that they both fail, so yes, mimicking the real Jenkins failure will be hard. Should I close this pull request until this is fixed and resubmit a new one? I have no idea at this point how long debugging this, or even replicating it, will take. Thoughts on a suitable set of next steps?
[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15658479#comment-15658479 ] Saikat Kanjilal edited comment on SPARK-9487 at 11/11/16 11:20 PM:
---
Ok, I've spent the last hour or so doing a deeper investigation into the failures. I used https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68531/ as a point of reference; listed below is what I found:

LogisticRegressionSuite (passed successfully on both my master and feature branches)
OneVsRestSuite (passed successfully on both my master and feature branches)
DataFrameStatSuite (passed successfully on both my master and feature branches)
DataFrameSuite (passed successfully on both my master and feature branches)
SQLQueryTestSuite (passed successfully on both my master and feature branches)
ForeachSinkSuite (passed successfully on both my master and feature branches)
JavaAPISuite (failed on both my master and feature branches)

The master branch does not have any code changes from me; the feature branch of course does. I am running individual tests by issuing commands like the following from the root directory, based on the documentation: ./build/mvn test -P... -DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite

Therefore my conclusion so far, based on the above Jenkins report, is that my changes have not introduced any new failures that were not already there. [~srowen] please let me know if my methodology is off anywhere.

was (Author: kanjilal):
Ok, I've spent the last hour or so doing a deeper investigation into the failures. I used https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68531/ as a point of reference; listed below is what I found:

java/scala test            my master branch   my feature branch
LogisticRegressionSuite    success            success
OneVsRestSuite             success            success
DataFrameStatSuite         success            success
DataFrameSuite             success            success
SQLQueryTestSuite          success            success
ForeachSinkSuite           success            success
JavaAPISuite               failure            failure

The master branch does not have any code changes from me; the feature branch of course does. I am running individual tests by issuing commands like the following from the root directory, based on the documentation: ./build/mvn test -P... -DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite

Therefore my conclusion so far, based on the above Jenkins report, is that my changes have not introduced any new failures that were not already there. [~srowen] please let me know if my methodology is off anywhere.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
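The failure mode described in the issue quote above can be reproduced without Spark at all. The following plain-Python sketch (all names are illustrative, not Spark APIs) seeds a random number generator with the partition ID, the pattern the issue description warns about, and shows that the combined result is deterministic for a given partition count but differs between a 2-way split (as under local[2]) and a 4-way split (as under local[4]).

```python
import random

def partition(data, n):
    """Split data into n contiguous partitions (simplified stand-in
    for how a local[n] master would split an RDD)."""
    size = (len(data) + n - 1) // n
    return [data[i * size:(i + 1) * size] for i in range(n)]

def noisy_sum(data, num_partitions):
    """Sum the data plus per-partition random noise, seeding the RNG
    with the partition ID -- the dependence on partition IDs that the
    issue description calls out."""
    total = 0.0
    for pid, part in enumerate(partition(data, num_partitions)):
        rng = random.Random(pid)  # seed depends on partition ID
        total += sum(x + rng.random() for x in part)
    return total

data = list(range(100))

# Same data, same per-partition seeds -- but a different number of
# partitions gives a different answer, so a test baked against one
# partition count fails under the other:
print(noisy_sum(data, 2))
print(noisy_sum(data, 4))
```

Each call is fully deterministic; only the partition count moves the result, which is why aligning Scala/Java tests on the same `local[n]` as Python makes expected values comparable.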
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15658144#comment-15658144 ] Saikat Kanjilal commented on SPARK-9487: No, they don't, which is why I asked. I will dig into these and resubmit. Point taken about opening another PR.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15658055#comment-15658055 ] Saikat Kanjilal commented on SPARK-9487: [~srowen], given this is my first patch, I wanted to understand a few things. I was looking at https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68529/consoleFull and it's not clear that these unit-test failures are in any way related to my changes. Any insight on this? Are these tests that happen to fail due to other missing dependencies, and if so, is someone else working to fix them? I will move on to the Python unit tests under the same PR next.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15657646#comment-15657646 ] Saikat Kanjilal commented on SPARK-9487: [~srowen], if there are no further issues I will: 1) start working on another pull request to fix all the Python unit-test issues; 2) in that pull request, include the fixes for the example as well as for TestSQLContext. Any objections to merging the pull request above?
[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651806#comment-15651806 ] Saikat Kanjilal edited comment on SPARK-9487 at 11/9/16 7:22 PM:
-
OK, for some odd reason my local branch had the changes but they weren't committed. The PR is here: https://github.com/skanjila/spark/commit/ec0b2a81dc8362e84e70457873560d997a7cb244
I added the change to local[4] to both streaming and repl. Based on what I'm seeing locally, all Java/Scala changes should be accounted for and the unit tests pass; the only exception is the code inside the Spark examples in PageViewStream.scala. Should I change this? It seems like it doesn't belong as part of the unit tests. My next TODOs: 1) change the example code in PageViewStream if it makes sense; 2) start the code changes to fix the Python unit tests. Let me know thoughts or concerns.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651559#comment-15651559 ] Saikat Kanjilal commented on SPARK-9487: Sorry, I forgot to reply to your other question. From my checks I believe I have made all the Java and Scala changes; a simple find in IntelliJ IDEA shows only the Python changes outstanding.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651553#comment-15651553 ] Saikat Kanjilal commented on SPARK-9487: I definitely want to figure out these test failures as a next step; however, I'd like folks to have the benefit of the Scala and Java changes independent of the Python work. If that makes sense, what are the next steps to commit this pull request with only the Scala/Java changes? To that end I will create a sub-branch focused on the Python work and merge those changes into my current branch once the tests are fixed. [~srowen], let me know your thoughts and whether the above makes sense.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651496#comment-15651496 ] Saikat Kanjilal commented on SPARK-9487: OK, I have moved on to Python. I am attaching a log with the test errors that occur upon changing local[2] to local[4] in the Python ml module:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage 49:> (0 + 3) / 3]
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
**********************************************************************
File "/Users/skanjila/code/opensource/spark/python/pyspark/ml/clustering.py", line 98, in __main__.GaussianMixture
Failed example: model.gaussiansDF.show()
Expected:
+--------------------+--------------------+
|                mean|                 cov|
+--------------------+--------------------+
|[0.8250140229...|0.0056256...|
|[-0.4777098016092...|0.167969502720916...|
|[-0.4472625243352...|0.167304119758233...|
+--------------------+--------------------+
...
Got:
+--------------------+--------------------+
|                mean|                 cov|
+--------------------+--------------------+
|[-0.6158006194417...|0.132188091748508...|
|[0.54523101952701...|0.159129291449328...|
|[0.54042985246699...|0.161430620150745...|
+--------------------+--------------------+
...
**********************************************************************
File "/Users/skanjila/code/opensource/spark/python/pyspark/ml/clustering.py", line 123, in __main__.GaussianMixture
Failed example: model2.gaussiansDF.show()
(same Expected/Got mismatch as above)
**********************************************************************
File "/Users/skanjila/code/opensource/spark/python/pyspark/ml/clustering.py", line 656, in __main__.LDA
Failed example: model.describeTopics().show()
Expected:
+-----+-----------+--------------------+
|topic|termIndices|         termWeights|
+-----+-----------+--------------------+
|    0|     [1, 0]|[0.50401530077160...|
|    1|     [0, 1]|[0.50401530077160...|
+-----+-----------+--------------------+
...
Got:
+-----+-----------+--------------------+
|topic|termIndices|         termWeights|
+-----+-----------+--------------------+
|    0|     [1, 0]|[0.50010191915681...|
|    1|     [0, 1]|[0.50010191915681...|
+-----+-----------+--------------------+
**********************************************************************
File "/Users/skanjila/code/opensource/spark/python/pyspark/ml/clustering.py", line 664, in __main__.LDA
Failed example: model.topicsMatrix()
Expected: DenseMatrix(2, 2, [0.496, 0.504, 0.504, 0.496], 0)
Got: DenseMatrix(2, 2, [0.4999, 0.5001, 0.5001, 0.4999], 0)
**********************************************************************
2 items had failures: 2 of 21 in __main__.GaussianMixture, 2 of 20 in __main__.LDA
***Test Failed*** 4 failures.
[~srowen] [~holdenk], thoughts on next steps: should this pull request also contain the code fixes for the errors that occur when changing local[2] to local[4], or should we break the pull request into subcomponents, one for the Scala pieces already submitted and one for fixing the Python code to work with local[4]?
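The doctest drift above (expected 0.504..., got 0.500...) is what the issue description predicts: values that depend on how rows are assigned to partitions move when the worker-thread count changes. As a toy illustration (plain Python with hypothetical names, not the actual MLlib algorithm), a clustering-style initializer that derives per-partition randomness from the partition ID produces a different "model" under 2 partitions than under 4, so doctest output recorded under one setting fails under the other.

```python
import random

def init_centers(data, num_partitions, k=2):
    """Toy stand-in for a clustering initializer: each partition
    proposes one candidate center (its mean plus noise seeded with the
    partition ID), and the k smallest proposals win. Loosely mirrors,
    for illustration only, initialization that depends on partitioning."""
    size = (len(data) + num_partitions - 1) // num_partitions
    proposals = []
    for pid in range(num_partitions):
        part = data[pid * size:(pid + 1) * size]
        if part:
            rng = random.Random(pid)  # seed = partition ID
            proposals.append(sum(part) / len(part) + rng.random())
    return sorted(proposals)[:k]

points = [float(i) for i in range(20)]

# A doctest recorded against the 2-partition run bakes in the first
# result; the 4-partition run proposes different centers, so the
# recorded expected output no longer matches:
print(init_centers(points, 2))
print(init_centers(points, 4))
```

Each run is deterministic in isolation, which is why the doctests pass reliably at local[2] and fail reliably at local[4]: either the expected values or the partition count has to change, not both.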
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634342#comment-15634342 ] Saikat Kanjilal commented on SPARK-9487: Added local[4] to repl, Spark SQL, and streaming; all tests pass. The pull request is here: https://github.com/apache/spark/compare/master...skanjila:spark-9487
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633695#comment-15633695 ] Saikat Kanjilal commented on SPARK-9487: [~srowen], [~holdenk], what are the next steps to drive this to the finish line? Should I continue adding to this pull request and keep making the local[2] -> local[4] changes throughout the codebase? I would love some insight.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620657#comment-15620657 ] Saikat Kanjilal commented on SPARK-9487: Added the org.apache.spark.mllib unit-test changes to the pull request.
[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620613#comment-15620613 ] Saikat Kanjilal edited comment on SPARK-9487 at 10/30/16 9:46 PM:
--
[~srowen], yes, I read through that link and adjusted the PR title. Please do let me know whether I can proceed with adding more to this PR, including Python and other parts of the codebase.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15620392#comment-15620392 ] Saikat Kanjilal commented on SPARK-9487: The PR is attached here: https://github.com/apache/spark/pull/15689 I only changed everything to local[4] in core and ran the unit tests; all of them passed successfully. This is a WIP, so once folks have reviewed this initial request and signed off, I will start changing the Python pieces. [~holdenk] [~srowen], let me know the next steps.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599904#comment-15599904 ] Saikat Kanjilal commented on SPARK-9487: Ping on this. [~holdenk], can you let me know whether I can move ahead with the approach above? Thanks.
[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587627#comment-15587627 ] Saikat Kanjilal edited comment on SPARK-9487 at 10/19/16 4:24 AM:
--
[~holdenk], I'm finally getting time to look at this, so I am starting small: I changed ContextCleanerSuite and HeartbeatReceiverSuite from local[2] to local[4]. Per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version), I ran mvn -P hadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test and everything worked; I then ran mvn -P hadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test and everything worked as well. See the attachments, and let me know whether this is the right process for running single unit tests. I'll start making the changes to the other suites next. How would you like to see the output: should I just attach it, or open a pull request from the new branch I created? Thanks.
PS Another question: running single unit tests like this takes forever. Are there flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with an SSD the builds shouldn't take this long :(. Let me know the next steps to get this into a PR.
[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587627#comment-15587627 ] Saikat Kanjilal edited comment on SPARK-9487 at 10/19/16 4:24 AM: -- [~holdenk] Finally getting time to look at this, so I am starting small: I changed ContextCleanerSuite and HeartbeatReceiverSuite from local[2] to local[4], per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version). I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test and everything passed; I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test and that passed as well. See the attachments and let me know if this is the right process for running single unit tests; if so, I'll start making the change in the other suites. How would you like to see the output: should I just add attachments, or open a pull request from the new branch that I created? Thanks. PS: Another question: running single unit tests like this takes forever; are there flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with an SSD, the builds shouldn't take this long. Let me know the next steps to get this into a PR. 
was (Author: kanjilal): [~holdenk] Finally getting time to look at this, so I am starting small: I changed ContextCleanerSuite and HeartbeatReceiverSuite from local[2] to local[4], per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version). I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test and everything passed; I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test and that passed as well. See the attachments and let me know if this is the right process for running single unit tests; if so, I'll start making the change in the other suites. How would you like to see the output: should I just add attachments, or open a pull request from the new branch that I created? Thanks. PS: Another question: running single unit tests like this takes forever; are there flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with an SSD, the builds shouldn't take this long. Let me know the next steps to get this into a PR.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587627#comment-15587627 ] Saikat Kanjilal commented on SPARK-9487: [~holdenk] Finally getting time to look at this, so I am starting small: I changed ContextCleanerSuite and HeartbeatReceiverSuite from local[2] to local[4], per the documentation here (http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version). I ran mvn -Phadoop2 -Dsuites=org.apache.spark.HeartbeatReceiverSuite test and everything passed; I then ran mvn -Phadoop2 -Dsuites=org.apache.spark.ContextCleanerSuite test and that passed as well. See the attachments and let me know if this is the right process for running single unit tests; if so, I'll start making the change in the other suites. How would you like to see the output: should I just add attachments, or open a pull request from the new branch that I created? Thanks. PS: Another question: running single unit tests like this takes forever; are there flags I can set to speed up the builds? Even on my 15-inch MacBook Pro with an SSD, the builds shouldn't take this long. Let me know the next steps to get this into a PR. 
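The issue description quoted above notes that when an operation depends on partition IDs (e.g., a random number generator), local[2] and local[4] can produce different results. A plain-Python sketch of that effect, with no Spark required; the helper names (generate_partition, generate_dataset) are made up for this illustration:

```python
import random

def generate_partition(partition_id, rows_per_partition):
    # Seed the generator with the partition ID, as a per-partition RNG might.
    rng = random.Random(partition_id)
    return [rng.random() for _ in range(rows_per_partition)]

def generate_dataset(total_rows, num_partitions):
    # Split the work evenly across partitions, like local[N] worker threads.
    rows_per_partition = total_rows // num_partitions
    data = []
    for pid in range(num_partitions):
        data.extend(generate_partition(pid, rows_per_partition))
    return data

two = generate_dataset(8, 2)   # analogous to local[2]
four = generate_dataset(8, 4)  # analogous to local[4]
# Same row count, different values past the first partition boundary.
print(two == four)  # prints False
```

A test that asserts on exact generated values would therefore be tied to the worker-thread count, which is why aligning Scala/Java and Python on the same number matters.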
[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saikat Kanjilal updated SPARK-9487: --- Attachment: ContextCleanerSuiteResults
[jira] [Updated] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saikat Kanjilal updated SPARK-9487: --- Attachment: HeartbeatReceiverSuiteResults
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578723#comment-15578723 ] Saikat Kanjilal commented on SPARK-9487: Synced the code; I am first familiarizing myself with how to run unit tests and work in the code, [~srowen] [~holdenk]. Next steps will be to run the unit tests and report the results here. Stay tuned.
[jira] [Commented] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15570185#comment-15570185 ] Saikat Kanjilal commented on SPARK-14516: - Hello, I am new to Spark and interested in helping build a general-purpose clustering evaluator. Is the goal to use the metrics to evaluate overall clustering quality? [~akamal] [~josephkb], let me know how I can help. > Clustering evaluator > > > Key: SPARK-14516 > URL: https://issues.apache.org/jira/browse/SPARK-14516 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: zhengruifeng >Priority: Minor > > MLlib does not have any general-purpose clustering metrics with a ground > truth. > In > [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), > there are several kinds of metrics for this. > It may be meaningful to add some clustering metrics to MLlib. > This should be added as a {{ClusteringEvaluator}} class extending > {{Evaluator}} in spark.ml.
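As a sketch of what a ground-truth-free clustering metric could look like, here is a toy silhouette coefficient in plain Python. This is only an illustration of the kind of metric such an evaluator might expose, not the design of Spark's eventual {{ClusteringEvaluator}}:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def silhouette(points, labels):
    # Mean silhouette over all points: for each point, a = mean distance to
    # its own cluster, b = mean distance to the nearest other cluster.
    scores = []
    for i, p in enumerate(points):
        own = [euclidean(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        a = sum(own) / len(own) if own else 0.0
        other_means = []
        for lab in set(labels) - {labels[i]}:
            d = [euclidean(p, q) for j, q in enumerate(points) if labels[j] == lab]
            other_means.append(sum(d) / len(d))
        b = min(other_means)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels = [0, 0, 1, 1]
print(silhouette(points, labels))  # close to 1.0 for well-separated clusters
```

Scores range from -1 (badly assigned) to 1 (tight, well-separated clusters), so no reference labeling is needed.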
[jira] [Commented] (SPARK-12372) Document limitations of MLlib local linear algebra
[ https://issues.apache.org/jira/browse/SPARK-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15570152#comment-15570152 ] Saikat Kanjilal commented on SPARK-12372: - [~josephkb], I am new to contributing to Spark; is this something I can help with? > Document limitations of MLlib local linear algebra > -- > > Key: SPARK-12372 > URL: https://issues.apache.org/jira/browse/SPARK-12372 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Christos Iraklis Tsatsoulis > > This JIRA is now for documenting limitations of MLlib's local linear algebra > types. Basically, we should make it clear in the user guide that they > provide simple functionality but are not a full-fledged local linear algebra > library. > We should also recommend libraries for users to use in the meantime: > probably Breeze for Scala (and Java?) and numpy/scipy for Python. > *Original JIRA title*: Unary operator "-" fails for MLlib vectors > *Original JIRA text, as an example of the need for better docs*: > Consider the following snippet in pyspark 1.5.2: > {code:none} > >>> from pyspark.mllib.linalg import Vectors > >>> x = Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]) > >>> x > DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]) > >>> -x > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > TypeError: func() takes exactly 2 arguments (1 given) > >>> y = Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]) > >>> y > DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]) > >>> x-y > DenseVector([-2.0, 1.0, -3.0, 3.0, -5.0]) > >>> -y+x > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > TypeError: func() takes exactly 2 arguments (1 given) > >>> -1*x > DenseVector([-0.0, -1.0, -0.0, -7.0, -0.0]) > {code} > Clearly, the unary operator {{-}} (minus) for vectors fails, giving errors > for expressions like {{-x}} and {{-y+x}}, despite the fact that {{x-y}} > behaves as expected. 
> The last operation, {{-1*x}}, although mathematically "correct", includes > minus signs for the zero entries, which again is normally not expected.
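The traceback above is consistent with a wrapper type that defines the binary operators ({{__sub__}}, {{__mul__}}, ...) but not Python's unary {{__neg__}} hook. A toy class (illustrative only, not pyspark's actual DenseVector implementation) showing the missing piece:

```python
class ToyDenseVector:
    """Minimal dense-vector stand-in to illustrate operator hooks."""

    def __init__(self, values):
        self.values = list(values)

    def __sub__(self, other):
        # Binary minus: x - y works because this hook exists.
        return ToyDenseVector(a - b for a, b in zip(self.values, other.values))

    def __neg__(self):
        # Unary minus: without this hook, -x raises a TypeError,
        # just like in the quoted pyspark 1.5.2 session.
        return ToyDenseVector(-v for v in self.values)

    def __repr__(self):
        return "ToyDenseVector(%r)" % (self.values,)

x = ToyDenseVector([0.0, 1.0, 0.0, 7.0, 0.0])
print((-x).values)  # [-0.0, -1.0, -0.0, -7.0, -0.0]
```

Note that the `-0.0` entries mirror the issue's observation about `-1*x`: IEEE 754 negation of 0.0 yields a signed zero, which compares equal to 0.0 but prints with a minus sign.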
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15570122#comment-15570122 ] Saikat Kanjilal commented on SPARK-9487: Hello all, can I help with this in any way? Thanks
[jira] [Comment Edited] (SPARK-14212) Add configuration element for --packages option
[ https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15570125#comment-15570125 ] Saikat Kanjilal edited comment on SPARK-14212 at 10/12/16 10:53 PM: holdenk@ can I help out with this? was (Author: kanjilal): heldenk@ can I help out with this? > Add configuration element for --packages option > --- > > Key: SPARK-14212 > URL: https://issues.apache.org/jira/browse/SPARK-14212 > Project: Spark > Issue Type: New Feature > Components: Documentation, PySpark >Affects Versions: 1.6.1 >Reporter: Russell Jurney >Priority: Trivial > Labels: config, starter > > I use PySpark with the --packages option, for instance to load support for > CSV: > pyspark --packages com.databricks:spark-csv_2.10:1.4.0 > I would like to not have to set this every time at the command line, so a > corresponding element for --packages in the configuration file > spark-defaults.conf would be good to have.
[jira] [Commented] (SPARK-14212) Add configuration element for --packages option
[ https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15570125#comment-15570125 ] Saikat Kanjilal commented on SPARK-14212: - heldenk@ can I help out with this?
[jira] [Comment Edited] (SPARK-14212) Add configuration element for --packages option
[ https://issues.apache.org/jira/browse/SPARK-14212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15570125#comment-15570125 ] Saikat Kanjilal edited comment on SPARK-14212 at 10/12/16 10:54 PM: @holdenk can I help out with this? was (Author: kanjilal): holdenk@ can I help out with this?
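For what it's worth, later Spark releases expose a configuration property, {{spark.jars.packages}} (a comma-separated list of Maven coordinates), that can be set in spark-defaults.conf to get the effect requested here; whether it is available in the 1.6.1 line discussed above should be verified against the docs for that version. A hypothetical entry:

```
# spark-defaults.conf (hypothetical example; key availability depends on the Spark version)
spark.jars.packages  com.databricks:spark-csv_2.10:1.4.0
```

With this set, plain `pyspark` picks up the package without repeating `--packages` on every invocation.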
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266029#comment-15266029 ] Saikat Kanjilal commented on SPARK-14302: - Works for me, so what else can I help with? > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265118#comment-15265118 ] Saikat Kanjilal commented on SPARK-14302: - And here is the duplication inside the mllib directory. Tasks in common: initializing the SparkContext and calling MLUtils.loadLibSVMFile. correlations and correlations_example share Statistics.corr and should be generalized like correlations_example; gaussian_mixture_model duplicates its example, as do kmeans and kmeans_example, and word2vec and word2vec_example. We really should think about combining the example Python files with the actual algorithms, for example kmeans and kmeans_example; what do you think? Let me know your thoughts on my duplication-removal ideas above before I make any code changes.
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264207#comment-15264207 ] Saikat Kanjilal commented on SPARK-14302: - OK, I finished my initial assessment of the ml directory; here are the things I see duplicated: 1) initialization of the SparkContext, 2) initialization of the SQLContext, 3) creation of DataFrames. Xusen, should we move the functionality above into a common Python file referenced by all the other files? I can create a Common.py, remove all the above code from the files, and have it referenced where needed. I will next begin my assessment of the mllib directory.
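The shared-module idea above can be sketched in plain Python; the module name (example_common.py) and every helper here are hypothetical, and the real version would wrap SparkContext/SQLContext creation and DataFrame loading rather than the stdlib stand-ins shown:

```python
import argparse

def parse_example_args(description, default_input):
    # Shared CLI setup that every example script currently repeats.
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("--input", default=default_input,
                        help="path to the example's input data")
    return parser

def load_points(lines):
    # Shared parsing logic used by several examples:
    # one point per line, whitespace-separated floats.
    return [[float(tok) for tok in line.split()] for line in lines if line.strip()]

# Each example script would then shrink to a few lines of its own logic:
args = parse_example_args("kmeans example", "data/kmeans_data.txt").parse_args([])
points = load_points(["0.0 0.0", "9.0 9.0"])
print(args.input, points)
```

Centralizing the setup also keeps the examples' behavior consistent, so a fix to context creation or data loading lands everywhere at once.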
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262825#comment-15262825 ] Saikat Kanjilal commented on SPARK-14302: - OK, thanks for the clarifications; so that means removing duplicated code only inside each of those directories individually. My bad, I thought it was between the two. Give me a few days to get a pull request together; I will need to revamp my approach, since I thought it was comparing between them. My apologies.
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262796#comment-15262796 ] Saikat Kanjilal commented on SPARK-14302: - Did you see my question above? Which of the directories is the one to keep? I was waiting for some clarification; I have a branch that I've been using to do this work in the meantime.
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230434#comment-15230434 ] Saikat Kanjilal commented on SPARK-14302: - Xusen, I didn't hear back on the above question, so I will merge the code the way I think it should work and send a patch; if you have any other thoughts, let me know. Thanks
[jira] [Comment Edited] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222649#comment-15222649 ] Saikat Kanjilal edited comment on SPARK-14302 at 4/2/16 2:09 AM: - Next question: which of the directories should contain the merged code? For example, I am looking at bisecting_k_means; the code in the two directories is very similar but uses slightly different APIs. My recommendation would be to merge this code into one directory (either ml or mllib). So, in general, when my patch merges code, which of the directories should I put the result in? Also, does one directory traditionally have the older code, so that I should always merge into the other? Thanks was (Author: kanjilal): Next question: which of the directories should contain the merged code? For example, I am looking at bisecting_k_means; the code in the two directories is very similar but one trains the model before throwing test data at the model. My recommendation would be to merge this code into one directory (either ml or mllib). So, in general, when my patch merges code, which of the directories should I put the result in? 
[jira] [Comment Edited] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222649#comment-15222649 ] Saikat Kanjilal edited comment on SPARK-14302 at 4/2/16 2:04 AM: - Next question: which of the directories should contain the merged code? For example, I am looking at bisecting_k_means; the code in the two directories is very similar but one trains the model before throwing test data at the model. My recommendation would be to merge this code into one directory (either ml or mllib). So, in general, when my patch merges code, which of the directories should I put the result in? was (Author: kanjilal): Next question: which of the directories should contain the merged code? For example, I am looking at bisecting_k_means; the code in the two directories is very similar but one trains the model before throwing test data. My recommendation would be to merge this code into one directory (either ml or mllib). So, in general, when my patch merges code, which of the directories should I put the result in? 
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222649#comment-15222649 ] Saikat Kanjilal commented on SPARK-14302: - Next question: which of the directories should contain the merged code? For example, I am looking at bisecting_k_means; the code in the two directories is very similar but one trains the model before throwing test data. My recommendation would be to merge this code into one directory (either ml or mllib). So, in general, when my patch merges code, which of the directories should I put the result in?
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220330#comment-15220330 ] Saikat Kanjilal commented on SPARK-14302: - Ok, thanks. So which piece should I take for merging/deleting example code, the Python one or the Java one?
[jira] [Comment Edited] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220312#comment-15220312 ] Saikat Kanjilal edited comment on SPARK-14302 at 3/31/16 5:51 PM: -- If I understand this correctly, the goal is to compare the code in python/examples/mllib and python/examples/ml and contribute a patch that dedupes it. One question: which is the correct directory for Spark to keep its Python examples in, ml or mllib? Actually, we also need to cover the Java directories you've listed: java/examples/ml and java/examples/mllib.
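The comparison pass described in that comment — scanning python/examples/mllib against python/examples/ml and flagging files whose content is essentially identical — could be sketched as a small helper like the one below. This is a hypothetical `find_duplicates` utility, not part of the Spark tree; it only catches files that match after trivial whitespace normalization, so near-duplicates such as the bisecting_k_means pair (where one version adds a train/test step) would still need manual review.

```python
import hashlib
from pathlib import Path

def content_digest(path: Path) -> str:
    """Digest of a file's non-blank, stripped lines, so trivial
    whitespace differences don't mask duplicates."""
    lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()

def find_duplicates(dir_a: Path, dir_b: Path) -> list:
    """Pair up .py files from dir_a and dir_b whose normalized content
    matches; each pair is a candidate for merging into one directory."""
    digests_a = {content_digest(p): p for p in sorted(dir_a.glob("*.py"))}
    pairs = []
    for p in sorted(dir_b.glob("*.py")):
        match = digests_a.get(content_digest(p))
        if match is not None:
            pairs.append((match, p))
    return pairs
```

Running it over the two example directories would produce the candidate list to merge; anything it misses (structurally similar but textually different files) falls into the issue's "unsure duplications, double check" bucket.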
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220303#comment-15220303 ] Saikat Kanjilal commented on SPARK-14302: - Hi, can I help with this issue? I've been wanting to contribute to Spark and finally have some time. Regards