[jira] [Commented] (SPARK-9460) Avoid byte array allocation in StringPrefixComparator
[ https://issues.apache.org/jira/browse/SPARK-9460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647305#comment-14647305 ] Apache Spark commented on SPARK-9460: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/7789 Avoid byte array allocation in StringPrefixComparator - Key: SPARK-9460 URL: https://issues.apache.org/jira/browse/SPARK-9460 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 StringPrefixComparator converts the long values back to byte arrays in order to compare them. We should be able to optimize this to compare the longs directly, rather than turning the longs into byte arrays and comparing them byte by byte.
{code}
public int compare(long aPrefix, long bPrefix) {
  // TODO: can be done more efficiently
  byte[] a = Longs.toByteArray(aPrefix);
  byte[] b = Longs.toByteArray(bPrefix);
  for (int i = 0; i < 8; i++) {
    int c = UnsignedBytes.compare(a[i], b[i]);
    if (c != 0) return c;
  }
  return 0;
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
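A minimal sketch of the suggested optimization (Scala; not necessarily the merged patch): flipping the sign bit of each operand makes an ordinary signed comparison agree with the unsigned, byte-wise ordering computed above.
{code}
// Hedged sketch, not the actual Spark change: compare the two prefixes as
// unsigned 64-bit values without allocating byte arrays. XOR-ing each operand
// with Long.MinValue flips the sign bit, so a signed comparison of the
// flipped values yields the unsigned (big-endian byte-wise) ordering.
def comparePrefixes(aPrefix: Long, bPrefix: Long): Int =
  java.lang.Long.compare(aPrefix ^ Long.MinValue, bPrefix ^ Long.MinValue)
{code}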
[jira] [Resolved] (SPARK-9335) Kinesis test hits rate limit
[ https://issues.apache.org/jira/browse/SPARK-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-9335. -- Resolution: Fixed Fix Version/s: 1.5.0 Kinesis test hits rate limit Key: SPARK-9335 URL: https://issues.apache.org/jira/browse/SPARK-9335 Project: Spark Issue Type: Bug Components: Streaming, Tests Reporter: Patrick Wendell Assignee: Tathagata Das Priority: Critical Fix For: 1.5.0 This test is failing many pull request builds because of rate limits: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38396/testReport/org.apache.spark.streaming.kinesis/KinesisBackedBlockRDDSuite/_It_is_not_a_test_/ I disabled the test. I wonder if it's better not to have this test run by default, since it's a bit brittle to depend on an external system like this (if Kinesis goes down, for instance, it will block all development). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8167) Tasks that fail due to YARN preemption can cause job failure
[ https://issues.apache.org/jira/browse/SPARK-8167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647313#comment-14647313 ] Jeff Zhang commented on SPARK-8167: --- [~mcheah] What's the status of this ticket? I don't think a blocking RPC call is a good idea. I think we could just send an executor-preempted message to the driver when the container is preempted, and let the driver decrease numTaskAttemptFails. Although we lose some consistency here, at least we could avoid job failures due to preemption. And I think there's some gap between 2 consecutive failed task attempts; very likely the driver will have received the executor-preempted message within that gap. Thoughts? Tasks that fail due to YARN preemption can cause job failure Key: SPARK-8167 URL: https://issues.apache.org/jira/browse/SPARK-8167 Project: Spark Issue Type: Bug Components: Scheduler, YARN Affects Versions: 1.3.1 Reporter: Patrick Woody Assignee: Matt Cheah Priority: Blocker Tasks that are running on preempted executors will count as FAILED with an ExecutorLostFailure. Unfortunately, this can quickly spiral out of control if a large resource shift is occurring, and the tasks get scheduled to executors that immediately get preempted as well. The current workaround is to increase spark.task.maxFailures very high, but that can cause delays in true failures. We should ideally differentiate these task statuses so that they don't count towards the failure limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
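A sketch of the message-based idea floated in the comment above. This is illustrative only: ExecutorPreempted and notifyDriver are hypothetical names, not Spark's actual API; ContainerExitStatus.PREEMPTED is a real YARN constant.
{code}
import org.apache.hadoop.yarn.api.records.{ContainerExitStatus, ContainerStatus}

// Hypothetical driver message, not part of Spark's actual protocol.
case class ExecutorPreempted(executorId: String)

// When YARN reports a completed container, detect preemption and notify the
// driver so the lost tasks are not counted against spark.task.maxFailures.
def handleCompletedContainer(
    status: ContainerStatus,
    executorId: String,
    notifyDriver: ExecutorPreempted => Unit): Unit = {
  if (status.getExitStatus == ContainerExitStatus.PREEMPTED) {
    notifyDriver(ExecutorPreempted(executorId))
  }
}
{code}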
[jira] [Created] (SPARK-9470) Java API function interface cleanup
Rahul Kavale created SPARK-9470: --- Summary: Java API function interface cleanup Key: SPARK-9470 URL: https://issues.apache.org/jira/browse/SPARK-9470 Project: Spark Issue Type: Improvement Reporter: Rahul Kavale Priority: Trivial Hi guys, I was exploring the Spark codebase and came across the Java API function interfaces. The interfaces declare the 'call' method as 'public', which is redundant since interface methods are implicitly public. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6873) Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements
[ https://issues.apache.org/jira/browse/SPARK-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647322#comment-14647322 ] Pete Robbins commented on SPARK-6873: - We've been trying to get a clean build/test using Java 8, and we see these errors, so I think this is still a problem. It looks like the Catalyst output changes from Java 7 to Java 8. Is the ordering supposed to be defined for this, or is the ordering really unimportant? Some Hive-Catalyst comparison tests fail due to unimportant order of some printed elements -- Key: SPARK-6873 URL: https://issues.apache.org/jira/browse/SPARK-6873 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Cheng Lian Priority: Minor As I mentioned, I've been seeing 4 test failures in Hive tests for a while, and actually it still affects master. I think it's a superficial problem that only turns up when running on Java 8, but still, it would probably be an easy fix and good to fix. Specifically, here are four tests and the bit that fails the comparison, below. I tried to diagnose this but had trouble even finding where some of this occurs, like the list of synonyms.
{code}
- show_tblproperties *** FAILED ***
  Results do not match for show_tblproperties:
  ...
  !== HIVE - 2 row(s) ==    == CATALYST - 2 row(s) ==
  !tmp  true                bar  bar value
  !bar  bar value           tmp  true (HiveComparisonTest.scala:391)
{code}
{code}
- show_create_table_serde *** FAILED ***
  Results do not match for show_create_table_serde:
  ...
   WITH SERDEPROPERTIES (            WITH SERDEPROPERTIES (
  !  'serialization.format'='$',       'field.delim'=',',
  !  'field.delim'=',')                'serialization.format'='$')
{code}
{code}
- udf_std *** FAILED ***
  Results do not match for udf_std:
  ...
  !== HIVE - 2 row(s) ==    == CATALYST - 2 row(s) ==
  std(x) - Returns the standard deviation of a set of numbers    std(x) - Returns the standard deviation of a set of numbers
  !Synonyms: stddev_pop, stddev    Synonyms: stddev, stddev_pop (HiveComparisonTest.scala:391)
{code}
{code}
- udf_stddev *** FAILED ***
  Results do not match for udf_stddev:
  ...
  !== HIVE - 2 row(s) ==    == CATALYST - 2 row(s) ==
  stddev(x) - Returns the standard deviation of a set of numbers    stddev(x) - Returns the standard deviation of a set of numbers
  !Synonyms: stddev_pop, std    Synonyms: std, stddev_pop (HiveComparisonTest.scala:391)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9408) Refactor mllib/linalg.py to mllib/linalg
[ https://issues.apache.org/jira/browse/SPARK-9408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647862#comment-14647862 ] Xiangrui Meng commented on SPARK-9408: -- If we want to have a distributed matrix API in Python, this is required. Refactor mllib/linalg.py to mllib/linalg Key: SPARK-9408 URL: https://issues.apache.org/jira/browse/SPARK-9408 Project: Spark Issue Type: Task Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Manoj Kumar We need to refactor mllib/linalg.py to mllib/linalg so that the project structure is similar to that of Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7583) User guide update for RegexTokenizer
[ https://issues.apache.org/jira/browse/SPARK-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647904#comment-14647904 ] yuhao yang commented on SPARK-7583: --- I'd like to give this a try if it's still needed. User guide update for RegexTokenizer Key: SPARK-7583 URL: https://issues.apache.org/jira/browse/SPARK-7583 Project: Spark Issue Type: Documentation Components: Documentation, ML Reporter: Joseph K. Bradley Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} Note: I created a new subsection for links to spark.ml-specific guides in this JIRA's PR: [SPARK-7557]. This transformer can go within the new subsection. I'll try to get that PR merged ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length
[ https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9277. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7794 [https://github.com/apache/spark/pull/7794] SparseVector constructor must throw an error when declared number of elements less than array length Key: SPARK-9277 URL: https://issues.apache.org/jira/browse/SPARK-9277 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Andrey Vykhodtsev Priority: Minor Labels: starter Fix For: 1.5.0 Attachments: SparseVector test.html, SparseVector test.ipynb I found that one can create a SparseVector inconsistently, and it will lead to a Java error at runtime, for example when training LogisticRegressionWithSGD. Here is the test case:
In [2]: sc.version
Out[2]: u'1.3.1'
In [13]: from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD
In [3]: x = SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
In [10]: l = LabeledPoint(0, x)
In [12]: r = sc.parallelize([l])
In [14]: m = LogisticRegressionWithSGD.train(r)
Error: Py4JJavaError: An error occurred while calling o86.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
Attached is the notebook with the scenario and the full message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
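A hedged sketch of the requested validation (illustrative Scala class, not the actual MLlib patch): fail fast at construction time instead of surfacing an ArrayIndexOutOfBoundsException deep inside training.
{code}
// Illustrative sketch of the requested check: reject indices that do not fit
// the declared vector size as soon as the vector is constructed.
class SparseVectorSketch(size: Int, indices: Array[Int], values: Array[Double]) {
  require(indices.length == values.length,
    s"indices length ${indices.length} does not match values length ${values.length}")
  require(indices.forall(i => i >= 0 && i < size),
    s"all indices must be in [0, $size), got ${indices.mkString("[", ",", "]")}")
}
{code}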
[jira] [Commented] (SPARK-9461) Possibly slightly flaky PySpark StreamingLinearRegressionWithTests
[ https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647884#comment-14647884 ] Jeremy Freeman commented on SPARK-9461: --- Interesting, the lack of failures on KMeans is consistent with the completion idea, because those tests use extremely toy data (< 10 data points), whereas the regression ones use 100s of test data points. In this line (and similar ones): https://github.com/apache/spark/blob/master/python/pyspark/mllib/tests.py#L1161 there's a parameter `end_time` that's the time to wait in seconds for it to complete. Looking across these tests, the value fluctuates (5, 10, 15, 20), suggesting that it was hand-tuned, possibly tailored to a local test environment. Bumping that number up for any of the tests showing occasional errors might fix it, though that's a little ad-hoc. I think things are more robust on the Scala side because there's a full-blown streaming test class that lets test jobs either run to completion, or until a max timeout (https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala). So there's just one test-wide parameter, the max timeout, and we could safely set that pretty high without wasting time. Possibly slightly flaky PySpark StreamingLinearRegressionWithTests -- Key: SPARK-9461 URL: https://issues.apache.org/jira/browse/SPARK-9461 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.5.0 Reporter: Joseph K. Bradley Assignee: Jeremy Freeman [~freeman-lab] Check out this failure: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38913/consoleFull] It should be deterministic, but do you think it's just slight variations caused by the Python version? Or do you think it's something odd going on with streaming? This is the only time I've seen this happen, but I'll post again if I see it more. Test failure message:
{code}
==
FAIL: test_parameter_accuracy (__main__.StreamingLinearRegressionWithTests)
Test that coefs are predicted accurately by fitting on toy data.
--
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py", line 1282, in test_parameter_accuracy
    slr.latestModel().weights.array, [10., 10.], 1)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py", line 1257, in assertArrayAlmostEqual
    self.assertAlmostEqual(i, j, dec)
AssertionError: 9.4243238731093655 != 9.3216175551722014 within 1 places
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
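A minimal sketch of the single max-timeout pattern described above (illustrative helper, not the actual TestSuiteBase API): poll for completion and fail only after one generous global bound, so fast runs finish early and slow ones are not hand-tuned per test.
{code}
// Illustrative helper: block until the condition holds or a single, generous
// maximum timeout elapses. Returns true if the condition became true in time.
def waitUntil(maxWaitMs: Long, pollMs: Long = 100L)(condition: => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + maxWaitMs
  while (!condition) {
    if (System.currentTimeMillis() > deadline) return false
    Thread.sleep(pollMs)
  }
  true
}
{code}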
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647893#comment-14647893 ] K S Sreenivasa Raghavan commented on SPARK-6227: Hi, I have worked with PySpark in edX courses. The course coordinators distributed a Spark VM to all participants of the course. I am interested in developing this package, and I have even learnt Scala. I have a few doubts: 1. Please give me the proper steps to install Spark on my Ubuntu desktop, as I have no idea how to modify the Spark code in the VM. I tried all the methods a Google search turned up, but they failed. 2. For PySpark, where should we write/modify the code? PCA and SVD for PySpark --- Key: SPARK-6227 URL: https://issues.apache.org/jira/browse/SPARK-6227 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.2.1 Reporter: Julien Amelot The dimensionality reduction techniques are not available via Python (Scala + Java only). * Principal component analysis (PCA) * Singular value decomposition (SVD) Doc: http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
[ https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2089. -- Resolution: Won't Fix With YARN, preferredNodeLocalityData isn't honored --- Key: SPARK-2089 URL: https://issues.apache.org/jira/browse/SPARK-2089 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical When running in YARN cluster mode, apps can pass preferred locality data when constructing a Spark context that will dictate where to request executor containers. This is currently broken because of a race condition. The Spark-YARN code runs the user class and waits for it to start up a SparkContext. During its initialization, the SparkContext will create a YarnClusterScheduler, which notifies a monitor in the Spark-YARN code that the SparkContext has been created. The Spark-YARN code then immediately fetches the preferredNodeLocationData from the SparkContext and uses it to start requesting containers. But in the SparkContext constructor that takes the preferredNodeLocationData, setting preferredNodeLocationData comes after the rest of the initialization, so, if the Spark-YARN code comes around quickly enough after being notified, the data that's fetched is the empty, unset version. This occurred during all of my runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8175) date/time function: from_unixtime
[ https://issues.apache.org/jira/browse/SPARK-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8175. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7644 [https://github.com/apache/spark/pull/7644] date/time function: from_unixtime - Key: SPARK-8175 URL: https://issues.apache.org/jira/browse/SPARK-8175 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Fix For: 1.5.0 from_unixtime(bigint unixtime[, string format]): string Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00". See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8174) date/time function: unix_timestamp
[ https://issues.apache.org/jira/browse/SPARK-8174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-8174. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7644 [https://github.com/apache/spark/pull/7644] date/time function: unix_timestamp -- Key: SPARK-8174 URL: https://issues.apache.org/jira/browse/SPARK-8174 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Adrian Wang Priority: Blocker Fix For: 1.5.0 3 variants:
{code}
unix_timestamp(): long
Gets current Unix timestamp in seconds.

unix_timestamp(string|date): long
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return 0 if fail: unix_timestamp('2009-03-20 11:30:01') = 1237573801

unix_timestamp(string date, string pattern): long
Convert time string with given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to Unix time stamp (in seconds), return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
{code}
See: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
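A short spark-shell usage sketch covering this ticket and SPARK-8175 above, assuming the Hive-compatible semantics quoted in the descriptions; the expected values in the comments follow the examples given there.
{code}
// spark-shell sketch, assuming a Hive-compatible sqlContext:
sqlContext.sql("SELECT unix_timestamp('2009-03-20 11:30:01')").show()        // 1237573801
sqlContext.sql("SELECT unix_timestamp('2009-03-20', 'yyyy-MM-dd')").show()   // 1237532400
sqlContext.sql("SELECT from_unixtime(1237573801, 'yyyy-MM-dd HH:mm:ss')").show()
{code}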
[jira] [Commented] (SPARK-4449) specify port range in spark
[ https://issues.apache.org/jira/browse/SPARK-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648090#comment-14648090 ] Neelesh Srinivas Salian commented on SPARK-4449: I would like to pick this up and work on it. Could you please assign the JIRA to me? Thank you. specify port range in spark --- Key: SPARK-4449 URL: https://issues.apache.org/jira/browse/SPARK-4449 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Fei Wang Priority: Minor In some cases, we need to specify the port range used in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9231) DistributedLDAModel method for top topics per document
[ https://issues.apache.org/jira/browse/SPARK-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9231: --- Assignee: Apache Spark DistributedLDAModel method for top topics per document -- Key: SPARK-9231 URL: https://issues.apache.org/jira/browse/SPARK-9231 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Minor Original Estimate: 48h Remaining Estimate: 48h Helper method in DistributedLDAModel of this form:
{code}
/**
 * For each document, return the top k weighted topics for that document.
 * @return RDD of (doc ID, topic indices, topic weights)
 */
def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
{code}
I believe the above method signature will be Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9231) DistributedLDAModel method for top topics per document
[ https://issues.apache.org/jira/browse/SPARK-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9231: --- Assignee: (was: Apache Spark) DistributedLDAModel method for top topics per document -- Key: SPARK-9231 URL: https://issues.apache.org/jira/browse/SPARK-9231 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Original Estimate: 48h Remaining Estimate: 48h Helper method in DistributedLDAModel of this form:
{code}
/**
 * For each document, return the top k weighted topics for that document.
 * @return RDD of (doc ID, topic indices, topic weights)
 */
def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
{code}
I believe the above method signature will be Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9428) Add test cases for null inputs for expression unit tests
[ https://issues.apache.org/jira/browse/SPARK-9428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9428. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7748 [https://github.com/apache/spark/pull/7748] Add test cases for null inputs for expression unit tests Key: SPARK-9428 URL: https://issues.apache.org/jira/browse/SPARK-9428 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yijie Shen Assignee: Yijie Shen Priority: Blocker Fix For: 1.5.0 We need to audit expression unit tests to make sure we pass in null inputs to test null behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8786) Create a wrapper for BinaryType
[ https://issues.apache.org/jira/browse/SPARK-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647248#comment-14647248 ] Takeshi Yamamuro edited comment on SPARK-8786 at 7/30/15 6:28 AM: -- Sorry for the confusion; the current master branch in Spark does:
{code}
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val schema = StructType(StructField("x", BinaryType, nullable = false) :: Nil)
val data = sc.parallelize(Row(Array[Byte](1.toByte)) :: Row(Array[Byte](1.toByte)) :: Row(Array[Byte](2.toByte)) :: Nil)
val df = sqlContext.createDataFrame(data, schema)
df.registerTempTable("test")
sqlContext.sql("SELECT DISTINCT x FROM test").show()

+---+
|  x|
+---+
|[1]|
|[2]|
+---+
{code}
Create a wrapper for BinaryType --- Key: SPARK-8786 URL: https://issues.apache.org/jira/browse/SPARK-8786 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu The hashCode and equals() of Array[Byte] do not check the bytes; we should create a wrapper (internally) to do that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
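A minimal sketch of the wrapper idea (illustrative, not the internal class Spark ended up adding): give the bytes value-based equals/hashCode so hash-based operators such as DISTINCT deduplicate by content rather than by reference.
{code}
// Illustrative wrapper: Array[Byte] uses reference equality, so wrap it with
// content-based equals/hashCode before putting it into hash-based structures.
final class BinaryWrapper(val bytes: Array[Byte]) {
  override def equals(other: Any): Boolean = other match {
    case that: BinaryWrapper => java.util.Arrays.equals(this.bytes, that.bytes)
    case _ => false
  }
  override def hashCode: Int = java.util.Arrays.hashCode(bytes)
}
{code}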
[jira] [Assigned] (SPARK-9464) Add property-based tests for UTF8String
[ https://issues.apache.org/jira/browse/SPARK-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9464: --- Assignee: Apache Spark Add property-based tests for UTF8String --- Key: SPARK-9464 URL: https://issues.apache.org/jira/browse/SPARK-9464 Project: Spark Issue Type: New Feature Components: SQL Reporter: Josh Rosen Assignee: Apache Spark Priority: Critical UTF8String is a class that can benefit from ScalaCheck-style property checks. Let's add these. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9464) Add property-based tests for UTF8String
[ https://issues.apache.org/jira/browse/SPARK-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9464: -- Assignee: (was: Josh Rosen) Add property-based tests for UTF8String --- Key: SPARK-9464 URL: https://issues.apache.org/jira/browse/SPARK-9464 Project: Spark Issue Type: New Feature Components: SQL Reporter: Josh Rosen UTF8String is a class that can benefit from ScalaCheck-style property checks. Let's add these. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9464) Add property-based tests for UTF8String
[ https://issues.apache.org/jira/browse/SPARK-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9464: --- Assignee: (was: Apache Spark) Add property-based tests for UTF8String --- Key: SPARK-9464 URL: https://issues.apache.org/jira/browse/SPARK-9464 Project: Spark Issue Type: New Feature Components: SQL Reporter: Josh Rosen Priority: Critical UTF8String is a class that can benefit from ScalaCheck-style property checks. Let's add these. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9464) Add property-based tests for UTF8String
[ https://issues.apache.org/jira/browse/SPARK-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647252#comment-14647252 ] Josh Rosen commented on SPARK-9464: --- Unassigning this from myself since I don't have time to work on it in the short term. Feel free to use my WIP PR as a starting point. Add property-based tests for UTF8String --- Key: SPARK-9464 URL: https://issues.apache.org/jira/browse/SPARK-9464 Project: Spark Issue Type: New Feature Components: SQL Reporter: Josh Rosen UTF8String is a class that can benefit from ScalaCheck-style property checks. Let's add these. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9464) Add property-based tests for UTF8String
[ https://issues.apache.org/jira/browse/SPARK-9464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-9464: -- Target Version/s: 1.5.0 Priority: Critical (was: Major) Add property-based tests for UTF8String --- Key: SPARK-9464 URL: https://issues.apache.org/jira/browse/SPARK-9464 Project: Spark Issue Type: New Feature Components: SQL Reporter: Josh Rosen Priority: Critical UTF8String is a class that can benefit from ScalaCheck-style property checks. Let's add these. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7264) SparkR API for parallel functions
[ https://issues.apache.org/jira/browse/SPARK-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647268#comment-14647268 ] Rick Moritz commented on SPARK-7264: I've also added a bit of commentary. SparkR API for parallel functions - Key: SPARK-7264 URL: https://issues.apache.org/jira/browse/SPARK-7264 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Shivaram Venkataraman This is a JIRA to discuss design proposals for enabling parallel R computation in SparkR without exposing the entire RDD API. The rationale for this is that the RDD API has a number of low level functions and we would like to expose a more light-weight API that is both friendly to R users and easy to maintain. http://goo.gl/GLHKZI has a first cut design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9231) DistributedLDAModel method for top topics per document
[ https://issues.apache.org/jira/browse/SPARK-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647233#comment-14647233 ] Apache Spark commented on SPARK-9231: - User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/7785 DistributedLDAModel method for top topics per document -- Key: SPARK-9231 URL: https://issues.apache.org/jira/browse/SPARK-9231 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Original Estimate: 48h Remaining Estimate: 48h Helper method in DistributedLDAModel of this form:
{code}
/**
 * For each document, return the top k weighted topics for that document.
 * @return RDD of (doc ID, topic indices, topic weights)
 */
def topTopicsPerDocument(k: Int): RDD[(Long, Array[Int], Array[Double])]
{code}
I believe the above method signature will be Java-friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8005) Support INPUT__FILE__NAME virtual column
[ https://issues.apache.org/jira/browse/SPARK-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8005. Resolution: Fixed Assignee: Joseph Batchik Fix Version/s: 1.5.0 Support INPUT__FILE__NAME virtual column Key: SPARK-8005 URL: https://issues.apache.org/jira/browse/SPARK-8005 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Joseph Batchik Fix For: 1.5.0 INPUT__FILE__NAME: input file name. One way to do this is to do it through a thread-local variable in SqlNewHadoopRDD.scala, and read that thread-local variable in an expression (similar to the SparkPartitionID expression). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
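A hedged sketch of the thread-local approach the description suggests; the object and method names here are hypothetical, not Spark's actual internals.
{code}
// Illustrative sketch: the record-reading RDD publishes the current input
// file name into a thread-local, and the INPUT__FILE__NAME expression reads
// it back on the same task thread.
object InputFileNameHolder {
  private val currentFile = new ThreadLocal[String] {
    override def initialValue(): String = ""
  }
  // Called by the RDD when it opens a new input split.
  def setInputFileName(name: String): Unit = currentFile.set(name)
  // Called by the expression implementing INPUT__FILE__NAME.
  def getInputFileName(): String = currentFile.get()
}
{code}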
[jira] [Resolved] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8007. Resolution: Won't Fix These are now just functions. Support resolving virtual columns in DataFrames --- Key: SPARK-8007 URL: https://issues.apache.org/jira/browse/SPARK-8007 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Joseph Batchik Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to the SparkPartitionID expression. A cool use case is to understand physical data skew:
{code}
df.groupBy("SPARK__PARTITION__ID").count()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9396) Spark yarn allocator does not call removeContainerRequest for allocated Container requests, resulting in bloated ask[] to YARN RM.
[ https://issues.apache.org/jira/browse/SPARK-9396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647258#comment-14647258 ] prakhar jauhari commented on SPARK-9396: Can you please assign this issue to me? I am adding a PR. Spark yarn allocator does not call removeContainerRequest for allocated Container requests, resulting in bloated ask[] to YARN RM. --- Key: SPARK-9396 URL: https://issues.apache.org/jira/browse/SPARK-9396 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.1 Environment: Spark-1.2.1 on hadoop-yarn-2.4.0 cluster. All servers in cluster running Linux version 2.6.32. Reporter: prakhar jauhari Note: the attached logs contain log lines that I added (on the Spark YARN allocator side and the YARN client side) for debugging purposes. My Spark job is configured for 2 executors; on killing 1 executor, the ask is for 3! On killing an executor - resource request logs:
*Killed container: ask for 3 containers, instead of 1*
15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: Will allocate 1 executor containers, each with 2432 MB memory including 384 MB overhead
15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: numExecutors: 1
15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: host preferences is empty
15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: Container request (host: Any, priority: 1, capability: <memory:2432, vCores:4>)
15/07/15 10:49:01 INFO impl.AMRMClientImpl: prakhar : AMRMClientImpl : allocate: this.ask = [{Priority: 1, Capability: <memory:2432, vCores:4>, # Containers: 3, Location: *, Relax Locality: true}]
15/07/15 10:49:01 INFO impl.AMRMClientImpl: prakhar : AMRMClientImpl : allocate: allocateRequest = ask { priority { priority: 1 } resource_name: "*" capability { memory: 2432 virtual_cores: 4 } num_containers: 3 relax_locality: true } blacklist_request { } response_id: 354 progress: 0.1
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
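A hedged sketch of the proposed fix: getMatchingRequests and removeContainerRequest are real AMRMClient methods, but the surrounding wiring is illustrative. On each granted container, one matching outstanding request is removed so stale requests are not re-sent in the next allocate() ask.
{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.records.Container
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

// For every container YARN granted, drop one matching outstanding request
// from the AMRMClient bookkeeping so the ask[] does not keep growing.
def onContainersAllocated(
    amClient: AMRMClient[ContainerRequest],
    allocated: Seq[Container]): Unit = {
  allocated.foreach { container =>
    val matching = amClient.getMatchingRequests(
      container.getPriority, "*", container.getResource)
    matching.asScala.headOption
      .flatMap(_.asScala.headOption)
      .foreach(amClient.removeContainerRequest)
  }
}
{code}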
[jira] [Resolved] (SPARK-8002) Support virtual columns in SQL and DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8002. Resolution: Fixed Assignee: Reynold Xin Fix Version/s: 1.5.0 We ended up just creating functions to support these. Support virtual columns in SQL and DataFrames - Key: SPARK-8002 URL: https://issues.apache.org/jira/browse/SPARK-8002 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6319) DISTINCT doesn't work for binary type
[ https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647289#comment-14647289 ] Apache Spark commented on SPARK-6319: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/7787 DISTINCT doesn't work for binary type - Key: SPARK-6319 URL: https://issues.apache.org/jira/browse/SPARK-6319 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 Reporter: Cheng Lian Priority: Critical Spark shell session for reproduction:
{noformat}
scala> import sqlContext.implicits._
scala> import org.apache.spark.sql.types._
scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
...
CAST(c, BinaryType)
[B@43f13160
[B@5018b648
[B@3be22500
[B@476fc8a1
{noformat}
Spark SQL uses plain byte arrays to represent binary values. However, arrays are compared by reference rather than by value. On the other hand, the DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check for duplicated values. These two facts together cause the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6319) DISTINCT doesn't work for binary type
[ https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6319: --- Assignee: (was: Apache Spark) DISTINCT doesn't work for binary type - Key: SPARK-6319 URL: https://issues.apache.org/jira/browse/SPARK-6319 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 Reporter: Cheng Lian Priority: Critical Spark shell session for reproduction:
{noformat}
scala> import sqlContext.implicits._
scala> import org.apache.spark.sql.types._
scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
...
CAST(c, BinaryType)
[B@43f13160
[B@5018b648
[B@3be22500
[B@476fc8a1
{noformat}
Spark SQL uses plain byte arrays to represent binary values. However, arrays are compared by reference rather than by value. On the other hand, the DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check for duplicated values. These two facts together cause the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6319) DISTINCT doesn't work for binary type
[ https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6319: --- Assignee: Apache Spark DISTINCT doesn't work for binary type - Key: SPARK-6319 URL: https://issues.apache.org/jira/browse/SPARK-6319 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 Reporter: Cheng Lian Assignee: Apache Spark Priority: Critical Spark shell session for reproduction:
{noformat}
scala> import sqlContext.implicits._
scala> import org.apache.spark.sql.types._
scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
...
CAST(c, BinaryType)
[B@43f13160
[B@5018b648
[B@3be22500
[B@476fc8a1
{noformat}
Spark SQL uses plain byte arrays to represent binary values. However, arrays are compared by reference rather than by value. On the other hand, the DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check for duplicated values. These two facts together cause the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9239) HiveUDAF support for AggregateFunction2
[ https://issues.apache.org/jira/browse/SPARK-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9239: --- Assignee: Apache Spark HiveUDAF support for AggregateFunction2 --- Key: SPARK-9239 URL: https://issues.apache.org/jira/browse/SPARK-9239 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Apache Spark Priority: Blocker We need to build a wrapper for Hive UDAFs on top of AggregateFunction2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9239) HiveUDAF support for AggregateFunction2
[ https://issues.apache.org/jira/browse/SPARK-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9239: --- Assignee: (was: Apache Spark) HiveUDAF support for AggregateFunction2 --- Key: SPARK-9239 URL: https://issues.apache.org/jira/browse/SPARK-9239 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker We need to build a wrapper for Hive UDAFs on top of AggregateFunction2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9239) HiveUDAF support for AggregateFunction2
[ https://issues.apache.org/jira/browse/SPARK-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647291#comment-14647291 ] Apache Spark commented on SPARK-9239: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/7788 HiveUDAF support for AggregateFunction2 --- Key: SPARK-9239 URL: https://issues.apache.org/jira/browse/SPARK-9239 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker We need to build a wrapper for Hive UDAFs on top of AggregateFunction2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (besides YARN and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647848#comment-14647848 ] Sean Owen commented on SPARK-9477: -- It's not part of Spark or supported by the project; IMHO it would not belong in the Spark docs. Standalone/YARN/Mesos are directly supported by code within Spark. I think the closest thing Spark has to that is a "powered by" wiki page, which lists third-party projects/products/services related to Spark: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (besides YARN and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length
[ https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9277: - Assignee: Sean Owen SparseVector constructor must throw an error when declared number of elements less than array length Key: SPARK-9277 URL: https://issues.apache.org/jira/browse/SPARK-9277 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Andrey Vykhodtsev Assignee: Sean Owen Priority: Minor Labels: starter Fix For: 1.5.0 Attachments: SparseVector test.html, SparseVector test.ipynb I found that one can create a SparseVector inconsistently, and it will lead to a Java error at runtime, for example when training LogisticRegressionWithSGD. Here is the test case:
In [2]: sc.version
Out[2]: u'1.3.1'
In [13]: from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD
In [3]: x = SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5})
In [10]: l = LabeledPoint(0, x)
In [12]: r = sc.parallelize([l])
In [14]: m = LogisticRegressionWithSGD.train(r)
Error: Py4JJavaError: An error occurred while calling o86.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2
Attached is the notebook with the scenario and the full message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9248) Closing curly-braces should always be on their own line
[ https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9248: - Assignee: Yu Ishikawa Closing curly-braces should always be on their own line --- Key: SPARK-9248 URL: https://issues.apache.org/jira/browse/SPARK-9248 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Yu Ishikawa Priority: Minor Fix For: 1.5.0 Closing curly-braces should always be on their own line. For example,
{noformat}
inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always be on their own line, unless it's followed by an else.
}, error = function(err) {
  ^
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9248) Closing curly-braces should always be on their own line
[ https://issues.apache.org/jira/browse/SPARK-9248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9248. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7795 [https://github.com/apache/spark/pull/7795] Closing curly-braces should always be on their own line --- Key: SPARK-9248 URL: https://issues.apache.org/jira/browse/SPARK-9248 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Priority: Minor Fix For: 1.5.0 Closing curly-braces should always be on their own line. For example,
{noformat}
inst/tests/test_sparkSQL.R:606:3: style: Closing curly-braces should always be on their own line, unless it's followed by an else.
}, error = function(err) {
  ^
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (besides YARN and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647886#comment-14647886 ] Stacy Pedersen commented on SPARK-9477: --- Fair enough. How about a link at the bottom of the Supplemental Spark Projects page (https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects) for now? I think this is a good start. Also, we don't need code in Spark to support integration with Platform EGO for resource management. Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (besides YARN and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9477) Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (besides YARN and Mesos).
[ https://issues.apache.org/jira/browse/SPARK-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647910#comment-14647910 ] Sean Owen commented on SPARK-9477: -- Seems reasonable to me -- anybody else have an opinion? If not, I'll update the wiki after a day or two. Adding IBM Platform Application Service Controller into Spark documentation as a supported Cluster Manager (besides YARN and Mesos). Key: SPARK-9477 URL: https://issues.apache.org/jira/browse/SPARK-9477 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.1 Reporter: Stacy Pedersen Priority: Minor Fix For: 1.4.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9225) LDASuite needs unit tests for empty documents
[ https://issues.apache.org/jira/browse/SPARK-9225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9225. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7620 [https://github.com/apache/spark/pull/7620] LDASuite needs unit tests for empty documents - Key: SPARK-9225 URL: https://issues.apache.org/jira/browse/SPARK-9225 Project: Spark Issue Type: Test Components: MLlib Reporter: Feynman Liang Assignee: Meihua Wu Priority: Minor Labels: starter Fix For: 1.5.0 We need to add a unit test to {{LDASuite}} which checks that empty documents are handled appropriately without crashing. This would require defining an empty corpus within {{LDASuite}} and adding tests for the available LDA optimizers (currently EM and Online). Note that only {{SparseVector}}s can be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9480) Create a map abstract class MapData and a default implementation backed by 2 ArrayData
Wenchen Fan created SPARK-9480: -- Summary: Create a map abstract class MapData and a default implementation backed by 2 ArrayData Key: SPARK-9480 URL: https://issues.apache.org/jira/browse/SPARK-9480 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
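A hedged sketch of what the title proposes (illustrative shapes, not the merged classes; ArrayData here is a stand-in for the abstraction from SPARK-9390):
{code}
// Stand-in for the ArrayData abstraction from SPARK-9390.
trait ArrayData { def numElements(): Int }

// A map value exposed as two parallel arrays, matched by position.
abstract class MapData {
  def numElements(): Int
  def keyArray(): ArrayData
  def valueArray(): ArrayData
}

// Default implementation backed by 2 ArrayData: one for keys, one for values.
class ArrayBasedMapData(keys: ArrayData, values: ArrayData) extends MapData {
  require(keys.numElements() == values.numElements(),
    "keys and values must contain the same number of elements")
  override def numElements(): Int = keys.numElements()
  override def keyArray(): ArrayData = keys
  override def valueArray(): ArrayData = values
}
{code}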
[jira] [Assigned] (SPARK-9480) Create a map abstract class MapData and a default implementation backed by 2 ArrayData
[ https://issues.apache.org/jira/browse/SPARK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9480: --- Assignee: (was: Apache Spark) Create a map abstract class MapData and a default implementation backed by 2 ArrayData --- Key: SPARK-9480 URL: https://issues.apache.org/jira/browse/SPARK-9480 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9480) Create a map abstract class MapData and a default implementation backed by 2 ArrayData
[ https://issues.apache.org/jira/browse/SPARK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647888#comment-14647888 ] Apache Spark commented on SPARK-9480: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7799 Create a map abstract class MapData and a default implementation backed by 2 ArrayData --- Key: SPARK-9480 URL: https://issues.apache.org/jira/browse/SPARK-9480 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9480) Create a map abstract class MapData and a default implementation backed by 2 ArrayData
[ https://issues.apache.org/jira/browse/SPARK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9480: --- Assignee: Apache Spark Create a map abstract class MapData and a default implementation backed by 2 ArrayData --- Key: SPARK-9480 URL: https://issues.apache.org/jira/browse/SPARK-9480 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9077) Improve error message for decision trees when numExamples < maxCategoriesPerFeature
[ https://issues.apache.org/jira/browse/SPARK-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647937#comment-14647937 ] Apache Spark commented on SPARK-9077: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/7800 Improve error message for decision trees when numExamples < maxCategoriesPerFeature --- Key: SPARK-9077 URL: https://issues.apache.org/jira/browse/SPARK-9077 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Trivial Labels: starter Original Estimate: 48h Remaining Estimate: 48h See [SPARK-9075]'s discussion for details. We should improve the current error message to recommend that the user remove the high-arity categorical features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9077) Improve error message for decision trees when numExamples < maxCategoriesPerFeature
[ https://issues.apache.org/jira/browse/SPARK-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9077: --- Assignee: (was: Apache Spark) Improve error message for decision trees when numExamples < maxCategoriesPerFeature --- Key: SPARK-9077 URL: https://issues.apache.org/jira/browse/SPARK-9077 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Trivial Labels: starter Original Estimate: 48h Remaining Estimate: 48h See [SPARK-9075]'s discussion for details. We should improve the current error message to recommend that the user remove the high-arity categorical features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9077) Improve error message for decision trees when numExamples < maxCategoriesPerFeature
[ https://issues.apache.org/jira/browse/SPARK-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9077: --- Assignee: Apache Spark Improve error message for decision trees when numExamples < maxCategoriesPerFeature --- Key: SPARK-9077 URL: https://issues.apache.org/jira/browse/SPARK-9077 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Trivial Labels: starter Original Estimate: 48h Remaining Estimate: 48h See [SPARK-9075]'s discussion for details. We should improve the current error message to recommend that the user remove the high-arity categorical features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9390) Create an array abstract class ArrayData and a default implementation backed by Array[Object]
[ https://issues.apache.org/jira/browse/SPARK-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-9390. Resolution: Fixed Fix Version/s: 1.5.0 Create an array abstract class ArrayData and a default implementation backed by Array[Object] - Key: SPARK-9390 URL: https://issues.apache.org/jira/browse/SPARK-9390 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan Fix For: 1.5.0 {code} interface ArrayData extends SpecializedGetters { int numElements(); int sizeInBytes(); } {code} We should also add to SpecializedGetters a method to get an array, i.e. {code} interface SpecializedGetters { ... ArrayData getArray(int ordinal); ... } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
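As a companion to the interface above, here is a self-contained Scala sketch of what the default implementation backed by {{Array[Object]}} could look like; the class name and method bodies are assumptions, and the {{SpecializedGetters}} methods are elided.
{code}
// Sketch only: a boxed-object-backed default implementation of the
// interface from the description, with the specialized getters elided.
trait ArrayData {
  def numElements(): Int
  def sizeInBytes(): Int
}

class GenericArrayData(elements: Array[AnyRef]) extends ArrayData {
  override def numElements(): Int = elements.length
  // Rough estimate: one 8-byte reference per element; a real implementation
  // would also have to account for the referenced objects themselves.
  override def sizeInBytes(): Int = elements.length * 8
}
{code}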
[jira] [Resolved] (SPARK-9267) Remove highly unnecessary accumulators stringify methods
[ https://issues.apache.org/jira/browse/SPARK-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9267. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7678 [https://github.com/apache/spark/pull/7678] Remove highly unnecessary accumulators stringify methods Key: SPARK-9267 URL: https://issues.apache.org/jira/browse/SPARK-9267 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Priority: Trivial Fix For: 1.5.0 {code} def stringifyPartialValue(partialValue: Any): String = "%s".format(partialValue) def stringifyValue(value: Any): String = "%s".format(value) {code} These are only used in 1 place (DAGScheduler). The level of indirection actually makes the code harder to read without an editor. We should just inline them... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9267) Remove highly unnecessary accumulators stringify methods
[ https://issues.apache.org/jira/browse/SPARK-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9267: - Assignee: François Garillot Remove highly unnecessary accumulators stringify methods Key: SPARK-9267 URL: https://issues.apache.org/jira/browse/SPARK-9267 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: François Garillot Priority: Trivial Fix For: 1.5.0 {code} def stringifyPartialValue(partialValue: Any): String = "%s".format(partialValue) def stringifyValue(value: Any): String = "%s".format(value) {code} These are only used in 1 place (DAGScheduler). The level of indirection actually makes the code harder to read without an editor. We should just inline them... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6486) Add BlockMatrix in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6486: - Assignee: Mike Dusenberry Add BlockMatrix in PySpark -- Key: SPARK-6486 URL: https://issues.apache.org/jira/browse/SPARK-6486 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Mike Dusenberry We should add BlockMatrix to PySpark. Internally, we can use DataFrames and MatrixUDT for serialization. This JIRA should contain conversions between IndexedRowMatrix/CoordinateMatrix to block matrices. But this does NOT cover linear algebra operations of block matrices. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9489) Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange
Josh Rosen created SPARK-9489: - Summary: Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange Key: SPARK-9489 URL: https://issues.apache.org/jira/browse/SPARK-9489 Project: Spark Issue Type: Improvement Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen While reviewing [~yhuai]'s patch for SPARK-2205, I noticed that Exchange's {{compatible}} check may be incorrectly returning {{false}} in many cases. As far as I know, this is not actually a problem because the {{compatible}}, {{meetsRequirements}}, and {{needsAnySort}} checks are serving only as short-circuit performance optimizations that are not necessary for correctness. In order to reduce code complexity, I think that we should remove these checks and unconditionally rewrite the operator's children. This should be safe because we rewrite the tree in a single bottom-up pass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9493) Chain logistic regression with isotonic regression under the pipeline API
Xiangrui Meng created SPARK-9493: Summary: Chain logistic regression with isotonic regression under the pipeline API Key: SPARK-9493 URL: https://issues.apache.org/jira/browse/SPARK-9493 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng One use case of isotonic regression is to calibrate the probabilities output by logistic regression. We should make this easier in the pipeline API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
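Until that pipeline-level support exists, the chaining can be written out by hand with the current MLlib APIs. The following Scala sketch is only an illustration of the idea under assumed dataset names ({{training}} and {{holdout}}), not the API the ticket proposes: fit a logistic regression model, then fit an isotonic regression from its raw scores to the observed labels on a held-out set.
{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel, LabeledPoint}
import org.apache.spark.rdd.RDD

// Manual calibration sketch: `training` and `holdout` are assumed inputs.
def calibrated(training: RDD[LabeledPoint],
               holdout: RDD[LabeledPoint]): IsotonicRegressionModel = {
  val lr = new LogisticRegressionWithLBFGS().run(training)
  lr.clearThreshold() // predict() now returns raw probabilities, not 0/1 labels

  // (label, predicted probability, weight) triples for the isotonic fit.
  val labelScoreWeight = holdout.map(p => (p.label, lr.predict(p.features), 1.0))
  new IsotonicRegression().setIsotonic(true).run(labelScoreWeight)
}
{code}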
[jira] [Resolved] (SPARK-9408) Refactor mllib/linalg.py to mllib/linalg
[ https://issues.apache.org/jira/browse/SPARK-9408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9408. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7746 [https://github.com/apache/spark/pull/7746] Refactor mllib/linalg.py to mllib/linalg Key: SPARK-9408 URL: https://issues.apache.org/jira/browse/SPARK-9408 Project: Spark Issue Type: Task Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Manoj Kumar Fix For: 1.5.0 We need to refactor mllib/linalg.py to mllib/linalg so that the project structure is similar to that of Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9361) Refactor new aggregation code to reduce the times of checking compatibility
[ https://issues.apache.org/jira/browse/SPARK-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9361: Issue Type: Sub-task (was: Improvement) Parent: SPARK-4366 Refactor new aggregation code to reduce the times of checking compatibility --- Key: SPARK-9361 URL: https://issues.apache.org/jira/browse/SPARK-9361 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Liang-Chi Hsieh Currently, we call aggregate.Utils.tryConvert in many places to check if the logical.aggregate can be run with the new aggregation code. But it looks like aggregate.Utils.tryConvert takes a lot of time to run. We should call tryConvert only once, keep its value in logical.aggregate, and reuse it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9361) Refactor new aggregation code to reduce the times of checking compatibility
[ https://issues.apache.org/jira/browse/SPARK-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-9361. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7677 [https://github.com/apache/spark/pull/7677] Refactor new aggregation code to reduce the times of checking compatibility --- Key: SPARK-9361 URL: https://issues.apache.org/jira/browse/SPARK-9361 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Liang-Chi Hsieh Fix For: 1.5.0 Currently, we call aggregate.Utils.tryConvert in many places to check if the logical.aggregate can be run with the new aggregation code. But it looks like aggregate.Utils.tryConvert takes a lot of time to run. We should call tryConvert only once, keep its value in logical.aggregate, and reuse it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9361) Refactor new aggregation code to reduce the times of checking compatibility
[ https://issues.apache.org/jira/browse/SPARK-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9361: Assignee: Liang-Chi Hsieh Refactor new aggregation code to reduce the times of checking compatibility --- Key: SPARK-9361 URL: https://issues.apache.org/jira/browse/SPARK-9361 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Fix For: 1.5.0 Currently, we call aggregate.Utils.tryConvert in many places to check if the logical.aggregate can be run with the new aggregation code. But it looks like aggregate.Utils.tryConvert takes a lot of time to run. We should call tryConvert only once, keep its value in logical.aggregate, and reuse it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
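The caching the ticket describes amounts to memoizing the conversion once per plan node. A minimal, self-contained Scala sketch of the idea (an illustration, not the merged change):
{code}
// Memoization in miniature: the expensive conversion runs at most once per
// node, on first access, instead of at every call site that checks it.
class AggregateNode(convert: AggregateNode => Option[AggregateNode]) {
  lazy val converted: Option[AggregateNode] = convert(this)
}
{code}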
[jira] [Resolved] (SPARK-8297) Scheduler backend is not notified in case node fails in YARN
[ https://issues.apache.org/jira/browse/SPARK-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-8297. --- Resolution: Fixed Fix Version/s: 1.5.0 Scheduler backend is not notified in case node fails in YARN Key: SPARK-8297 URL: https://issues.apache.org/jira/browse/SPARK-8297 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.2, 1.3.1, 1.4.1, 1.5.0 Environment: Spark on yarn - both client and cluster mode. Reporter: Mridul Muralidharan Assignee: Mridul Muralidharan Priority: Critical Fix For: 1.5.0 When a node crashes, YARN detects the failure and notifies Spark - but this information is not propagated to the scheduler backend (unlike in Mesos mode, for example). This results in repeated re-execution of stages (due to FetchFailedException on the shuffle side), finally leading to application failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9388) Make log messages in ExecutorRunnable more readable
[ https://issues.apache.org/jira/browse/SPARK-9388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-9388. --- Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 1.5.0 Make log messages in ExecutorRunnable more readable --- Key: SPARK-9388 URL: https://issues.apache.org/jira/browse/SPARK-9388 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.5.0 Reporter: Marcelo Vanzin Assignee: Marcelo Vanzin Priority: Trivial Fix For: 1.5.0 There are a couple of debug messages printed in ExecutorRunnable containing information about the container being started. They're printed all on one line, which makes them - especially the one containing the process's environment - hard to read. We should make them nicer (like the similar message printed by Client.scala). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9437) SizeEstimator overflows for primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648025#comment-14648025 ] Shivaram Venkataraman commented on SPARK-9437: -- Resolved by https://github.com/apache/spark/pull/7750 SizeEstimator overflows for primitive arrays Key: SPARK-9437 URL: https://issues.apache.org/jira/browse/SPARK-9437 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Imran Rashid Assignee: Imran Rashid Priority: Minor Fix For: 1.5.0 {{SizeEstimator}} can overflow when dealing with large primitive arrays, e.g. if you have an {{Array[Double]}} of size 1 << 28. This means that when you try to broadcast a large primitive array, you get: {noformat} java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -2147483608 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9437) SizeEstimator overflows for primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9437. -- Resolution: Fixed Fix Version/s: 1.5.0 SizeEstimator overflows for primitive arrays Key: SPARK-9437 URL: https://issues.apache.org/jira/browse/SPARK-9437 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Imran Rashid Assignee: Imran Rashid Priority: Minor Fix For: 1.5.0 {{SizeEstimator}} can overflow when dealing with large primitive arrays, e.g. if you have an {{Array[Double]}} of size 1 << 28. This means that when you try to broadcast a large primitive array, you get: {noformat} java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -2147483608 at scala.Predef$.require(Predef.scala:233) at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
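The negative size is a plain {{Int}} overflow: the element count times the per-element size wraps before it is widened to a long (the extra 40 bytes in the reported value are presumably estimated array-header overhead). A quick Scala illustration of the arithmetic:
{code}
// 1 << 28 doubles occupy 2^31 bytes, which does not fit in a signed Int.
val numElements = 1 << 28
val bytesAsInt  = numElements * 8        // wraps to -2147483648
val bytesAsLong = numElements.toLong * 8 // 2147483648, the intended value
{code}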
[jira] [Assigned] (SPARK-9481) LocalLDAModel logLikelihood
[ https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9481: --- Assignee: (was: Apache Spark) LocalLDAModel logLikelihood --- Key: SPARK-9481 URL: https://issues.apache.org/jira/browse/SPARK-9481 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Priority: Trivial We already have a variational {{bound}} method so we should provide a public {{logLikelihood}} that uses the model's parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9481) LocalLDAModel logLikelihood
[ https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9481: - Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-5572) LocalLDAModel logLikelihood --- Key: SPARK-9481 URL: https://issues.apache.org/jira/browse/SPARK-9481 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Priority: Trivial We already have a variational {{bound}} method so we should provide a public {{logLikelihood}} that uses the model's parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9481) LocalLDAModel logLikelihood
[ https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647991#comment-14647991 ] Feynman Liang commented on SPARK-9481: -- Working on this LocalLDAModel logLikelihood --- Key: SPARK-9481 URL: https://issues.apache.org/jira/browse/SPARK-9481 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Feynman Liang Priority: Trivial We already have a variational {{bound}} method so we should provide a public {{logLikelihood}} that uses the model's parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9481) LocalLDAModel logLikelihood
Feynman Liang created SPARK-9481: Summary: LocalLDAModel logLikelihood Key: SPARK-9481 URL: https://issues.apache.org/jira/browse/SPARK-9481 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Feynman Liang Priority: Trivial We already have a variational {{bound}} method so we should provide a public {{logLikelihood}} that uses the model's parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8850) Turn unsafe mode on by default
[ https://issues.apache.org/jira/browse/SPARK-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-8850. Resolution: Fixed Fix Version/s: 1.5.0 Turn unsafe mode on by default -- Key: SPARK-8850 URL: https://issues.apache.org/jira/browse/SPARK-8850 Project: Spark Issue Type: Task Components: SQL Reporter: Reynold Xin Assignee: Josh Rosen Fix For: 1.5.0 Let's turn unsafe on and see what bugs we find in preparation for 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9481) LocalLDAModel logLikelihood
[ https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648039#comment-14648039 ] Apache Spark commented on SPARK-9481: - User 'feynmanliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7801 LocalLDAModel logLikelihood --- Key: SPARK-9481 URL: https://issues.apache.org/jira/browse/SPARK-9481 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Priority: Trivial We already have a variational {{bound}} method so we should provide a public {{logLikelihood}} that uses the model's parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9481) LocalLDAModel logLikelihood
[ https://issues.apache.org/jira/browse/SPARK-9481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9481: --- Assignee: Apache Spark LocalLDAModel logLikelihood --- Key: SPARK-9481 URL: https://issues.apache.org/jira/browse/SPARK-9481 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang Assignee: Apache Spark Priority: Trivial We already have a variational {{bound}} method so we should provide a public {{logLikelihood}} that uses the model's parameters -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-9282) Filter on Spark DataFrame with multiple columns
[ https://issues.apache.org/jira/browse/SPARK-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal reopened SPARK-9282: On using '&' instead of 'and', the following error occurs:
{noformat}
Py4JError Traceback (most recent call last)
<ipython-input-8-b3101afeeb7a> in <module>()
----> 1 df1.filter(df1.age > 21 & df1.age < 45).show(10)

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py in _(self, other)
    999 def _(self, other):
   1000     jc = other._jc if isinstance(other, Column) else other
-> 1001     njc = getattr(self._jc, name)(jc)
   1002     return Column(njc)
   1003 _.__doc__ = doc

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536 answer = self.gateway_client.send_command(command)
    537 return_value = get_return_value(answer, self.gateway_client,
--> 538     self.target_id, self.name)
    539
    540 for temp_arg in temp_args:

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    302 raise Py4JError(
    303     'An error occurred while calling {0}{1}{2}. Trace:\n{3}\n'.
--> 304     format(target_id, '.', name, value))
    305 else:
    306     raise Py4JError(

Py4JError: An error occurred while calling o83.and. Trace:
py4j.Py4JException: Method and([class java.lang.Integer]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
    at py4j.Gateway.invoke(Gateway.java:252)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
Filter on Spark DataFrame with multiple columns --- Key: SPARK-9282 URL: https://issues.apache.org/jira/browse/SPARK-9282 Project: Spark Issue Type: Bug Components: Spark Core, Spark Shell, SQL Affects Versions: 1.3.0 Environment: CDH 5.0 on CentOS6 Reporter: Sandeep Pal Filter on dataframe does not work if we have more than one column inside the filter. Nonetheless, it works on an RDD. Following is the example:
{noformat}
>>> df1.show()
age coolid depid empname
23  7      1     sandeep
21  8      2     john
24  9      1     cena
45  12     3     bob
20  7      4     tanay
12  8      5     gaurav

>>> df1.filter(df1.age > 21 and df1.age < 45).show(10)
age coolid depid empname
23  7      1     sandeep
21  8      2     john
24  9      1     cena
20  7      4     tanay
12  8      5     gaurav
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
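For the record: the Python keyword {{and}} cannot be overloaded, so PySpark Column predicates must be combined with {{&}}, and each comparison needs its own parentheses because {{&}} binds more tightly than {{<}} and {{>}} - in this case {{df1.filter((df1.age > 21) & (df1.age < 45)).show(10)}}. The same predicate in the Scala DataFrame API, where {{&&}} composes Column expressions directly (assuming {{df1}} is the DataFrame from the report):
{code}
// Scala analogue of the intended filter; && builds a conjunction of Column
// predicates without the Python operator-precedence pitfall.
df1.filter(df1("age") > 21 && df1("age") < 45).show(10)
{code}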
[jira] [Commented] (SPARK-8497) Graph Clique(Complete Connected Sub-graph) Discovery Algorithm
[ https://issues.apache.org/jira/browse/SPARK-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14648126#comment-14648126 ] Xiangrui Meng commented on SPARK-8497: -- Please provide the algorithm you want to implement, which should be based on some published work for correctness. I don't know how to handle the exponential growth of the number of cliques. For example, if we have a clique of size 40, there will be (40 choose 20) cliques of size 20, which is more than 100 billion. Graph Clique(Complete Connected Sub-graph) Discovery Algorithm -- Key: SPARK-8497 URL: https://issues.apache.org/jira/browse/SPARK-8497 Project: Spark Issue Type: New Feature Components: GraphX, ML, MLlib, Spark Core Reporter: Fan Jiang Assignee: Fan Jiang Labels: features Original Estimate: 72h Remaining Estimate: 72h In recent years the social network industry has had a high demand for complete-connected-sub-graph discovery, as has telecom. Much like the follower graph on Twitter, the calls and other activities in the telecom world form a huge social graph, and because of the nature of the communication medium it captures the strongest inter-person relationships; graph-based analysis can reveal tremendous value from telecom connections. We need an algorithm in Spark to find ALL the strongest completely connected sub-graphs (called cliques here) for EVERY person in the network, which will be one of the starting points for understanding a user's social behaviour. At Huawei we have many real-world use cases that involve telecom social graphs with tens of billions of edges and hundreds of millions of vertices, where the number of cliques will be in the tens of millions. The graph changes quickly, which means we need to analyse the graph pattern very often (one result per day/week over a moving time window spanning multiple months). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
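The combinatorial claim checks out. A self-contained Scala snippet that computes the binomial coefficient exactly (the stepwise division is always exact because each prefix product is itself a binomial coefficient):
{code}
// C(40, 20) = 137,846,528,820 -- indeed more than 100 billion size-20
// sub-cliques inside a single clique of size 40.
def choose(n: Int, k: Int): BigInt =
  (1 to k).foldLeft(BigInt(1))((acc, i) => acc * (n - k + i) / i)

println(choose(40, 20)) // 137846528820
{code}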
[jira] [Updated] (SPARK-9463) Expose model coefficients with names in SparkR RFormula
[ https://issues.apache.org/jira/browse/SPARK-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9463: - Assignee: Eric Liang Expose model coefficients with names in SparkR RFormula --- Key: SPARK-9463 URL: https://issues.apache.org/jira/browse/SPARK-9463 Project: Spark Issue Type: Improvement Components: ML, SparkR Reporter: Eric Liang Assignee: Eric Liang Currently you cannot retrieve model statistics from the R side; we should at least allow showing the coefficients for 1.5. Design doc from the umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9483) UTF8String.getPrefix only works in little-endian order
Reynold Xin created SPARK-9483: -- Summary: UTF8String.getPrefix only works in little-endian order Key: SPARK-9483 URL: https://issues.apache.org/jira/browse/SPARK-9483 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
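The ticket has no description, but the likely shape of a portable fix (an assumption on my part, not the actual patch) is to normalize the 8-byte prefix according to the platform's byte order, so that unsigned comparison of the resulting longs matches lexicographic byte order on any JVM:
{code}
import java.nio.ByteOrder

// Hypothetical sketch: a raw 8-byte load on a little-endian machine puts the
// string's first byte in the least significant position and must be
// reversed; on a big-endian machine the raw load already compares correctly.
def normalizePrefix(rawPrefix: Long): Long =
  if (ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN) rawPrefix
  else java.lang.Long.reverseBytes(rawPrefix)
{code}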
[jira] [Updated] (SPARK-6805) MLlib + SparkR integration for 1.5
[ https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6805: - Summary: MLlib + SparkR integration for 1.5 (was: ML Pipeline API in SparkR) MLlib + SparkR integration for 1.5 -- Key: SPARK-6805 URL: https://issues.apache.org/jira/browse/SPARK-6805 Project: Spark Issue Type: Umbrella Components: ML, SparkR Reporter: Xiangrui Meng Priority: Critical SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API in SparkR. The implementation should be similar to the pipeline API implementation in Python. For Spark 1.5, we want to support linear/logistic regression in SparkR, with basic support for R formula and elastic-net regularization. The design doc can be viewed at https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6805) MLlib + SparkR integration for 1.5
[ https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6805: - Assignee: Eric Liang MLlib + SparkR integration for 1.5 -- Key: SPARK-6805 URL: https://issues.apache.org/jira/browse/SPARK-6805 Project: Spark Issue Type: Umbrella Components: ML, SparkR Reporter: Xiangrui Meng Assignee: Eric Liang Priority: Critical SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API in SparkR. The implementation should be similar to the pipeline API implementation in Python. For Spark 1.5, we want to support linear/logistic regression in SparkR, with basic support for R formula and elastic-net regularization. The design doc can be viewed at https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9463) Expose model coefficients with names in SparkR RFormula
[ https://issues.apache.org/jira/browse/SPARK-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-9463: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-6805 Expose model coefficients with names in SparkR RFormula --- Key: SPARK-9463 URL: https://issues.apache.org/jira/browse/SPARK-9463 Project: Spark Issue Type: Sub-task Components: ML, SparkR Reporter: Eric Liang Assignee: Eric Liang Currently you cannot retrieve model statistics from the R side; we should at least allow showing the coefficients for 1.5. Design doc from the umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6805) MLlib + SparkR integration for 1.5
[ https://issues.apache.org/jira/browse/SPARK-6805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6805: - Description: --SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API in SparkR. The implementation should be similar to the pipeline API implementation in Python.-- We limited the scope of this JIRA to MLlib + SparkR integration for 1.5. For Spark 1.5, we want to support linear/logistic regression in SparkR, with basic support for R formula and elastic-net regularization. The design doc can be viewed at https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing was: ~~SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API in SparkR. The implementation should be similar to the pipeline API implementation in Python.~~ We limited the scope of this JIRA to MLlib + SparkR integration for 1.5. For Spark 1.5, we want to support linear/logistic regression in SparkR, with basic support for R formula and elastic-net regularization. The design doc can be viewed at https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing MLlib + SparkR integration for 1.5 -- Key: SPARK-6805 URL: https://issues.apache.org/jira/browse/SPARK-6805 Project: Spark Issue Type: Umbrella Components: ML, SparkR Reporter: Xiangrui Meng Assignee: Eric Liang Priority: Critical --SparkR was merged. So let's have this umbrella JIRA for the ML pipeline API in SparkR. The implementation should be similar to the pipeline API implementation in Python.-- We limited the scope of this JIRA to MLlib + SparkR integration for 1.5. For Spark 1.5, we want to support linear/logistic regression in SparkR, with basic support for R formula and elastic-net regularization. The design doc can be viewed at https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9471) Multilayer perceptron
Alexander Ulanov created SPARK-9471: --- Summary: Multilayer perceptron Key: SPARK-9471 URL: https://issues.apache.org/jira/browse/SPARK-9471 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Implement Multilayer Perceptron for Spark ML. Requirements: 1) ML pipelines interface 2) Extensible internal interface for further development of artificial neural networks for ML 3) Efficient and scalable: use vectors and BLAS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9471) Multilayer perceptron
[ https://issues.apache.org/jira/browse/SPARK-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9471: --- Assignee: Apache Spark Multilayer perceptron -- Key: SPARK-9471 URL: https://issues.apache.org/jira/browse/SPARK-9471 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Assignee: Apache Spark Fix For: 1.4.0 Original Estimate: 8,736h Remaining Estimate: 8,736h Implement Multilayer Perceptron for Spark ML. Requirements: 1) ML pipelines interface 2) Extensible internal interface for further development of artificial neural networks for ML 3) Efficient and scalable: use vectors and BLAS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9471) Multilayer perceptron
[ https://issues.apache.org/jira/browse/SPARK-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647438#comment-14647438 ] Apache Spark commented on SPARK-9471: - User 'avulanov' has created a pull request for this issue: https://github.com/apache/spark/pull/7621 Multilayer perceptron -- Key: SPARK-9471 URL: https://issues.apache.org/jira/browse/SPARK-9471 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Original Estimate: 8,736h Remaining Estimate: 8,736h Implement Multilayer Perceptron for Spark ML. Requirements: 1) ML pipelines interface 2) Extensible internal interface for further development of artificial neural networks for ML 3) Efficient and scalable: use vectors and BLAS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9471) Multilayer perceptron
[ https://issues.apache.org/jira/browse/SPARK-9471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9471: --- Assignee: (was: Apache Spark) Multilayer perceptron -- Key: SPARK-9471 URL: https://issues.apache.org/jira/browse/SPARK-9471 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 1.4.0 Reporter: Alexander Ulanov Fix For: 1.4.0 Original Estimate: 8,736h Remaining Estimate: 8,736h Implement Multilayer Perceptron for Spark ML. Requirements: 1) ML pipelines interface 2) Extensible internal interface for further development of artificial neural networks for ML 3) Efficient and scalable: use vectors and BLAS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8622) Spark 1.3.1 and 1.4.0 don't put executor working directory on executor classpath
[ https://issues.apache.org/jira/browse/SPARK-8622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647440#comment-14647440 ] Baswaraj commented on SPARK-8622: - Any update on this? Spark 1.3.1 and 1.4.0 don't put executor working directory on executor classpath -- Key: SPARK-8622 URL: https://issues.apache.org/jira/browse/SPARK-8622 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.1, 1.4.0 Reporter: Baswaraj I ran into an issue where the executor is not able to pick up my configs/functions from my custom jar in standalone (client/cluster) deploy mode. I have used the spark-submit --jars option to specify all the jars and configs to be used by executors. All these files are placed in the working directory of the executor, but not on the executor classpath; the executor working directory itself is not on the executor classpath either. I am expecting the executor to find all files specified in the spark-submit --jars option. In Spark 1.3.0 the executor working directory is on the executor classpath, so the app runs successfully. To run my application successfully with Spark 1.3.1+, I have to set the following option (conf/spark-defaults.conf): spark.executor.extraClassPath . Please advise. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647441#comment-14647441 ] Imran Rashid commented on SPARK-3644: - [~zxzxy1988] The test is here https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/history/HistoryServerSuite.scala It references test files here: https://github.com/apache/spark/tree/master/core/src/test/resources/HistoryServerExpectations REST API for Spark application info (jobs / stages / tasks / storage info) -- Key: SPARK-3644 URL: https://issues.apache.org/jira/browse/SPARK-3644 Project: Spark Issue Type: New Feature Components: Spark Core, Web UI Reporter: Josh Rosen Assignee: Imran Rashid Fix For: 1.4.0 This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first-principles. We can discuss what urls / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, how the API will be documented and tested, etc. Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9470) Java API function interface cleanup
[ https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647443#comment-14647443 ] Sean Owen commented on SPARK-9470: -- I feel pretty strongly this is not worth it. I'd like to close shortly unless there is a strenuous objection. Java API function interface cleanup --- Key: SPARK-9470 URL: https://issues.apache.org/jira/browse/SPARK-9470 Project: Spark Issue Type: Improvement Reporter: Rahul Kavale Priority: Trivial Hi guys, I was exploring Spark codebase, and came across the Java API function interfaces. The interfaces have the 'call' method as 'public' which is redundant. https://github.com/apache/spark/pull/7790 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4492) Exception when following SimpleApp tutorial java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
[ https://issues.apache.org/jira/browse/SPARK-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647446#comment-14647446 ] sam commented on SPARK-4492: I imagine building a fat jar for running with `java -cp` is possible, but I have never managed to get it to work. It would be great if upon each release of Spark, an example build file could be provided. Exception when following SimpleApp tutorial java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil -- Key: SPARK-4492 URL: https://issues.apache.org/jira/browse/SPARK-4492 Project: Spark Issue Type: Bug Reporter: sam When I follow the example here https://spark.apache.org/docs/1.0.2/quick-start.html and run with java -cp my.jar my.main.Class with master set to yarn-client I get the below exception. Exception in thread "main" java.lang.ExceptionInInitializerError at org.apache.spark.SparkContext.<init>(SparkContext.scala:228) at com.barclays.SimpleApp$.main(SimpleApp.scala:11) at com.barclays.SimpleApp.main(SimpleApp.scala) Caused by: org.apache.spark.SparkException: Unable to load YARN support at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:106) at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:101) at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala) ... 3 more Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:169) at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:102) ... 5 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
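In the spirit of that request, here is a minimal sketch of such a build file for sbt (to be assembled into a single jar with the sbt-assembly plugin); the versions are placeholders and the availability of a published spark-yarn artifact for the chosen Spark version is an assumption.
{code}
// build.sbt -- a minimal sketch, not an official example.
name := "simple-app"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1",
  // Assumed artifact: provides org.apache.spark.deploy.yarn.YarnSparkHadoopUtil,
  // the class missing from the classpath in the reported exception.
  "org.apache.spark" %% "spark-yarn" % "1.4.1"
)
{code}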
[jira] [Comment Edited] (SPARK-9470) Java API function interface cleanup
[ https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647457#comment-14647457 ] Rahul Kavale edited comment on SPARK-9470 at 7/30/15 10:49 AM: --- Hi Sean, I just wanted to do this small cleanup which felt like obviously redundant code to me. was (Author: rahulkavale): Hi Sean, I just wanted to do this small cleanup which felt like obviously redundant code for me. Java API function interface cleanup --- Key: SPARK-9470 URL: https://issues.apache.org/jira/browse/SPARK-9470 Project: Spark Issue Type: Improvement Reporter: Rahul Kavale Priority: Trivial Hi guys, I was exploring Spark codebase, and came across the Java API function interfaces. The interfaces have the 'call' method as 'public' which is redundant. https://github.com/apache/spark/pull/7790 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9470) Java API function interface cleanup
[ https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647457#comment-14647457 ] Rahul Kavale commented on SPARK-9470: - Hi Sean, I just wanted to do this small cleanup which felt like obviously redundant code for me. Java API function interface cleanup --- Key: SPARK-9470 URL: https://issues.apache.org/jira/browse/SPARK-9470 Project: Spark Issue Type: Improvement Reporter: Rahul Kavale Priority: Trivial Hi guys, I was exploring Spark codebase, and came across the Java API function interfaces. The interfaces have the 'call' method as 'public' which is redundant. https://github.com/apache/spark/pull/7790 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9470) Java API function interface cleanup
[ https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647473#comment-14647473 ] Sean Owen commented on SPARK-9470: -- [~rahulkavale] see my comment on the PR. I also follow this convention, but, many people don't because it's not obviously redundant -- some would argue that interface methods should _always_ be marked {{public}} because they are always implicitly {{public}} and removing the access modifier makes it look to those who don't know the difference in behavior that these are package-private. Java API function interface cleanup --- Key: SPARK-9470 URL: https://issues.apache.org/jira/browse/SPARK-9470 Project: Spark Issue Type: Improvement Reporter: Rahul Kavale Priority: Trivial Hi guys, I was exploring Spark codebase, and came across the Java API function interfaces. The interfaces have the 'call' method as 'public' which is redundant. https://github.com/apache/spark/pull/7790 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9359) Support IntervalType for Parquet
[ https://issues.apache.org/jira/browse/SPARK-9359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9359: --- Assignee: Apache Spark Support IntervalType for Parquet Key: SPARK-9359 URL: https://issues.apache.org/jira/browse/SPARK-9359 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian Assignee: Apache Spark SPARK-8753 introduced {{IntervalType}} which corresponds to Parquet {{INTERVAL}} logical type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9359) Support IntervalType for Parquet
[ https://issues.apache.org/jira/browse/SPARK-9359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647479#comment-14647479 ] Apache Spark commented on SPARK-9359: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/7793 Support IntervalType for Parquet Key: SPARK-9359 URL: https://issues.apache.org/jira/browse/SPARK-9359 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian SPARK-8753 introduced {{IntervalType}} which corresponds to Parquet {{INTERVAL}} logical type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9359) Support IntervalType for Parquet
[ https://issues.apache.org/jira/browse/SPARK-9359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9359: --- Assignee: (was: Apache Spark) Support IntervalType for Parquet Key: SPARK-9359 URL: https://issues.apache.org/jira/browse/SPARK-9359 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.5.0 Reporter: Cheng Lian SPARK-8753 introduced {{IntervalType}} which corresponds to Parquet {{INTERVAL}} logical type. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647406#comment-14647406 ] Thomas Demoor commented on SPARK-7481: -- Pulled the aws-upgrade out of HADOOP-11684 to a separate issue HADOOP-12269. Only uses aws-sdk-s3-1.10.6 instead of the entire sdk. Add Hadoop 2.6+ profile to pull in object store FS accessors Key: SPARK-7481 URL: https://issues.apache.org/jira/browse/SPARK-7481 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Steve Loughran To keep the s3n classpath right, and to add s3a, swift & azure, the dependencies of Spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure). This adds more stuff to the client bundle, but will mean a single Spark package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8838) Add config to enable/disable merging part-files when merging parquet schema
[ https://issues.apache.org/jira/browse/SPARK-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-8838. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7238 [https://github.com/apache/spark/pull/7238] Add config to enable/disable merging part-files when merging parquet schema --- Key: SPARK-8838 URL: https://issues.apache.org/jira/browse/SPARK-8838 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Fix For: 1.5.0 Currently all part-files are merged when merging the parquet schema. However, there may be many part-files, and we may be able to make sure that all the part-files have the same schema as their summary file. For that case, this adds a configuration to disable merging part-files when merging the parquet schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647432#comment-14647432 ] Robin East commented on SPARK-5692: --- Hi, the description includes the sentence 'We may want to discuss whether we want to be compatible with the original Word2Vec model storage format.' Was this ever discussed? I can't see anything in the comment stream for this JIRA. Is there any interest in adding functionality to import Word2Vec models from the original binary format (e.g. the 300-million-word Google News model)? Model import/export for Word2Vec Key: SPARK-5692 URL: https://issues.apache.org/jira/browse/SPARK-5692 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Manoj Kumar Fix For: 1.4.0 Support save and load for Word2VecModel. We may want to discuss whether we want to be compatible with the original Word2Vec model storage format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length
[ https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647380#comment-14647380 ] Manoj Kumar commented on SPARK-9277: I will not have access to a development environment till Saturday. Feel free to fix it. Thanks. SparseVector constructor must throw an error when declared number of elements less than array length Key: SPARK-9277 URL: https://issues.apache.org/jira/browse/SPARK-9277 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Andrey Vykhodtsev Priority: Minor Labels: starter Attachments: SparseVector test.html, SparseVector test.ipynb I found that one can create a SparseVector inconsistently, and it will lead to a Java error at runtime, for example when training LogisticRegressionWithSGD. Here is the test case: In [2]: sc.version Out[2]: u'1.3.1' In [13]: from pyspark.mllib.linalg import SparseVector from pyspark.mllib.regression import LabeledPoint from pyspark.mllib.classification import LogisticRegressionWithSGD In [3]: x = SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5}) In [10]: l = LabeledPoint(0, x) In [12]: r = sc.parallelize([l]) In [14]: m = LogisticRegressionWithSGD.train(r) Error: Py4JJavaError: An error occurred while calling o86.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2 Attached is the notebook with the scenario and the full message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
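The fix the ticket asks for is a bounds check at construction time. A self-contained Scala sketch of the validation (an illustration, not the merged change; it relies on the convention that sparse-vector indices are kept sorted in increasing order):
{code}
// Sketch of constructor-time validation for a sparse vector with parallel
// index/value arrays; indices are assumed sorted in increasing order.
class SparseVector(val size: Int, val indices: Array[Int], val values: Array[Double]) {
  require(indices.length == values.length,
    s"indices (${indices.length}) and values (${values.length}) must have equal length")
  require(indices.isEmpty || indices.last < size,
    s"index ${indices.last} out of bounds for declared size $size")
}
{code}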
[jira] [Commented] (SPARK-9470) Java API function interface cleanup
[ https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647487#comment-14647487 ] Rahul Kavale commented on SPARK-9470: - [~srowen] I see your point, just felt if we all know methods in an interface are implicitly public, then what value does it add making them 'public' explicitly. Anyways, thanks for your comment. Closing the issue. Java API function interface cleanup --- Key: SPARK-9470 URL: https://issues.apache.org/jira/browse/SPARK-9470 Project: Spark Issue Type: Improvement Reporter: Rahul Kavale Priority: Trivial Hi guys, I was exploring Spark codebase, and came across the Java API function interfaces. The interfaces have the 'call' method as 'public' which is redundant. https://github.com/apache/spark/pull/7790 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-9470) Java API function interface cleanup
[ https://issues.apache.org/jira/browse/SPARK-9470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rahul Kavale closed SPARK-9470. --- Resolution: Won't Fix Java API function interface cleanup --- Key: SPARK-9470 URL: https://issues.apache.org/jira/browse/SPARK-9470 Project: Spark Issue Type: Improvement Reporter: Rahul Kavale Priority: Trivial Hi guys, I was exploring Spark codebase, and came across the Java API function interfaces. The interfaces have the 'call' method as 'public' which is redundant. https://github.com/apache/spark/pull/7790 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length
[ https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9277: --- Assignee: Apache Spark SparseVector constructor must throw an error when declared number of elements less than array length Key: SPARK-9277 URL: https://issues.apache.org/jira/browse/SPARK-9277 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Andrey Vykhodtsev Assignee: Apache Spark Priority: Minor Labels: starter Attachments: SparseVector test.html, SparseVector test.ipynb I found that one can create a SparseVector inconsistently, and it will lead to a Java error at runtime, for example when training LogisticRegressionWithSGD. Here is the test case: In [2]: sc.version Out[2]: u'1.3.1' In [13]: from pyspark.mllib.linalg import SparseVector from pyspark.mllib.regression import LabeledPoint from pyspark.mllib.classification import LogisticRegressionWithSGD In [3]: x = SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5}) In [10]: l = LabeledPoint(0, x) In [12]: r = sc.parallelize([l]) In [14]: m = LogisticRegressionWithSGD.train(r) Error: Py4JJavaError: An error occurred while calling o86.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2 Attached is the notebook with the scenario and the full message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9277) SparseVector constructor must throw an error when declared number of elements less than array length
[ https://issues.apache.org/jira/browse/SPARK-9277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647490#comment-14647490 ] Apache Spark commented on SPARK-9277: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/7794 SparseVector constructor must throw an error when declared number of elements less than array length Key: SPARK-9277 URL: https://issues.apache.org/jira/browse/SPARK-9277 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Andrey Vykhodtsev Priority: Minor Labels: starter Attachments: SparseVector test.html, SparseVector test.ipynb I found that one can create a SparseVector inconsistently, and it will lead to a Java error at runtime, for example when training LogisticRegressionWithSGD. Here is the test case: In [2]: sc.version Out[2]: u'1.3.1' In [13]: from pyspark.mllib.linalg import SparseVector from pyspark.mllib.regression import LabeledPoint from pyspark.mllib.classification import LogisticRegressionWithSGD In [3]: x = SparseVector(2, {1:1, 2:2, 3:3, 4:4, 5:5}) In [10]: l = LabeledPoint(0, x) In [12]: r = sc.parallelize([l]) In [14]: m = LogisticRegressionWithSGD.train(r) Error: Py4JJavaError: An error occurred while calling o86.trainLogisticRegressionModelWithSGD. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 11.0 failed 1 times, most recent failure: Lost task 7.0 in stage 11.0 (TID 47, localhost): java.lang.ArrayIndexOutOfBoundsException: 2 Attached is the notebook with the scenario and the full message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org