[jira] [Created] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params
guowei created SPARK-8865: - Summary: Fix bug: init SimpleConsumerConfig with kafka params Key: SPARK-8865 URL: https://issues.apache.org/jira/browse/SPARK-8865 Project: Spark Issue Type: Bug Components: Streaming Reporter: guowei Fix For: 1.4.0
[jira] [Commented] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params
[ https://issues.apache.org/jira/browse/SPARK-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616245#comment-14616245 ] Apache Spark commented on SPARK-8865: - User 'guowei2' has created a pull request for this issue: https://github.com/apache/spark/pull/7254 Fix bug: init SimpleConsumerConfig with kafka params - Key: SPARK-8865 URL: https://issues.apache.org/jira/browse/SPARK-8865 Project: Spark Issue Type: Bug Components: Streaming Reporter: guowei Fix For: 1.4.0
[jira] [Assigned] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params
[ https://issues.apache.org/jira/browse/SPARK-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8865: --- Assignee: (was: Apache Spark) Fix bug: init SimpleConsumerConfig with kafka params - Key: SPARK-8865 URL: https://issues.apache.org/jira/browse/SPARK-8865 Project: Spark Issue Type: Bug Components: Streaming Reporter: guowei Fix For: 1.4.0
[jira] [Commented] (SPARK-3164) Store DecisionTree Split.categories as Set
[ https://issues.apache.org/jira/browse/SPARK-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616246#comment-14616246 ] Rekha Joshi commented on SPARK-3164: it was not assigned, nor closed/commented, and improvement suggestion needed in latest snapshot. But ack on decision made on api stability. thanks for letting me know [~josephkb] - - sad, as i updated 5 files for a trivial pursuit, with the scala style checks :-) :-) Store DecisionTree Split.categories as Set -- Key: SPARK-3164 URL: https://issues.apache.org/jira/browse/SPARK-3164 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.4.0 Improvement: computation For categorical features with many categories, it could be more efficient to store Split.categories as a Set, not a List. (It is currently a List.) A Set might be more scalable (for log n lookups), though tests would need to be done to ensure that Sets do not incur too much more overhead than Lists.
[jira] [Assigned] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params
[ https://issues.apache.org/jira/browse/SPARK-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8865: --- Assignee: Apache Spark Fix bug: init SimpleConsumerConfig with kafka params - Key: SPARK-8865 URL: https://issues.apache.org/jira/browse/SPARK-8865 Project: Spark Issue Type: Bug Components: Streaming Reporter: guowei Assignee: Apache Spark Fix For: 1.4.0
[jira] [Comment Edited] (SPARK-3164) Store DecisionTree Split.categories as Set
[ https://issues.apache.org/jira/browse/SPARK-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616246#comment-14616246 ] Rekha Joshi edited comment on SPARK-3164 at 7/7/15 6:15 AM: it was not assigned, nor closed/commented, and improvement suggestion needed in latest snapshot. But ack on decision made on api stability. thanks for letting me know [~josephkb] sad, as i updated 5 files for a trivial pursuit, with the scala style checks :-) :-) was (Author: rekhajoshm): it was not assigned, nor closed/commented, and improvement suggestion needed in latest snapshot. But ack on decision made on api stability. thanks for letting me know [~josephkb] # sad, as i updated 5 files for a trivial pursuit, with the scala style checks :-) :-) Store DecisionTree Split.categories as Set -- Key: SPARK-3164 URL: https://issues.apache.org/jira/browse/SPARK-3164 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.4.0 Improvement: computation For categorical features with many categories, it could be more efficient to store Split.categories as a Set, not a List. (It is currently a List.) A Set might be more scalable (for log n lookups), though tests would need to be done to ensure that Sets do not incur too much more overhead than Lists.
[jira] [Comment Edited] (SPARK-3164) Store DecisionTree Split.categories as Set
[ https://issues.apache.org/jira/browse/SPARK-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616246#comment-14616246 ] Rekha Joshi edited comment on SPARK-3164 at 7/7/15 6:15 AM: it was not assigned, nor closed/commented, and improvement suggestion needed in latest snapshot. But ack on decision made on api stability. thanks for letting me know [~josephkb] # sad, as i updated 5 files for a trivial pursuit, with the scala style checks :-) :-) was (Author: rekhajoshm): it was not assigned, nor closed/commented, and improvement suggestion needed in latest snapshot. But ack on decision made on api stability. thanks for letting me know [~josephkb] - - sad, as i updated 5 files for a trivial pursuit, with the scala style checks :-) :-) Store DecisionTree Split.categories as Set -- Key: SPARK-3164 URL: https://issues.apache.org/jira/browse/SPARK-3164 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.4.0 Improvement: computation For categorical features with many categories, it could be more efficient to store Split.categories as a Set, not a List. (It is currently a List.) A Set might be more scalable (for log n lookups), though tests would need to be done to ensure that Sets do not incur too much more overhead than Lists.
[jira] [Assigned] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns
[ https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-8868: --- Assignee: Yin Huai SqlSerializer2 can go into infinite loop when row consists only of NullType columns --- Key: SPARK-8868 URL: https://issues.apache.org/jira/browse/SPARK-8868 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.5.0 Reporter: Josh Rosen Assignee: Yin Huai Priority: Minor The following SQL query will cause an infinite loop in SqlSerializer2 code:
{code}
val df = sqlContext.sql("select null where 1 = 1")
df.unionAll(df).sort("_c0").collect()
{code}
The same problem occurs if we add more null-literals, but does not occur as long as there is a column of any other type (e.g. {{select 1, null where 1 == 1}}). I think that what's happening here is that if you have a row that consists only of columns of NullType (not columns of other types which happen to only contain null values, but only columns of null literals), SqlSerializer will end up writing / reading no data for rows of this type. Since the deserialization stream will never try to read any data but nevertheless will be able to return an empty row, DeserializationStream.asIterator will go into an infinite loop since there will never be a read to trigger an EOF exception.
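To illustrate the failure mode, here is a minimal Python sketch (hypothetical reader functions, not Spark's actual SqlSerializer2 code) of an iterator whose only termination condition is an end-of-file error: if deserializing one row consumes zero bytes, that error can never fire.
{code}
import io

def rows(stream, read_row):
    # Terminate only when the underlying read raises EOFError -- the same
    # stop condition used by DeserializationStream.asIterator.
    while True:
        try:
            yield read_row(stream)
        except EOFError:
            return

def read_empty_row(stream):
    # A row of only NullType columns serializes to zero bytes, so nothing
    # is read here and EOFError is never raised: rows() never terminates.
    return ()

stream = io.BytesIO(b"")
row_iter = rows(stream, read_empty_row)
# next(row_iter) succeeds forever, yielding empty rows from an empty stream.
{code}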
[jira] [Commented] (SPARK-8840) Float type coercion with hiveContext
[ https://issues.apache.org/jira/browse/SPARK-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616918#comment-14616918 ] Shivaram Venkataraman commented on SPARK-8840: -- Sorry my question is can you reproduce this in Scala shell or PySpark shell ? Float type coercion with hiveContext Key: SPARK-8840 URL: https://issues.apache.org/jira/browse/SPARK-8840 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Evgeny SInelnikov Problem with +float+ type coercion on SparkR with hiveContext.
{code}
result <- sql(hiveContext, "SELECT offset, percentage from data limit 100")
show(result)
DataFrame[offset:float, percentage:float]
head(result)
Error in as.data.frame.default(x[[i]], optional = TRUE) :
  cannot coerce class "jobj" to a data.frame
{code}
This issue looks related to SPARK-2863 (Emulate Hive type coercion in native reimplementations of Hive functions), with the same root cause: the native reimplementations of Hive are incomplete, and not only of the functions. I used spark 1.4.0 binaries from official site: http://spark.apache.org/downloads.html And running it on:
* Hortonworks HDP 2.2.0.0-2041
* with Hive 0.14
* with disabled hooks for Application Timeline Servers (ATSHook) in hive-site.xml, commented:
** hive.exec.failure.hooks,
** hive.exec.post.hooks,
** hive.exec.pre.hooks.
[jira] [Updated] (SPARK-8821) The ec2 script doesn't run on python 3 with an utf8 env
[ https://issues.apache.org/jira/browse/SPARK-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-8821: - Assignee: Simon Hafner The ec2 script doesn't run on python 3 with an utf8 env --- Key: SPARK-8821 URL: https://issues.apache.org/jira/browse/SPARK-8821 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Environment: Archlinux, UTF8 LANG env Reporter: Simon Hafner Assignee: Simon Hafner Fix For: 1.4.2, 1.5.0 Otherwise the script will crash with:
{code}
Downloading boto...
Traceback (most recent call last):
  File "ec2/spark_ec2.py", line 148, in <module>
    setup_external_libs(external_libs)
  File "ec2/spark_ec2.py", line 128, in setup_external_libs
    if hashlib.md5(tar.read()).hexdigest() != lib["md5"]:
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
{code}
In case of an utf8 env setting.
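The traceback shows the tarball being read through a text-mode decoder. A minimal sketch of the kind of fix this implies (illustrative only; the actual change landed via the pull request referenced later in this digest): open the file in binary mode so hashlib receives raw bytes.
{code}
import hashlib

def md5_of(path):
    # "rb" returns bytes on both Python 2 and 3; text mode under a UTF-8
    # locale would try to decode the gzip stream and fail on byte 0x8b.
    with open(path, "rb") as tar:
        return hashlib.md5(tar.read()).hexdigest()
{code}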
[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler (ML and PySpark)
[ https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-8704: --- Summary: Add missing methods in StandardScaler (ML and PySpark) (was: Add missing methods in StandardScaler) Add missing methods in StandardScaler (ML and PySpark) -- Key: SPARK-8704 URL: https://issues.apache.org/jira/browse/SPARK-8704 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar
* std, mean to StandardScalerModel
* getVectors, findSynonyms to Word2VecModel
* setFeatures and getFeatures to HashingTF
[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler
[ https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-8704: --- Summary: Add missing methods in StandardScaler (was: Add additional methods to wrappers in ml.pyspark.feature) Add missing methods in StandardScaler - Key: SPARK-8704 URL: https://issues.apache.org/jira/browse/SPARK-8704 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar
* std, mean to StandardScalerModel
* getVectors, findSynonyms to Word2VecModel
* setFeatures and getFeatures to HashingTF
[jira] [Resolved] (SPARK-8823) Optimizations for sparse vector products in pyspark.mllib.linalg
[ https://issues.apache.org/jira/browse/SPARK-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8823. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7222 [https://github.com/apache/spark/pull/7222] Optimizations for sparse vector products in pyspark.mllib.linalg Key: SPARK-8823 URL: https://issues.apache.org/jira/browse/SPARK-8823 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Manoj Kumar Assignee: Manoj Kumar Priority: Minor Fix For: 1.5.0 Currently we iterate over the indices and values of both sparse vectors in Python; this can be vectorized in NumPy.
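As an illustration of the proposed vectorization (a sketch only; the function and argument names are made up, and the index arrays are assumed sorted, as they are in MLlib sparse vectors):
{code}
import numpy as np

def sparse_dot(indices1, values1, indices2, values2):
    # Replace the Python-level loop over entries with NumPy set operations:
    # intersect the sorted index arrays, then gather the matching values.
    common = np.intersect1d(indices1, indices2)
    v1 = values1[np.searchsorted(indices1, common)]
    v2 = values2[np.searchsorted(indices2, common)]
    return float(np.dot(v1, v2))

# dot of {0: 1.0, 3: 2.0} and {3: 4.0, 5: 1.0} -> 8.0
print(sparse_dot(np.array([0, 3]), np.array([1.0, 2.0]),
                 np.array([3, 5]), np.array([4.0, 1.0])))
{code}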
[jira] [Resolved] (SPARK-8821) The ec2 script doesn't run on python 3 with an utf8 env
[ https://issues.apache.org/jira/browse/SPARK-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8821. -- Resolution: Fixed Fix Version/s: 1.5.0 1.4.2 Issue resolved by pull request 7215 [https://github.com/apache/spark/pull/7215] The ec2 script doesn't run on python 3 with an utf8 env --- Key: SPARK-8821 URL: https://issues.apache.org/jira/browse/SPARK-8821 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Environment: Archlinux, UTF8 LANG env Reporter: Simon Hafner Fix For: 1.4.2, 1.5.0 Otherwise the script will crash with:
{code}
Downloading boto...
Traceback (most recent call last):
  File "ec2/spark_ec2.py", line 148, in <module>
    setup_external_libs(external_libs)
  File "ec2/spark_ec2.py", line 128, in setup_external_libs
    if hashlib.md5(tar.read()).hexdigest() != lib["md5"]:
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
{code}
In case of an utf8 env setting.
[jira] [Created] (SPARK-8873) Support cleaning up shuffle files for drivers launched with Mesos
Timothy Chen created SPARK-8873: --- Summary: Support cleaning up shuffle files for drivers launched with Mesos Key: SPARK-8873 URL: https://issues.apache.org/jira/browse/SPARK-8873 Project: Spark Issue Type: Improvement Reporter: Timothy Chen With dynamic allocation enabled with Mesos, drivers can launch with shuffle data cached in the external shuffle service. However, there is no reliable way to let the shuffle service clean up the shuffle data when the driver exits, since it may crash before it notifies the shuffle service and shuffle data will be cached forever. We need to implement a reliable way to detect driver termination and clean up shuffle data accordingly.
[jira] [Updated] (SPARK-8871) Add maximal frequent itemsets filter in Spark MLib FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Svirsky updated SPARK-8871: Description: Maximal frequent itemsets can be extracted as all root-to-leaf paths (sets) from FP-Trees. (was: Maximal frequent itemsets can be extracted as all root-to-leaf paths from FP-Trees) Add maximal frequent itemsets filter in Spark MLib FPGrowth --- Key: SPARK-8871 URL: https://issues.apache.org/jira/browse/SPARK-8871 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jonathan Svirsky Maximal frequent itemsets can be extracted as all root-to-leaf paths (sets) from FP-Trees.
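For illustration, a minimal Python sketch of the path extraction being proposed (a hypothetical FPNode structure, not MLlib's internal FP-tree types; in practice the resulting paths are candidates that would still need support filtering):
{code}
class FPNode:
    def __init__(self, item=None):
        self.item = item        # None marks the root
        self.children = {}      # item -> FPNode

def root_to_leaf_paths(node, prefix=()):
    # Walk the tree depth-first, emitting the accumulated item set at each leaf.
    if node.item is not None:
        prefix = prefix + (node.item,)
    if not node.children:
        if prefix:
            yield prefix
        return
    for child in node.children.values():
        yield from root_to_leaf_paths(child, prefix)

root = FPNode()
a = root.children.setdefault("a", FPNode("a"))
a.children["b"] = FPNode("b")
root.children["c"] = FPNode("c")
print(list(root_to_leaf_paths(root)))  # [('a', 'b'), ('c',)]
{code}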
[jira] [Updated] (SPARK-8872) Improve FPGrowthSuite with equivalent R code
[ https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8872: - Description: In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code using the arules package to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98 and the test code in https://github.com/apache/spark/pull/7005. (was: In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98.) Improve FPGrowthSuite with equivalent R code Key: SPARK-8872 URL: https://issues.apache.org/jira/browse/SPARK-8872 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor Labels: starter Original Estimate: 3h Remaining Estimate: 3h In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code using the arules package to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98 and the test code in https://github.com/apache/spark/pull/7005.
[jira] [Updated] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns
[ https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8868: Target Version/s: 1.5.0 SqlSerializer2 can go into infinite loop when row consists only of NullType columns --- Key: SPARK-8868 URL: https://issues.apache.org/jira/browse/SPARK-8868 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.5.0 Reporter: Josh Rosen Assignee: Yin Huai Priority: Minor The following SQL query will cause an infinite loop in SqlSerializer2 code:
{code}
val df = sqlContext.sql("select null where 1 = 1")
df.unionAll(df).sort("_c0").collect()
{code}
The same problem occurs if we add more null-literals, but does not occur as long as there is a column of any other type (e.g. {{select 1, null where 1 == 1}}). I think that what's happening here is that if you have a row that consists only of columns of NullType (not columns of other types which happen to only contain null values, but only columns of null literals), SqlSerializer will end up writing / reading no data for rows of this type. Since the deserialization stream will never try to read any data but nevertheless will be able to return an empty row, DeserializationStream.asIterator will go into an infinite loop since there will never be a read to trigger an EOF exception.
[jira] [Assigned] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns
[ https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8868: --- Assignee: Yin Huai (was: Apache Spark) SqlSerializer2 can go into infinite loop when row consists only of NullType columns --- Key: SPARK-8868 URL: https://issues.apache.org/jira/browse/SPARK-8868 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.5.0 Reporter: Josh Rosen Assignee: Yin Huai Priority: Minor The following SQL query will cause an infinite loop in SqlSerializer2 code:
{code}
val df = sqlContext.sql("select null where 1 = 1")
df.unionAll(df).sort("_c0").collect()
{code}
The same problem occurs if we add more null-literals, but does not occur as long as there is a column of any other type (e.g. {{select 1, null where 1 == 1}}). I think that what's happening here is that if you have a row that consists only of columns of NullType (not columns of other types which happen to only contain null values, but only columns of null literals), SqlSerializer will end up writing / reading no data for rows of this type. Since the deserialization stream will never try to read any data but nevertheless will be able to return an empty row, DeserializationStream.asIterator will go into an infinite loop since there will never be a read to trigger an EOF exception.
[jira] [Assigned] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns
[ https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8868: --- Assignee: Apache Spark (was: Yin Huai) SqlSerializer2 can go into infinite loop when row consists only of NullType columns --- Key: SPARK-8868 URL: https://issues.apache.org/jira/browse/SPARK-8868 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.5.0 Reporter: Josh Rosen Assignee: Apache Spark Priority: Minor The following SQL query will cause an infinite loop in SqlSerializer2 code:
{code}
val df = sqlContext.sql("select null where 1 = 1")
df.unionAll(df).sort("_c0").collect()
{code}
The same problem occurs if we add more null-literals, but does not occur as long as there is a column of any other type (e.g. {{select 1, null where 1 == 1}}). I think that what's happening here is that if you have a row that consists only of columns of NullType (not columns of other types which happen to only contain null values, but only columns of null literals), SqlSerializer will end up writing / reading no data for rows of this type. Since the deserialization stream will never try to read any data but nevertheless will be able to return an empty row, DeserializationStream.asIterator will go into an infinite loop since there will never be a read to trigger an EOF exception.
[jira] [Updated] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns
[ https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-8868: Shepherd: Josh Rosen SqlSerializer2 can go into infinite loop when row consists only of NullType columns --- Key: SPARK-8868 URL: https://issues.apache.org/jira/browse/SPARK-8868 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.5.0 Reporter: Josh Rosen Assignee: Yin Huai Priority: Minor The following SQL query will cause an infinite loop in SqlSerializer2 code:
{code}
val df = sqlContext.sql("select null where 1 = 1")
df.unionAll(df).sort("_c0").collect()
{code}
The same problem occurs if we add more null-literals, but does not occur as long as there is a column of any other type (e.g. {{select 1, null where 1 == 1}}). I think that what's happening here is that if you have a row that consists only of columns of NullType (not columns of other types which happen to only contain null values, but only columns of null literals), SqlSerializer will end up writing / reading no data for rows of this type. Since the deserialization stream will never try to read any data but nevertheless will be able to return an empty row, DeserializationStream.asIterator will go into an infinite loop since there will never be a read to trigger an EOF exception.
[jira] [Comment Edited] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617015#comment-14617015 ] Xiangrui Meng edited comment on SPARK-6485 at 7/7/15 5:20 PM: -- [~mwdus...@us.ibm.com] Thanks for working on this! Any ETA? Please keep the first PR minimal and re-use code from Scala. We can split this JIRA into small ones if necessary. was (Author: mengxr): [~mwdus...@us.ibm.com] Thanks for working on this! Any ETA? Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark -- Key: SPARK-6485 URL: https://issues.apache.org/jira/browse/SPARK-6485 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark. Internally, we can use DataFrames for serialization.
[jira] [Updated] (SPARK-8744) StringIndexerModel should have public constructor
[ https://issues.apache.org/jira/browse/SPARK-8744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8744: - Target Version/s: 1.5.0 StringIndexerModel should have public constructor - Key: SPARK-8744 URL: https://issues.apache.org/jira/browse/SPARK-8744 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Priority: Trivial Labels: starter Original Estimate: 48h Remaining Estimate: 48h It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model.
[jira] [Updated] (SPARK-8870) Use SQLContext.getOrCreate in model save/load
[ https://issues.apache.org/jira/browse/SPARK-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8870: - Remaining Estimate: 2h Original Estimate: 2h Use SQLContext.getOrCreate in model save/load - Key: SPARK-8870 URL: https://issues.apache.org/jira/browse/SPARK-8870 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor Labels: starter Original Estimate: 2h Remaining Estimate: 2h In much of the model save/load code, we use `new SQLContext(sc)` to create a SQLContext. This could be replaced by `SQLContext.getOrCreate(sc)` to reuse an existing context.
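To illustrate why getOrCreate is preferable, a minimal Python sketch of the pattern (the code in question is Scala and calls SQLContext.getOrCreate(sc); the class and cache below are illustrative only):
{code}
class Context:
    def __init__(self, sc):
        self.sc = sc

_contexts = {}

def get_or_create(sc):
    # Reuse one context per SparkContext instead of constructing a fresh
    # one on every model save/load call.
    if sc not in _contexts:
        _contexts[sc] = Context(sc)
    return _contexts[sc]

assert get_or_create("sc") is get_or_create("sc")  # the same object is reused
{code}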
[jira] [Commented] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns
[ https://issues.apache.org/jira/browse/SPARK-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616933#comment-14616933 ] Apache Spark commented on SPARK-8868: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/7262 SqlSerializer2 can go into infinite loop when row consists only of NullType columns --- Key: SPARK-8868 URL: https://issues.apache.org/jira/browse/SPARK-8868 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.5.0 Reporter: Josh Rosen Assignee: Yin Huai Priority: Minor The following SQL query will cause an infinite loop in SqlSerializer2 code:
{code}
val df = sqlContext.sql("select null where 1 = 1")
df.unionAll(df).sort("_c0").collect()
{code}
The same problem occurs if we add more null-literals, but does not occur as long as there is a column of any other type (e.g. {{select 1, null where 1 == 1}}). I think that what's happening here is that if you have a row that consists only of columns of NullType (not columns of other types which happen to only contain null values, but only columns of null literals), SqlSerializer will end up writing / reading no data for rows of this type. Since the deserialization stream will never try to read any data but nevertheless will be able to return an empty row, DeserializationStream.asIterator will go into an infinite loop since there will never be a read to trigger an EOF exception.
[jira] [Updated] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6485: - Assignee: (was: Manoj Kumar) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark -- Key: SPARK-6485 URL: https://issues.apache.org/jira/browse/SPARK-6485 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark. Internally, we can use DataFrames for serialization.
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8445: - Description:

We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated.

h1. Instructions

h2. For contributors:
* Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned.
* For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours.

h2. For committers:
* Try to break down big features into small and specific JIRA tasks and link them properly.
* Add the starter label to starter tasks.
* Put a rough estimate on medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on JIRA.
* If the code looks good to you, please comment LGTM. For non-trivial PRs, please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC]. We only include umbrella JIRAs and high-level tasks.

h2. Algorithms and performance
* LDA improvements (SPARK-5572)
* Log-linear model for survival analysis (SPARK-8518)
* Improve GLM's scalability on number of features (SPARK-8520)
* Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727), feature importance (SPARK-5133)
* Improve GMM scalability and stability (SPARK-7206)
* Frequent pattern mining improvements (SPARK-6487)
* R-like stats for ML models (SPARK-7674)
* Generalize classification threshold to multiclass (SPARK-8069)
* A/B testing (SPARK-3147)

h2. Pipeline API
* more feature transformers (SPARK-8521)
* k-means (SPARK-7879)
* naive Bayes (SPARK-8600)
* TrainValidationSplit for tuning (SPARK-8484)
* Isotonic regression (SPARK-8671)

h2. Model persistence
* more PMML export (SPARK-8545)
* model save/load (SPARK-4587)
* pipeline persistence (SPARK-6725)

h2. Python API for ML
* List of issues identified during Spark 1.4 QA: (SPARK-7536)
* Python API for streaming ML algorithms (SPARK-3258)
* Add missing model methods (SPARK-8633)

h2. SparkR API for ML
* ML Pipeline API in SparkR (SPARK-6805)
* model.matrix for DataFrames (SPARK-6823)

h2. Documentation
* [Search for documentation improvements | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]

was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list
[jira] [Created] (SPARK-8871) Add maximal frequent itemsets filter in Spark MLib FPGrowth
Jonathan Svirsky created SPARK-8871: --- Summary: Add maximal frequent itemsets filter in Spark MLib FPGrowth Key: SPARK-8871 URL: https://issues.apache.org/jira/browse/SPARK-8871 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jonathan Svirsky Maximal frequent itemsets can be extracted as all root-to-leaf paths from FP-Trees
[jira] [Resolved] (SPARK-8711) Add additional methods to JavaModel wrappers in trees
[ https://issues.apache.org/jira/browse/SPARK-8711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8711. -- Resolution: Fixed Fix Version/s: 1.5.0 Add additional methods to JavaModel wrappers in trees - Key: SPARK-8711 URL: https://issues.apache.org/jira/browse/SPARK-8711 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar Assignee: Manoj Kumar Fix For: 1.5.0
[jira] [Commented] (SPARK-6485) Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark
[ https://issues.apache.org/jira/browse/SPARK-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617015#comment-14617015 ] Xiangrui Meng commented on SPARK-6485: -- [~mwdus...@us.ibm.com] Thanks for working on this! Any ETA? Add CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark -- Key: SPARK-6485 URL: https://issues.apache.org/jira/browse/SPARK-6485 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng We should add APIs for CoordinateMatrix/RowMatrix/IndexedRowMatrix in PySpark. Internally, we can use DataFrames for serialization.
[jira] [Updated] (SPARK-8445) MLlib 1.5 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-8445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8445: - Description:

We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated.

h1. Instructions

h2. For contributors:
* Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20labels%20%3D%20starter%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when you start working on some features. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned.
* For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps improve others' code as well as yours.

h2. For committers:
* Try to break down big features into small and specific JIRA tasks and link them properly.
* Add the starter label to starter tasks.
* Put a rough estimate on medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on JIRA.
* If the code looks good to you, please comment LGTM. For non-trivial PRs, please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and documentation if necessary.

h1. Roadmap (WIP)

This is NOT [a complete list of MLlib JIRAs for 1.5|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20%22Target%20Version%2Fs%22%20%3D%201.5.0%20ORDER%20BY%20priority%20DESC]. We only include umbrella JIRAs and high-level tasks.

h2. Algorithms and performance
* LDA improvements (SPARK-5572)
* Log-linear model for survival analysis (SPARK-8518)
* Improve GLM's scalability on number of features (SPARK-8520)
* Tree and ensembles: Move + cleanup code (SPARK-7131), provide class probabilities (SPARK-3727), feature importance (SPARK-5133)
* Improve GMM scalability and stability (SPARK-7206)
* Frequent pattern mining improvements (SPARK-7211)
* R-like stats for ML models (SPARK-7674)
* Generalize classification threshold to multiclass (SPARK-8069)
* A/B testing (SPARK-3147)

h2. Pipeline API
* more feature transformers (SPARK-8521)
* k-means (SPARK-7879)
* naive Bayes (SPARK-8600)
* TrainValidationSplit for tuning (SPARK-8484)
* Isotonic regression (SPARK-8671)

h2. Model persistence
* more PMML export (SPARK-8545)
* model save/load (SPARK-4587)
* pipeline persistence (SPARK-6725)

h2. Python API for ML
* List of issues identified during Spark 1.4 QA: (SPARK-7536)
* Python API for streaming ML algorithms (SPARK-3258)
* Add missing model methods (SPARK-8633)

h2. SparkR API for ML
* ML Pipeline API in SparkR (SPARK-6805)
* model.matrix for DataFrames (SPARK-6823)

h2. Documentation
* [Search for documentation improvements | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(Documentation)%20AND%20component%20in%20(ML%2C%20MLlib)]

was: We expect to see many MLlib contributors for the 1.5 release. To scale out the development, we created this master list for MLlib features we plan to have in Spark 1.5. Please view this list as a wish list rather than a concrete plan, because we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items to the list
[jira] [Created] (SPARK-8872) Improve FPGrowthSuite with equivalent R code
Xiangrui Meng created SPARK-8872: Summary: Improve FPGrowthSuite with equivalent R code Key: SPARK-8872 URL: https://issues.apache.org/jira/browse/SPARK-8872 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98.
[jira] [Commented] (SPARK-8596) Install and configure RStudio server on Spark EC2
[ https://issues.apache.org/jira/browse/SPARK-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616924#comment-14616924 ] Shivaram Venkataraman commented on SPARK-8596: -- You can test this by launching a new cluster with a command that looks like
{code}
./spark-ec2 -s 2 -t r3.xlarge -i pem -k key --spark-ec2-git-repo https://github.com/koaning/spark-ec2 --spark-ec2-git-branch rstudio-install launch rstudio-test
{code}
This cluster setup will now use the spark-ec2 scripts from your repo while setting things up. Once you think it's good, you can open a PR on github.com/mesos/spark-ec2 Install and configure RStudio server on Spark EC2 - Key: SPARK-8596 URL: https://issues.apache.org/jira/browse/SPARK-8596 Project: Spark Issue Type: Improvement Components: EC2, SparkR Reporter: Shivaram Venkataraman This will make it convenient for R users to use SparkR from their browsers
[jira] [Created] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType
Reynold Xin created SPARK-8866: -- Summary: Use 1 microsecond (us) precision for TimestampType Key: SPARK-8866 URL: https://issues.apache.org/jira/browse/SPARK-8866 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin 100ns is slightly weird to compute. Let's use 1us to be more consistent with other systems (e.g. Postgres) and less error prone.
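A quick arithmetic illustration of the proposed change (assuming the common representation of a timestamp as a signed 64-bit count of units since the Unix epoch):
{code}
seconds = 1_436_227_200              # 2015-07-07 00:00:00 UTC
ticks_100ns = seconds * 10_000_000   # 100ns units: an awkward factor of 10**7
micros = seconds * 1_000_000         # 1us units: 10**6, matching Postgres

# At microsecond precision a signed 64-bit value still covers roughly
# +/- 292,000 years, and second/milli/micro conversions stay powers of ten.
print(ticks_100ns, micros)
{code}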
[jira] [Closed] (SPARK-8726) Wrong spark.executor.memory when using different EC2 master and worker machine types
[ https://issues.apache.org/jira/browse/SPARK-8726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefano Parmesan closed SPARK-8726. --- Resolution: Fixed Fix Version/s: 1.4.0 Wrong spark.executor.memory when using different EC2 master and worker machine types Key: SPARK-8726 URL: https://issues.apache.org/jira/browse/SPARK-8726 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Stefano Parmesan Fix For: 1.4.0 _(this is a mirror of [MESOS-2985|https://issues.apache.org/jira/browse/MESOS-2985])_ By default, {{spark.executor.memory}} is set to the [min(slave_ram_kb, master_ram_kb)|https://github.com/mesos/spark-ec2/blob/e642aa362338e01efed62948ec0f063d5fce3242/deploy_templates.py#L32]; when using the same instance type for master and workers you will not notice, but when using different ones (which makes sense, as the master cannot be a spot instance, and using a big machine for the master would be a waste of resources) the default amount of memory given to each worker is capped to the amount of RAM available on the master (ex: if you create a cluster with an m1.small master (1.7GB RAM) and one m1.large worker (7.5GB RAM), spark.executor.memory will be set to 512MB).
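A simplified sketch of the capping logic described above (variable names and values are illustrative; the real code is the linked deploy_templates.py):
{code}
master_ram_kb = 1_700_000   # e.g. an m1.small master (1.7GB)
slave_ram_kb = 7_500_000    # e.g. an m1.large worker (7.5GB)

# Taking the min over master and worker RAM is what caps the default:
executor_memory_kb = min(slave_ram_kb, master_ram_kb)
print(executor_memory_kb)   # bounded by the small master, not the large worker
{code}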
[jira] [Commented] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616290#comment-14616290 ] Yijie Shen commented on SPARK-8866: --- I'll take this one. Use 1 microsecond (us) precision for TimestampType -- Key: SPARK-8866 URL: https://issues.apache.org/jira/browse/SPARK-8866 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin 100ns is slightly weird to compute. Let's use 1us to be more consistent with other systems (e.g. Postgres) and less error prone.
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616264#comment-14616264 ] hujiayin commented on SPARK-6724: - Can I take a look at this issue? Model import/export for FPGrowth Key: SPARK-6724 URL: https://issues.apache.org/jira/browse/SPARK-6724 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Note: experimental model API
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616301#comment-14616301 ] Adrian Wang commented on SPARK-8864: Just provided the precision of the current design, for your information. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc.
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616299#comment-14616299 ] Adrian Wang commented on SPARK-8864: no, that's not enough. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc.
[jira] [Updated] (SPARK-6912) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-6912: Summary: Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF (was: Support Map<K,V> as a return type in Hive UDF) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF Key: SPARK-6912 URL: https://issues.apache.org/jira/browse/SPARK-6912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro The current implementation can't handle Map<K,V> as a return type in Hive UDF. We assume an UDF below;
{code}
public class UDFToIntIntMap extends UDF {
    public Map<Integer, Integer> evaluate(Object o);
}
{code}
Hive supports this type, and see the links below for details;
https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163
https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118
[jira] [Commented] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616354#comment-14616354 ] Apache Spark commented on SPARK-6487: - User 'zhangjiajin' has created a pull request for this issue: https://github.com/apache/spark/pull/7258 Add sequential pattern mining algorithm to Spark MLlib -- Key: SPARK-6487 URL: https://issues.apache.org/jira/browse/SPARK-6487 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Zhang JiaJin Assignee: Zhang JiaJin [~mengxr] [~zhangyouhua] Sequential pattern mining is an important branch of pattern mining. In our past work, we used sequence mining (mainly the PrefixSpan algorithm) to find telecommunication signaling sequence patterns, and achieved good results. But once the data is too large, the running time is too long and cannot meet the service requirements. We are ready to implement the PrefixSpan algorithm in Spark and apply it to our subsequent work. Related papers:
* PrefixSpan: Pei, Jian, et al. Mining sequential patterns by pattern-growth: The PrefixSpan approach. Knowledge and Data Engineering, IEEE Transactions on 16.11 (2004): 1424-1440.
* Parallel algorithm: Cong, Shengnan, Jiawei Han, and David Padua. Parallel mining of closed sequential patterns. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005.
* Distributed algorithm: Wei, Yong-qing, Dong Liu, and Lin-shan Duan. Distributed PrefixSpan algorithm based on MapReduce. Information Technology in Medicine and Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012.
* Pattern mining and sequential mining background: Han, Jiawei, et al. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery 15.1 (2007): 55-86.
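For readers unfamiliar with the algorithm, a minimal Python sketch of PrefixSpan for sequences of single items (itemset elements and all performance concerns are omitted; this is an illustration, not the proposed MLlib implementation):
{code}
from collections import Counter

def prefixspan(sequences, min_support, prefix=()):
    # Count, per sequence, which items occur in the projected postfixes.
    counts = Counter()
    for seq in sequences:
        counts.update(set(seq))
    for item, support in counts.items():
        if support < min_support:
            continue
        pattern = prefix + (item,)
        yield pattern, support
        # Project the database: keep the postfix after the first occurrence.
        projected = [s[s.index(item) + 1:] for s in sequences if item in s]
        yield from prefixspan(projected, min_support, pattern)

db = [["a", "b", "c"], ["a", "c"], ["b", "c"]]
for pattern, support in prefixspan(db, min_support=2):
    print(pattern, support)   # e.g. ('a',) 2, ('a', 'c') 2, ('b', 'c') 2, ...
{code}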
[jira] [Created] (SPARK-8868) SqlSerializer2 can go into infinite loop when row consists only of NullType columns
Josh Rosen created SPARK-8868: - Summary: SqlSerializer2 can go into infinite loop when row consists only of NullType columns Key: SPARK-8868 URL: https://issues.apache.org/jira/browse/SPARK-8868 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.5.0 Reporter: Josh Rosen Priority: Minor The following SQL query will cause an infinite loop in SqlSerializer2 code:
{code}
val df = sqlContext.sql("select null where 1 = 1")
df.unionAll(df).sort("_c0").collect()
{code}
The same problem occurs if we add more null-literals, but does not occur as long as there is a column of any other type (e.g. {{select 1, null where 1 == 1}}). I think that what's happening here is that if you have a row that consists only of columns of NullType (not columns of other types which happen to only contain null values, but only columns of null literals), SqlSerializer will end up writing / reading no data for rows of this type. Since the deserialization stream will never try to read any data but nevertheless will be able to return an empty row, DeserializationStream.asIterator will go into an infinite loop since there will never be a read to trigger an EOF exception.
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616270#comment-14616270 ] Reynold Xin commented on SPARK-8864: 1. Yes - not 100ns. I forgot to remove that paragraph. The table indicates using 12 bytes to store interval. 2. That makes sense. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs.pdf Please see the attached design doc.
[jira] [Commented] (SPARK-6912) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616346#comment-14616346 ] Apache Spark commented on SPARK-6912: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/7257 Throw an AnalysisException when unsupported Java MapK,V types used in Hive UDF Key: SPARK-6912 URL: https://issues.apache.org/jira/browse/SPARK-6912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro The current implementation can't handle MapK,V as a return type in Hive UDF. We assume an UDF below; public class UDFToIntIntMap extends UDF { public MapInteger, Integer evaluate(Object o); } Hive supports this type, and see a link below for details; https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
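A hedged sketch of the kind of eager check this ticket asks for (the method name and exact error channel are assumptions, not the linked PR's code): reflectively inspect the UDF's evaluate method and fail at analysis time when the Java return type cannot be mapped to a Catalyst type:
{code}
import java.lang.reflect.Method

// In Spark this would surface as an AnalysisException; a plain error is
// used here to keep the sketch free of Spark internals.
def checkHiveUdfReturnType(evaluate: Method): Unit = {
  if (classOf[java.util.Map[_, _]].isAssignableFrom(evaluate.getReturnType)) {
    sys.error(s"Hive UDF return type ${evaluate.getGenericReturnType} is not " +
      "supported; fail analysis instead of crashing at runtime")
  }
}

// Hypothetical usage against the UDF quoted in the report:
// checkHiveUdfReturnType(classOf[UDFToIntIntMap].getMethod("evaluate", classOf[Object]))
{code}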
[jira] [Commented] (SPARK-8867) Show the UDF usage for user.
[ https://issues.apache.org/jira/browse/SPARK-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616357#comment-14616357 ] Apache Spark commented on SPARK-8867: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/7259 Show the UDF usage for user. Key: SPARK-8867 URL: https://issues.apache.org/jira/browse/SPARK-8867 Project: Spark Issue Type: Task Components: SQL Reporter: Cheng Hao As Hive does, we should provide a way for users to see the usage of a UDF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8864: --- Attachment: SparkSQLdatetimeudfs (1).pdf Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8864: --- Attachment: (was: SparkSQLdatetimeudfs.pdf) Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616275#comment-14616275 ] hujiayin commented on SPARK-6724: - ok, : ) Model import/export for FPGrowth Key: SPARK-6724 URL: https://issues.apache.org/jira/browse/SPARK-6724 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-6724: Comment: was deleted (was: Can I take a look at this issue?) Model import/export for FPGrowth Key: SPARK-6724 URL: https://issues.apache.org/jira/browse/SPARK-6724 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616272#comment-14616272 ] Reynold Xin commented on SPARK-8864: I filed https://issues.apache.org/jira/browse/SPARK-8866 to change TimestampType itself to 1us precision instead of 100ns. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs.pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616293#comment-14616293 ] Adrian Wang edited comment on SPARK-8864 at 7/7/15 7:34 AM: Then we are using a Long for us. Long can be up to 9.2E18, which is more than 1E8 days. Hive is using a Long for seconds and an int for nanoseconds, but I think a single Long here for day-time interval is fine. was (Author: adrian-wang): Then we are using a Long for us. Long can be up to 9.2E18, which is more than 1E11 days. Hive is using a Long for seconds and an int for nanoseconds, but I think a single Long here for day-time interval is fine. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6912) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6912: --- Assignee: (was: Apache Spark) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF Key: SPARK-6912 URL: https://issues.apache.org/jira/browse/SPARK-6912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro The current implementation can't handle Map<K,V> as a return type in Hive UDF. Assume a UDF like the one below:
{code}
public class UDFToIntIntMap extends UDF {
  public Map<Integer, Integer> evaluate(Object o);
}
{code}
Hive supports this type; see the links below for details: https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6912) Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6912: --- Assignee: Apache Spark Throw an AnalysisException when unsupported Java Map<K,V> types used in Hive UDF Key: SPARK-6912 URL: https://issues.apache.org/jira/browse/SPARK-6912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Takeshi Yamamuro Assignee: Apache Spark The current implementation can't handle Map<K,V> as a return type in Hive UDF. Assume a UDF like the one below:
{code}
public class UDFToIntIntMap extends UDF {
  public Map<Integer, Integer> evaluate(Object o);
}
{code}
Hive supports this type; see the links below for details: https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163 https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6487: --- Assignee: Zhang JiaJin (was: Apache Spark) Add sequential pattern mining algorithm to Spark MLlib -- Key: SPARK-6487 URL: https://issues.apache.org/jira/browse/SPARK-6487 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Zhang JiaJin Assignee: Zhang JiaJin [~mengxr] [~zhangyouhua] Sequential pattern mining is an important branch of pattern mining. In our past work, we used sequence mining (mainly the PrefixSpan algorithm) to find telecommunication signaling sequence patterns, with good results. But once the data grows too large, the running time becomes too long and can no longer meet the service requirements. We plan to implement the PrefixSpan algorithm in Spark and apply it to our subsequent work. Related papers: PrefixSpan: Pei, Jian, et al. Mining sequential patterns by pattern-growth: The PrefixSpan approach. Knowledge and Data Engineering, IEEE Transactions on 16.11 (2004): 1424-1440. Parallel algorithm: Cong, Shengnan, Jiawei Han, and David Padua. Parallel mining of closed sequential patterns. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005. Distributed algorithm: Wei, Yong-qing, Dong Liu, and Lin-shan Duan. Distributed PrefixSpan algorithm based on MapReduce. Information Technology in Medicine and Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012. Pattern mining and sequential mining background: Han, Jiawei, et al. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery 15.1 (2007): 55-86. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8867) Show the UDF usage for user.
Cheng Hao created SPARK-8867: Summary: Show the UDF usage for user. Key: SPARK-8867 URL: https://issues.apache.org/jira/browse/SPARK-8867 Project: Spark Issue Type: Task Components: SQL Reporter: Cheng Hao As Hive does, we should provide a way for users to see the usage of a UDF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6487: --- Assignee: Apache Spark (was: Zhang JiaJin) Add sequential pattern mining algorithm to Spark MLlib -- Key: SPARK-6487 URL: https://issues.apache.org/jira/browse/SPARK-6487 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Zhang JiaJin Assignee: Apache Spark [~mengxr] [~zhangyouhua] Sequential pattern mining is an important branch of pattern mining. In our past work, we used sequence mining (mainly the PrefixSpan algorithm) to find telecommunication signaling sequence patterns, with good results. But once the data grows too large, the running time becomes too long and can no longer meet the service requirements. We plan to implement the PrefixSpan algorithm in Spark and apply it to our subsequent work. Related papers: PrefixSpan: Pei, Jian, et al. Mining sequential patterns by pattern-growth: The PrefixSpan approach. Knowledge and Data Engineering, IEEE Transactions on 16.11 (2004): 1424-1440. Parallel algorithm: Cong, Shengnan, Jiawei Han, and David Padua. Parallel mining of closed sequential patterns. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005. Distributed algorithm: Wei, Yong-qing, Dong Liu, and Lin-shan Duan. Distributed PrefixSpan algorithm based on MapReduce. Information Technology in Medicine and Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012. Pattern mining and sequential mining background: Han, Jiawei, et al. Frequent pattern mining: current status and future directions. Data Mining and Knowledge Discovery 15.1 (2007): 55-86. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7884) Move block deserialization from BlockStoreShuffleFetcher to ShuffleReader
[ https://issues.apache.org/jira/browse/SPARK-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7884: - Assignee: Matt Massie Move block deserialization from BlockStoreShuffleFetcher to ShuffleReader - Key: SPARK-7884 URL: https://issues.apache.org/jira/browse/SPARK-7884 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Matt Massie Assignee: Matt Massie Fix For: 1.5.0 The current Spark shuffle has some hard-coded assumptions about how shuffle managers will read and write data. The BlockStoreShuffleFetcher.fetch method relies on ShuffleBlockFetcherIterator, which assumes shuffle data is written using the BlockManager.getDiskWriter method and doesn't allow for customization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616267#comment-14616267 ] Hrishikesh commented on SPARK-6724: --- [~hujiayin] sure! Model import/export for FPGrowth Key: SPARK-6724 URL: https://issues.apache.org/jira/browse/SPARK-6724 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616268#comment-14616268 ] Adrian Wang commented on SPARK-8864: Thanks for the design. Two comments: 1. If an IntervalType value is in year-month format, we cannot use 100ns to represent it. Hive uses two internal types to handle year-month and day-time intervals separately. 2. When casting TimestampType to StringType, or casting from StringType (unless it is an ISO 8601 time string that contains timezone info), we should also consider the timezone. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs.pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8867) Show the UDF usage for user.
[ https://issues.apache.org/jira/browse/SPARK-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8867: --- Assignee: (was: Apache Spark) Show the UDF usage for user. Key: SPARK-8867 URL: https://issues.apache.org/jira/browse/SPARK-8867 Project: Spark Issue Type: Task Components: SQL Reporter: Cheng Hao As Hive does, we should provide a way for users to see the usage of a UDF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8867) Show the UDF usage for user.
[ https://issues.apache.org/jira/browse/SPARK-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8867: --- Assignee: Apache Spark Show the UDF usage for user. Key: SPARK-8867 URL: https://issues.apache.org/jira/browse/SPARK-8867 Project: Spark Issue Type: Task Components: SQL Reporter: Cheng Hao Assignee: Apache Spark As Hive does, we should provide a way for users to see the usage of a UDF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8851) in Yarn client mode, Client.scala does not login even when credentials are specified
[ https://issues.apache.org/jira/browse/SPARK-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616257#comment-14616257 ] Apache Spark commented on SPARK-8851: - User 'harishreedharan' has created a pull request for this issue: https://github.com/apache/spark/pull/7255 in Yarn client mode, Client.scala does not login even when credentials are specified Key: SPARK-8851 URL: https://issues.apache.org/jira/browse/SPARK-8851 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan [#6051|https://github.com/apache/spark/pull/6051] added support for passing the credentials configuration from SparkConf, so the client mode works fine. This, however, created an issue where the Client.scala class does not log in to the KDC, requiring a kinit before running in client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8851) in Yarn client mode, Client.scala does not login even when credentials are specified
[ https://issues.apache.org/jira/browse/SPARK-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8851: --- Assignee: (was: Apache Spark) in Yarn client mode, Client.scala does not login even when credentials are specified Key: SPARK-8851 URL: https://issues.apache.org/jira/browse/SPARK-8851 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan [#6051|https://github.com/apache/spark/pull/6051] added support for passing the credentials configuration from SparkConf, so the client mode works fine. This, however, created an issue where the Client.scala class does not log in to the KDC, requiring a kinit before running in client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8851) in Yarn client mode, Client.scala does not login even when credentials are specified
[ https://issues.apache.org/jira/browse/SPARK-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8851: --- Assignee: Apache Spark in Yarn client mode, Client.scala does not login even when credentials are specified Key: SPARK-8851 URL: https://issues.apache.org/jira/browse/SPARK-8851 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Assignee: Apache Spark [#6051|https://github.com/apache/spark/pull/6051] added support for passing the credentials configuration from SparkConf, so the client mode works fine. This, however, created an issue where the Client.scala class does not log in to the KDC, requiring a kinit before running in client mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8866) Use 1 microsecond (us) precision for TimestampType
[ https://issues.apache.org/jira/browse/SPARK-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-8866: --- Assignee: Yijie Shen Use 1 microsecond (us) precision for TimestampType -- Key: SPARK-8866 URL: https://issues.apache.org/jira/browse/SPARK-8866 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Yijie Shen 100ns is slightly weird to compute. Let's use 1us to be more consistent with other systems (e.g. Postgres) and less error prone. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
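For concreteness, 1us precision boils down to conversions like the following between java.sql.Timestamp and a Long of microseconds since the epoch (helper names are assumptions; rounding of pre-epoch values is glossed over in this sketch):
{code}
import java.sql.Timestamp

// Truncates anything below 1us; getTime carries millis, getNanos the
// sub-second nanoseconds.
def toMicros(t: Timestamp): Long =
  t.getTime / 1000 * 1000000L + t.getNanos / 1000

def fromMicros(us: Long): Timestamp = {
  val t = new Timestamp(us / 1000000L * 1000)   // whole seconds, as millis
  t.setNanos((us % 1000000L).toInt * 1000)      // remaining micros, as nanos
  t
}
{code}
With a 1us unit these are plain multiplications and divisions by powers of ten, which is the "less error prone" point: a 100ns unit forces awkward factors of 10,000,000 between seconds and ticks.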
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616293#comment-14616293 ] Adrian Wang commented on SPARK-8864: Then we are using a Long for us. Long can be up to 9.2E18, which is more than 1E11 days. Hive is using a Long for seconds and an int for nanoseconds, but I think a single Long here for day-time interval is fine. Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
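A quick check of the Long-capacity claim (the comment edit earlier in this thread corrects the figure to roughly 1E8 days, which this arithmetic confirms):
{code}
val microsPerDay = 24L * 60 * 60 * 1000 * 1000   // 86,400,000,000
val maxDays = Long.MaxValue / microsPerDay       // 106,751,991 days
println(maxDays)                                 // ~1.07E8 days, ~292,000 years
{code}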
[jira] [Commented] (SPARK-8685) dataframe left joins are not working as expected in pyspark
[ https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616340#comment-14616340 ] Apache Spark commented on SPARK-8685: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7256 dataframe left joins are not working as expected in pyspark --- Key: SPARK-8685 URL: https://issues.apache.org/jira/browse/SPARK-8685 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 1.4.0 Environment: ubuntu 14.04 Reporter: axel dahl Assignee: Davies Liu I have the following code:
{code}
from pyspark.sql import SQLContext

d1 = [{'name': 'bob', 'country': 'usa', 'age': 1},
      {'name': 'alice', 'country': 'jpn', 'age': 2},
      {'name': 'carol', 'country': 'ire', 'age': 3}]
d2 = [{'name': 'bob', 'country': 'usa', 'colour': 'red'},
      {'name': 'carol', 'country': 'ire', 'colour': 'green'}]

r1 = sc.parallelize(d1)
r2 = sc.parallelize(d2)

sqlContext = SQLContext(sc)
df1 = sqlContext.createDataFrame(d1)
df2 = sqlContext.createDataFrame(d2)
df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 'left_outer').collect()
{code}
When I run it I get the following (notice that in the first row all join keys are taken from the right side and so are blanked out):
{code}
[Row(age=2, country=None, name=None, colour=None, country=None, name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'),
 Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', name=u'alice')]
{code}
I would expect to get (though ideally without duplicate columns):
{code}
[Row(age=2, country=u'jpn', name=u'alice', colour=None, country=None, name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'),
 Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', name=u'carol')]
{code}
The workaround for now is this rather clunky piece of code:
{code}
df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 'name2').withColumnRenamed('country', 'country2')
df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 'left_outer').collect()
{code}
Also, {{.show()}} works:
{code}
sqlContext = SQLContext(sc)
df1 = sqlContext.createDataFrame(d1)
df2 = sqlContext.createDataFrame(d2)
df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 'left_outer').show()

+---+-------+-----+------+-------+-----+
|age|country| name|colour|country| name|
+---+-------+-----+------+-------+-----+
|  3|    ire|carol| green|    ire|carol|
|  2|    jpn|alice|  null|   null| null|
|  1|    usa|  bob|   red|    usa|  bob|
+---+-------+-----+------+-------+-----+
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
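A less clunky variant of the rename workaround, sketched here in the Scala DataFrame API (the same alias technique exists in PySpark); this is a suggestion under the assumption that qualified column references resolve correctly once each side carries an alias:
{code}
import org.apache.spark.sql.functions.col

// df1, df2: the two DataFrames from the report above.
val left = df1.as("l")
val right = df2.as("r")
left.join(right,
  col("l.name") === col("r.name") && col("l.country") === col("r.country"),
  "left_outer")
{code}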
[jira] [Commented] (SPARK-8864) Date/time function and data type design
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616296#comment-14616296 ] Reynold Xin commented on SPARK-8864: Are you suggesting we use a single 8 byte long to store both the number of months and the number of microseconds? Date/time function and data type design --- Key: SPARK-8864 URL: https://issues.apache.org/jira/browse/SPARK-8864 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.5.0 Attachments: SparkSQLdatetimeudfs (1).pdf Please see the attached design doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8223) math function: shiftleft
[ https://issues.apache.org/jira/browse/SPARK-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8223: - Assignee: Tarek Auel (was: zhichao-li) math function: shiftleft Key: SPARK-8223 URL: https://issues.apache.org/jira/browse/SPARK-8223 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Tarek Auel Fix For: 1.5.0 shiftleft(INT a) shiftleft(BIGINT a) Bitwise left shift (as of Hive 1.2.0). Returns int for tinyint, smallint and int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8224) math function: shiftright
[ https://issues.apache.org/jira/browse/SPARK-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8224: - Assignee: Tarek Auel (was: zhichao-li) math function: shiftright - Key: SPARK-8224 URL: https://issues.apache.org/jira/browse/SPARK-8224 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Tarek Auel Fix For: 1.5.0 shiftrightunsigned(INT a), shiftrightunsigned(BIGINT a) Bitwise unsigned right shift (as of Hive 1.2.0). Returns int for tinyint, smallint and int a. Returns bigint for bigint a. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
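The semantics described in these two shift-function tickets map directly onto the JVM shift operators; a small Scala illustration (the type-preservation rule is the point: int inputs stay int, bigint inputs stay long):
{code}
val i: Int = 6
val l: Long = 6L
assert((i << 2) == 24)            // shiftleft on int -> int
assert((l << 2) == 24L)           // shiftleft on bigint -> bigint
assert((-8 >> 1) == -4)           // signed right shift keeps the sign
assert((-8 >>> 1) == 2147483644)  // unsigned right shift zero-fills the sign bit
{code}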
[jira] [Updated] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params
[ https://issues.apache.org/jira/browse/SPARK-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8865: - Description: (was: [~guowei2] Again, please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. You're not setting the fields in your JIRAs as requested.) [~guowei2] Again, please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. You're not setting the fields in your JIRAs as requested. Fix bug: init SimpleConsumerConfig with kafka params - Key: SPARK-8865 URL: https://issues.apache.org/jira/browse/SPARK-8865 Project: Spark Issue Type: Bug Components: Streaming Reporter: guowei Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8865) Fix bug: init SimpleConsumerConfig with kafka params
[ https://issues.apache.org/jira/browse/SPARK-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8865: - Priority: Minor (was: Major) Fix Version/s: (was: 1.4.0) Description: [~guowei2] Again, please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. You're not setting the fields in your JIRAs as requested. Fix bug: init SimpleConsumerConfig with kafka params - Key: SPARK-8865 URL: https://issues.apache.org/jira/browse/SPARK-8865 Project: Spark Issue Type: Bug Components: Streaming Reporter: guowei Priority: Minor [~guowei2] Again, please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. You're not setting the fields in your JIRAs as requested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8674) [WIP] 2-sample, 2-sided Kolmogorov Smirnov Test Implementation
[ https://issues.apache.org/jira/browse/SPARK-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Cambronero updated SPARK-8674: --- Summary: [WIP] 2-sample, 2-sided Kolmogorov Smirnov Test Implementation (was: 2-sample, 2-sided Kolmogorov Smirnov Test Implementation) [WIP] 2-sample, 2-sided Kolmogorov Smirnov Test Implementation -- Key: SPARK-8674 URL: https://issues.apache.org/jira/browse/SPARK-8674 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Jose Cambronero Priority: Minor We added functionality to calculate a 2-sample, 2-sided Kolmogorov-Smirnov test for two RDD[Double]s. The calculation provides a test of the null hypothesis that both samples come from the same probability distribution. The implementation seeks to minimize the necessary shuffles. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
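For reference, a single-machine sketch of the 2-sample, 2-sided KS statistic: the supremum distance between the two empirical CDFs, which it suffices to evaluate at the pooled sample points. The distributed version proposed in the ticket additionally has to minimize shuffles:
{code}
def ksStatistic(xs: Array[Double], ys: Array[Double]): Double = {
  val xsSorted = xs.sorted
  val ysSorted = ys.sorted
  // Fraction of `sorted` that is <= v, via binary search.
  def ecdf(sorted: Array[Double])(v: Double): Double = {
    var lo = 0
    var hi = sorted.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (sorted(mid) <= v) lo = mid + 1 else hi = mid
    }
    lo.toDouble / sorted.length
  }
  (xsSorted ++ ysSorted)
    .map(v => math.abs(ecdf(xsSorted)(v) - ecdf(ysSorted)(v)))
    .max
}
{code}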
[jira] [Updated] (SPARK-7422) Add argmax to Vector, SparseVector
[ https://issues.apache.org/jira/browse/SPARK-7422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7422: - Shepherd: Xiangrui Meng Add argmax to Vector, SparseVector -- Key: SPARK-7422 URL: https://issues.apache.org/jira/browse/SPARK-7422 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Labels: starter DenseVector has an argmax method which is currently private to Spark. It would be nice to add that method to Vector and SparseVector. Adding it to SparseVector would require being careful about handling the inactive elements correctly and efficiently. We should make argmax public and add unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
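A sketch of the care needed around inactive elements (the method shape is an assumption, not the final Vector API): if the largest active value is non-positive and the vector has inactive slots, an implicit zero wins:
{code}
// `indices` are the active positions (sorted ascending), `values` their values.
def sparseArgmax(size: Int, indices: Array[Int], values: Array[Double]): Int = {
  require(size > 0, "argmax of an empty vector is undefined")
  if (values.isEmpty) return 0   // all implicit zeros: index 0 is an argmax
  var best = 0
  var i = 1
  while (i < values.length) {
    if (values(i) > values(best)) best = i
    i += 1
  }
  if (values(best) > 0.0 || values.length == size) indices(best)
  else {
    // Max active value <= 0 and implicit zeros exist: any inactive index
    // (value 0.0) is at least as large; return the first one.
    val active = indices.toSet
    (0 until size).find(idx => !active(idx)).get
  }
}
{code}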
[jira] [Updated] (SPARK-8484) Add TrainValidationSplit to ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-8484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8484: - Priority: Critical (was: Major) Add TrainValidationSplit to ml.tuning - Key: SPARK-8484 URL: https://issues.apache.org/jira/browse/SPARK-8484 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Assignee: Martin Zapletal Priority: Critical Add TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
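A minimal sketch of the proposed behavior (names and signatures here are assumptions, not the eventual ml.tuning API): split once, fit every candidate on the train side, keep the candidate with the best validation metric. Unlike k-fold CrossValidator, each setting is fit exactly once, hence "less expensive":
{code}
import org.apache.spark.rdd.RDD

def selectBest[T, Model](
    data: RDD[T],
    trainRatio: Double,
    candidates: Seq[RDD[T] => Model],   // one fitting function per param setting
    metric: (Model, RDD[T]) => Double): Model = {
  val Array(train, validation) =
    data.randomSplit(Array(trainRatio, 1.0 - trainRatio), seed = 42L)
  train.cache()
  validation.cache()
  candidates.map(fit => fit(train)).maxBy(model => metric(model, validation))
}
{code}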
[jira] [Commented] (SPARK-8627) ALS model predict error
[ https://issues.apache.org/jira/browse/SPARK-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617284#comment-14617284 ] Xiangrui Meng commented on SPARK-8627: -- The code looks okay to me. Which Spark version did you use, and Scala version? ALS model predict error --- Key: SPARK-8627 URL: https://issues.apache.org/jira/browse/SPARK-8627 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.4.0 Reporter: Subhod Lagade
{code}
/**
 * Created by subhod lagade on 25/06/15.
 */
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import java.io.BufferedReader
import java.io.FileInputStream
import java.io.IOException
import java.io.InputStreamReader
import java.io.PrintStream
import java.net.ServerSocket
import java.net.Socket
import java.util.Properties
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating

object SparkStreamKafka {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val data = sc.textFile("/home/appadmin/Disney/data.csv")
    val ratings = data.map(_.split(',') match {
      case Array(user, product, rate) => Rating(user.toInt, product.toInt, rate.toDouble)
    })
    val rank = 3
    val numIterations = 2
    val model = ALS.train(ratings, rank, numIterations, 0.01)
    // Build the recommendation model using ALS
    val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }
    usersProducts.foreach(println)
    val predictions = model.predict(usersProducts)
  }
}

/* ERROR Message
[ERROR] /home/appadmin/disneypoc/src/main/scala/org/capgemini/SparkKafka.scala:53: error: not enough arguments for method predict: (user: Int, product: Int)Double.
[INFO] Unspecified value parameter product.
[INFO] val predictions = model.predict(usersProducts)
*/
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8559) Support association rule generation in FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8559. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7005 [https://github.com/apache/spark/pull/7005] Support association rule generation in FPGrowth --- Key: SPARK-8559 URL: https://issues.apache.org/jira/browse/SPARK-8559 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guangwen Liu Assignee: Feynman Liang Fix For: 1.5.0 It will be more useful and practical for real applications to include the association rule generation part, even though it is not hard for a user to derive association rules from the frequent itemsets (with their frequencies) output by FP-growth. However, how to generate association rules in an efficient way is not widely reported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
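The core of rule generation fits in a short single-machine sketch (the merged implementation does this distributively; the names here are illustrative): for each frequent itemset X and each item a in X, emit the rule X \ {a} => a whenever support(X) / support(X \ {a}) clears the confidence threshold:
{code}
def associationRules(
    freqItemsets: Map[Set[String], Long],   // frequent itemset -> support count
    minConfidence: Double): Seq[(Set[String], String, Double)] =
  for {
    (itemset, cnt) <- freqItemsets.toSeq
    if itemset.size > 1
    item <- itemset.toSeq
    antecedent = itemset - item
    antecedentCnt <- freqItemsets.get(antecedent).toSeq
    confidence = cnt.toDouble / antecedentCnt
    if confidence >= minConfidence
  } yield (antecedent, item, confidence)
{code}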
[jira] [Commented] (SPARK-8872) Improve FPGrowthSuite with equivalent R code
[ https://issues.apache.org/jira/browse/SPARK-8872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617123#comment-14617123 ] Kashif Rasul commented on SPARK-8872: - I would like to work on this. Improve FPGrowthSuite with equivalent R code Key: SPARK-8872 URL: https://issues.apache.org/jira/browse/SPARK-8872 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Priority: Minor Labels: starter Original Estimate: 3h Remaining Estimate: 3h In `FPGrowthSuite`, we only tested output with minSupport 0.5, where the expected output is hard-coded. We can add equivalent R code using the arules package to generate the expected output for validation purposes, similar to https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala#L98 and the test code in https://github.com/apache/spark/pull/7005. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617159#comment-14617159 ] Sean Owen commented on SPARK-7917: -- I'm thinking of the two I mentioned above, in particular maybe SPARK-7503? Spark doesn't clean up Application Directories (local dirs) Key: SPARK-7917 URL: https://issues.apache.org/jira/browse/SPARK-7917 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zach Fry Priority: Minor Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs; however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs and find directories that don't contain any files and clear those out. It's a pretty simple repro: Run a job that does some shuffling, wait for the shuffle files to get cleaned up, go and look on disk at spark.local.dir and notice that the directory(s) are still there, but there are no files in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8840) Float type coercion with hiveContext
[ https://issues.apache.org/jira/browse/SPARK-8840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617267#comment-14617267 ] Evgeny SInelnikov commented on SPARK-8840: -- I tested it with Spark SQL - the problem is not reproduced there. I looked into the SparkR sources and found that deserialization for the _float_ type is not implemented. It is implemented for _double_, and that works:
{code}
sql(hiveContext, "CREATE TABLE float_table (fl float, db double) row format delimited fields terminated by ','")
result <- sql(hiveContext, "SELECT * from float_table")
head(result)
Error in readTypedObject(con, type) : Unsupported type for deserialization

result <- sql(hiveContext, "LOAD DATA INPATH 'data.csv' INTO TABLE float_table")
result <- sql(hiveContext, "SELECT * from float_table")
head(result)
Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class jobj to a data.frame

result <- sql(hiveContext, "SELECT db from float_table")
head(result)
   db
1 1.1
2 2.0
{code}
Float type coercion with hiveContext Key: SPARK-8840 URL: https://issues.apache.org/jira/browse/SPARK-8840 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Evgeny SInelnikov Problem with +float+ type coercion in SparkR with hiveContext.
{code}
result <- sql(hiveContext, "SELECT offset, percentage from data limit 100")
show(result)
DataFrame[offset:float, percentage:float]
head(result)
Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class jobj to a data.frame
{code}
This problem apparently already exists (SPARK-2863 - Emulate Hive type coercion in native reimplementations of Hive functions) for the same reason - incomplete native reimplementations of Hive, not only of its functions. I used spark 1.4.0 binaries from the official site: http://spark.apache.org/downloads.html And running it on:
* Hortonworks HDP 2.2.0.0-2041
* with Hive 0.14
* with disabled hooks for Application Timeline Servers (ATSHook) in hive-site.xml, commented:
** hive.exec.failure.hooks,
** hive.exec.post.hooks,
** hive.exec.pre.hooks.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617278#comment-14617278 ] Sean Owen commented on SPARK-7917: -- Oops, I meant executor. At least, I'm looking at Utils.getLocalFile, which ultimately calls getOrCreateLocalRootDirsImpl. You can see that on YARN, this uses YARN's dir and doesn't delete it on exit (YARN manages it). In the case that the spark.local.dir config takes hold, you can also see it creates the dir if it doesn't exist and will delete it on shutdown in that case. However, a few possible cases jump out where the dir is not deleted: - SPARK_EXECUTOR_DIRS is set - spark.local.dir is set but the dir already exists That is, it seems not to delete dirs that were managed or set up externally. Does that explain this, maybe? Spark doesn't clean up Application Directories (local dirs) Key: SPARK-7917 URL: https://issues.apache.org/jira/browse/SPARK-7917 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zach Fry Priority: Minor Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs; however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs and find directories that don't contain any files and clear those out. It's a pretty simple repro: Run a job that does some shuffling, wait for the shuffle files to get cleaned up, go and look on disk at spark.local.dir and notice that the directory(s) are still there, but there are no files in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
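A sketch of the decision being described (simplified; all names are assumptions): only dirs Spark created itself get a deletion hook, which would explain why externally provided dirs survive:
{code}
import java.io.File
import java.nio.file.Files

def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) Option(f.listFiles()).toSeq.flatten.foreach(deleteRecursively)
  f.delete()
}

// Pre-existing (externally managed) dirs are returned as-is; dirs we
// create ourselves are registered for deletion at JVM shutdown.
def scratchDir(configured: Option[File]): File = configured match {
  case Some(existing) if existing.isDirectory =>
    existing
  case maybe =>
    val dir = maybe.getOrElse(Files.createTempDirectory("spark-").toFile)
    dir.mkdirs()
    sys.addShutdownHook(deleteRecursively(dir))
    dir
}
{code}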
[jira] [Updated] (SPARK-8874) Add missing methods in Word2Vec ML
[ https://issues.apache.org/jira/browse/SPARK-8874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-8874: --- Component/s: PySpark ML Add missing methods in Word2Vec ML -- Key: SPARK-8874 URL: https://issues.apache.org/jira/browse/SPARK-8874 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar Add getVectors and findSynonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8839) Thrift Server will throw `java.util.NoSuchElementException: key not found` exception when many clients connect to it
[ https://issues.apache.org/jira/browse/SPARK-8839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-8839: -- Shepherd: Yi Tian Thrift Server will throw `java.util.NoSuchElementException: key not found` exception when many clients connect to it - Key: SPARK-8839 URL: https://issues.apache.org/jira/browse/SPARK-8839 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: SaintBacchus If there are about 150+ JDBC clients connecting to the Thrift Server, some clients will throw an exception such as:
{code:title=Exception message|borderStyle=solid}
java.sql.SQLException: java.util.NoSuchElementException: key not found: 90d93e56-7f6d-45bf-b340-e3ee09dd60fc
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:155)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7555) User guide update for ElasticNet
[ https://issues.apache.org/jira/browse/SPARK-7555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7555: - Shepherd: Joseph K. Bradley User guide update for ElasticNet Key: SPARK-7555 URL: https://issues.apache.org/jira/browse/SPARK-7555 Project: Spark Issue Type: Documentation Components: ML Reporter: Joseph K. Bradley Assignee: Shuo Xiang Copied from [SPARK-7443]: {quote} Now that we have algorithms in spark.ml which are not in spark.mllib, we should start making subsections for the spark.ml API as needed. We can follow the structure of the spark.mllib user guide. * The spark.ml user guide can provide: (a) code examples and (b) info on algorithms which do not exist in spark.mllib. * We should not duplicate info in the spark.ml guides. Since spark.mllib is still the primary API, we should provide links to the corresponding algorithms in the spark.mllib user guide for more info. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler (ML and PySpark)
[ https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8704: - Description: std, mean to StandardScalerModel ~~getVectors, findSynonyms to Word2Vec Model~~ ~~setFeatures and getFeatures to hashingTF~~ was: std, mean to StandardScalerModel getVectors, findSynonyms to Word2Vec Model setFeatures and getFeatures to hashingTF Add missing methods in StandardScaler (ML and PySpark) -- Key: SPARK-8704 URL: https://issues.apache.org/jira/browse/SPARK-8704 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar Assignee: Manoj Kumar Fix For: 1.5.0 std, mean to StandardScalerModel ~~getVectors, findSynonyms to Word2Vec Model~~ ~~setFeatures and getFeatures to hashingTF~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7879) KMeans API for spark.ml Pipelines
[ https://issues.apache.org/jira/browse/SPARK-7879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7879: - Priority: Critical (was: Major) KMeans API for spark.ml Pipelines - Key: SPARK-7879 URL: https://issues.apache.org/jira/browse/SPARK-7879 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Assignee: Yu Ishikawa Priority: Critical Create a K-Means API for the spark.ml Pipelines API. This should wrap the existing KMeans implementation in spark.mllib. This should be the first clustering method added to Pipelines, and it will be important to consider [SPARK-7610] and think about designing the clustering API. We do not have to have abstractions from the beginning (and probably should not) but should think far enough ahead so we can add abstractions later on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8874) Add missing methods in Word2Vec ML
[ https://issues.apache.org/jira/browse/SPARK-8874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617282#comment-14617282 ] Apache Spark commented on SPARK-8874: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/7263 Add missing methods in Word2Vec ML -- Key: SPARK-8874 URL: https://issues.apache.org/jira/browse/SPARK-8874 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar Add getVectors and findSynonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7337) FPGrowth algo throwing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7337: - Shepherd: Xiangrui Meng FPGrowth algo throwing OutOfMemoryError --- Key: SPARK-7337 URL: https://issues.apache.org/jira/browse/SPARK-7337 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Environment: Ubuntu Reporter: Amit Gupta Attachments: FPGrowthBug.png When running the FPGrowth algo on huge data (GBs in size) with numPartitions=500, it throws OutOfMemoryError after some time. The algo runs correctly up to collect at FPGrowth.scala:131, where it creates 500 tasks. It fails at the next stage, flatMap at FPGrowth.scala:150, where instead of 500 tasks it creates only 17 internally calculated tasks. Please refer to the attachment - print screen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8400) ml.ALS doesn't handle -1 block size
[ https://issues.apache.org/jira/browse/SPARK-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8400: - Shepherd: Xiangrui Meng ml.ALS doesn't handle -1 block size --- Key: SPARK-8400 URL: https://issues.apache.org/jira/browse/SPARK-8400 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.3.1 Reporter: Xiangrui Meng Under spark.mllib, if number blocks is set to -1, we set the block size automatically based on the input partition size. However, this behavior is not preserved in the spark.ml API. If user sets -1 in Spark 1.3, it will not work, but no error messages will show. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
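For concreteness, the spark.mllib behavior the spark.ml API should preserve looks roughly like the following; the exact auto-sizing formula is an assumption based on the old ALS code, so treat this as a sketch:
{code}
// -1 means "choose the number of blocks from the data layout"; any other
// value is taken literally.
def resolveNumBlocks(requested: Int, defaultParallelism: Int, numPartitions: Int): Int =
  if (requested == -1) math.max(defaultParallelism, numPartitions / 2)
  else requested
{code}
The bug is that spark.ml passes -1 straight through, so neither this auto-sizing nor an error message kicks in.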
[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617202#comment-14617202 ] Matt Cheah commented on SPARK-7917: --- Just wanted to clarify: Worker shutdown, or executor shutdown? We have long-running workers, so we would want directories to be cleaned up on executor shutdown, not worker shutdown. Spark doesn't clean up Application Directories (local dirs) Key: SPARK-7917 URL: https://issues.apache.org/jira/browse/SPARK-7917 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zach Fry Priority: Minor Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs; however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs and find directories that don't contain any files and clear those out. It's a pretty simple repro: Run a job that does some shuffling, wait for the shuffle files to get cleaned up, go and look on disk at spark.local.dir and notice that the directory(s) are still there, but there are no files in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5016: - Shepherd: Xiangrui Meng GaussianMixtureEM should distribute matrix inverse for large numFeatures, k --- Key: SPARK-5016 URL: https://issues.apache.org/jira/browse/SPARK-5016 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Assignee: Feynman Liang Labels: clustering If numFeatures or k are large, GMM EM should distribute the matrix inverse computation for Gaussian initialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
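The improvement amounts to moving a per-Gaussian linear-algebra step off the driver; a sketch with Breeze (illustrative only, not the GaussianMixtureEM internals):
{code}
import breeze.linalg.{inv, DenseMatrix => BDM}
import org.apache.spark.SparkContext

// With large k and numFeatures, invert the k covariance matrices in
// parallel across the cluster instead of serially on the driver.
def distributedInverses(sc: SparkContext, covs: Seq[BDM[Double]]): Array[BDM[Double]] =
  sc.parallelize(covs, numSlices = covs.length).map(m => inv(m)).collect()
{code}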
[jira] [Commented] (SPARK-7337) FPGrowth algo throwing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617290#comment-14617290 ] Xiangrui Meng commented on SPARK-7337: -- [~amit.gupta.niit-tech] Any updates? FPGrowth algo throwing OutOfMemoryError --- Key: SPARK-7337 URL: https://issues.apache.org/jira/browse/SPARK-7337 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Environment: Ubuntu Reporter: Amit Gupta Attachments: FPGrowthBug.png When running the FPGrowth algo on huge data (GBs in size) with numPartitions=500, it throws OutOfMemoryError after some time. The algo runs correctly up to collect at FPGrowth.scala:131, where it creates 500 tasks. It fails at the next stage, flatMap at FPGrowth.scala:150, where instead of 500 tasks it creates only 17 internally calculated tasks. Please refer to the attachment - print screen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617177#comment-14617177 ] Sean Owen commented on SPARK-7917: -- Right, this is about standalone. There's https://github.com/apache/spark/pull/3705 but that was in 1.3. IIRC it looks like this dir gets cleaned up pretty reliably on worker shutdown if the JVM can exit normally, so I think the question is: does it still happen on master? And what causes the normal code path not to happen? Spark doesn't clean up Application Directories (local dirs) Key: SPARK-7917 URL: https://issues.apache.org/jira/browse/SPARK-7917 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zach Fry Priority: Minor Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs; however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs and find directories that don't contain any files and clear those out. It's a pretty simple repro: Run a job that does some shuffling, wait for the shuffle files to get cleaned up, go and look on disk at spark.local.dir and notice that the directory(s) are still there, but there are no files in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8874) Add missing methods in Word2Vec ML
Manoj Kumar created SPARK-8874: -- Summary: Add missing methods in Word2Vec ML Key: SPARK-8874 URL: https://issues.apache.org/jira/browse/SPARK-8874 Project: Spark Issue Type: New Feature Reporter: Manoj Kumar Add getVectors and findSynonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8400) ml.ALS doesn't handle -1 block size
[ https://issues.apache.org/jira/browse/SPARK-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617292#comment-14617292 ] Xiangrui Meng commented on SPARK-8400: -- [~bryanc] Are you still working on this issue? ml.ALS doesn't handle -1 block size --- Key: SPARK-8400 URL: https://issues.apache.org/jira/browse/SPARK-8400 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.3.1 Reporter: Xiangrui Meng Under spark.mllib, if the number of blocks is set to -1, we set the block size automatically based on the input partition size. However, this behavior is not preserved in the spark.ml API. If a user sets -1 in Spark 1.3, it does not work, but no error message is shown. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
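To make the discrepancy concrete, a sketch (not a test from the ticket; the toy rating data is made up, and rank/iterations/lambda are arbitrary):

{code:scala}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.ml.recommendation.{ALS => MLALS}

// Toy data, assuming an existing SparkContext sc.
val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)
))

// spark.mllib: blocks = -1 means "auto-configure from the input partitions".
val mllibModel = ALS.train(ratings, 10 /* rank */, 10 /* iterations */,
                           0.01 /* lambda */, -1 /* blocks: auto */)

// spark.ml in 1.3: -1 is accepted silently, but the auto-configuration
// never happens and no error is raised -- the bug described above.
val mlAls = new MLALS().setNumBlocks(-1)
{code}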
[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617115#comment-14617115 ] Matt Cheah commented on SPARK-7917: --- [~sowen] Was there a patch specifically written in master or 1.4.x that fixed this? Can you link a specific PR? Spark doesn't clean up Application Directories (local dirs) Key: SPARK-7917 URL: https://issues.apache.org/jira/browse/SPARK-7917 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zach Fry Priority: Minor Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs; however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs and find directories that don't contain any files and clear those out. It's a pretty simple repro: run a job that does some shuffling, wait for the shuffle files to get cleaned up, then look on disk at spark.local.dir and notice that the directories are still there, but there are no files in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617166#comment-14617166 ] Matt Cheah commented on SPARK-7917: --- Definitely not 7503 - the PR there only did things for YARN mode: https://github.com/apache/spark/pull/6026 Spark doesn't clean up Application Directories (local dirs) Key: SPARK-7917 URL: https://issues.apache.org/jira/browse/SPARK-7917 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zach Fry Priority: Minor Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs; however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs and find directories that don't contain any files and clear those out. It's a pretty simple repro: run a job that does some shuffling, wait for the shuffle files to get cleaned up, then look on disk at spark.local.dir and notice that the directories are still there, but there are no files in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7917) Spark doesn't clean up Application Directories (local dirs)
[ https://issues.apache.org/jira/browse/SPARK-7917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617166#comment-14617166 ] Matt Cheah edited comment on SPARK-7917 at 7/7/15 6:45 PM: --- Definitely not SPARK-7503 - the PR there only did things for YARN mode: https://github.com/apache/spark/pull/6026 was (Author: mcheah): Definitely not 7503 - the PR there only did things for YARN mode: https://github.com/apache/spark/pull/6026 Spark doesn't clean up Application Directories (local dirs) Key: SPARK-7917 URL: https://issues.apache.org/jira/browse/SPARK-7917 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zach Fry Priority: Minor Similar to SPARK-4834. Spark does clean up the cache and lock files in the local dirs; however, it doesn't clean up the actual directories. We have to write custom scripts to go back through the local dirs and find directories that don't contain any files and clear those out. It's a pretty simple repro: run a job that does some shuffling, wait for the shuffle files to get cleaned up, then look on disk at spark.local.dir and notice that the directories are still there, but there are no files in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed
[ https://issues.apache.org/jira/browse/SPARK-8743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617184#comment-14617184 ] Juan Rodríguez Hortalá commented on SPARK-8743: --- Hi, I guess you already have a good test for this, but just in case, here is a minimal example for this issue: https://gist.github.com/juanrh/464155a3aabbf2c3afa8 Deregister Codahale metrics for streaming when StreamingContext is closed -- Key: SPARK-8743 URL: https://issues.apache.org/jira/browse/SPARK-8743 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.4.1 Reporter: Tathagata Das Assignee: Neelesh Srinivas Salian Labels: starter Currently, when the StreamingContext is closed, the registered metrics are not deregistered. If another streaming context is started, it throws a warning saying that the metrics are already registered. The solution is to deregister the metrics when the StreamingContext is stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
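The linked gist boils down to roughly the following sketch; the batch interval, app name, and queue-based input are arbitrary choices made just to get a runnable pipeline:

{code:scala}
import scala.collection.mutable.Queue
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Repro sketch: stop one StreamingContext, then start another in the same
// JVM. Because the first context's Codahale sources were never
// deregistered, the second one warns that the metrics already exist.
val conf = new SparkConf().setMaster("local[2]").setAppName("MetricsRepro")

val ssc1 = new StreamingContext(conf, Seconds(1))
ssc1.queueStream(Queue(ssc1.sparkContext.makeRDD(1 to 10))).print()
ssc1.start()
ssc1.stop(stopSparkContext = false)

val ssc2 = new StreamingContext(ssc1.sparkContext, Seconds(1))
ssc2.queueStream(Queue(ssc2.sparkContext.makeRDD(1 to 10))).print()
ssc2.start() // the "metrics already registered" warning shows up here
ssc2.stop()
{code}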
[jira] [Updated] (SPARK-8704) Add missing methods in StandardScaler (ML and PySpark)
[ https://issues.apache.org/jira/browse/SPARK-8704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8704: - Description: Add std, mean to StandardScalerModel (was: std, mean to StandardScalerModel ~~getVectors, findSynonyms to Word2Vec Model~~ ~~setFeatures and getFeatures to hashingTF~~) Add missing methods in StandardScaler (ML and PySpark) -- Key: SPARK-8704 URL: https://issues.apache.org/jira/browse/SPARK-8704 Project: Spark Issue Type: New Feature Components: ML, PySpark Reporter: Manoj Kumar Assignee: Manoj Kumar Fix For: 1.5.0 Add std, mean to StandardScalerModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
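For reference, the fields in question already exist on the spark.mllib StandardScalerModel; this ticket tracks exposing them in ML and PySpark. A short sketch against the mllib API, assuming an existing SparkContext sc:

{code:scala}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

// The existing mllib accessors this ticket wants mirrored in ML/PySpark.
val data = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)
))
val model = new StandardScaler(withMean = true, withStd = true).fit(data)

println(model.mean) // per-feature means: [2.0, 3.0]
println(model.std)  // per-feature standard deviations
{code}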