[jira] (SPARK-19393) Add `approx_percentile` Dataset/DataFrame API
Liwei Lin updated SPARK-19393:
------------------------------
    Summary: Add `approx_percentile` Dataset/DataFrame API (was: Add `approx_percentile` Dataframe API)
[jira] (SPARK-19393) Add `approx_percentile` Dataframe API
Apache Spark reassigned SPARK-19393:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] (SPARK-19393) Add `approx_percentile` Dataframe API
Apache Spark commented on SPARK-19393:
--------------------------------------
User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16731
[jira] (SPARK-19393) Add `approx_percentile` Dataframe API
Apache Spark reassigned SPARK-19393:
------------------------------------
    Assignee: Apache Spark
[jira] (SPARK-19393) Add `approx_percentile` Dataframe API
Liwei Lin created SPARK-19393:
------------------------------
          Issue Type: Improvement
            Assignee: Unassigned
          Components: SQL
             Created: 29/Jan/17 07:46
            Priority: Minor
            Reporter: Liwei Lin
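For context, the aggregate this ticket wants to expose is already reachable from the DataFrame API via a SQL expression string. A minimal Scala sketch of that workaround, assuming the `percentile_approx` SQL function available since Spark 2.1; the data and accuracy argument are illustrative:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().appName("approx-percentile-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1.0, 2.0, 3.0, 4.0, 5.0).toDF("value")

// Today the approximate percentile is only reachable through a SQL
// expression string; the ticket asks for a first-class Dataset/DataFrame
// function so the call below would not need expr().
df.select(expr("percentile_approx(value, 0.5, 100)").as("approx_median")).show()
{code}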
[jira] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect
Apache Spark commented on SPARK-19392:
--------------------------------------
User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/16733
[jira] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect
Apache Spark reassigned SPARK-19392:
------------------------------------
    Assignee: Apache Spark
[jira] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect
Apache Spark reassigned SPARK-19392:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect
Takeshi Yamamuro updated SPARK-19392:
-------------------------------------
Description: In OracleDialect, if you use numeric types in `DataFrameWriter` with the Oracle JDBC driver, the exception below is thrown:
{code}
java.util.NoSuchElementException: key not found: scale
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at scala.collection.AbstractMap.default(Map.scala:59)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
{code}
This ticket comes from https://www.mail-archive.com/user@spark.apache.org/msg61280.html.
[jira] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect
Takeshi Yamamuro updated SPARK-19392:
-------------------------------------
    Component/s: SQL
[jira] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect
Takeshi Yamamuro created SPARK-19392:
-------------------------------------
          Issue Type: Bug
    Affects Versions: 2.1.0
            Assignee: Unassigned
             Created: 29/Jan/17 06:32
            Priority: Minor
            Reporter: Takeshi Yamamuro

In OracleDialect, if you use numeric types in `DataFrameWriter` with the Oracle JDBC driver, the exception below is thrown:
  java.util.NoSuchElementException: key not found: scale
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
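A minimal sketch of the failing write path as described; the Oracle host, service name, credentials, and table name are placeholders, not from the report:

{code}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-scale-repro").master("local[*]").getOrCreate()
import spark.implicits._

// A decimal column exercises the numeric-type mapping in OracleDialect.
val df = Seq(12.34, 56.78).toDF("amount")
  .selectExpr("CAST(amount AS DECIMAL(10,2)) AS amount")

val props = new Properties()
props.setProperty("user", "scott")      // placeholder credentials
props.setProperty("password", "tiger")  // placeholder credentials

// Per the ticket, this write fails with
// java.util.NoSuchElementException: key not found: scale
df.write.jdbc("jdbc:oracle:thin:@//db-host:1521/ORCL", "TEST_TABLE", props)
{code}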
[jira] (SPARK-19368) Very bad performance in BlockMatrix.toIndexedRowMatrix()
Apache Spark reassigned SPARK-19368:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] (SPARK-19368) Very bad performance in BlockMatrix.toIndexedRowMatrix()
Apache Spark reassigned SPARK-19368:
------------------------------------
    Assignee: Apache Spark
[jira] (SPARK-19368) Very bad performance in BlockMatrix.toIndexedRowMatrix()
Apache Spark commented on SPARK-19368:
--------------------------------------
User 'uzadude' has created a pull request for this issue:
https://github.com/apache/spark/pull/16732
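For reference, the conversion in question; a small self-contained Scala sketch, with arbitrary matrix contents and block sizes:

{code}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("block-to-indexed-rows").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// A small sparse matrix assembled from coordinate entries.
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(1, 2, 2.0), MatrixEntry(2, 1, 3.0)))

val blockMatrix = new CoordinateMatrix(entries, 3, 3).toBlockMatrix(2, 2)

// The call whose performance PR 16732 targets.
val rowMatrix = blockMatrix.toIndexedRowMatrix()
rowMatrix.rows.collect().foreach(println)
{code}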
[jira] (SPARK-19323) Upgrade breeze to 0.13
koert kuipers commented on SPARK-19323:
---------------------------------------
I tried to compile Spark with breeze 0.13-RC1 and ran into a breeze regression, reported here: https://github.com/scalanlp/breeze/issues/621
I will create a pull request once I get Spark to compile and pass tests against a new breeze RC.
[jira] [Comment Edited] (SPARK-14709) spark.ml API for linear SVM
[ https://issues.apache.org/jira/browse/SPARK-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15835554#comment-15835554 ]

Felix Cheung edited comment on SPARK-14709 at 1/28/17 11:17 PM:
----------------------------------------------------------------

[~josephkb] should we add SparkR API as one follow up tasks? (I could shepherd that)

was (Author: felixcheung):
[~josephkb] should we add SparR API as one follow up tasks? (I could shepherd that)

> spark.ml API for linear SVM
> ---------------------------
>
>                 Key: SPARK-14709
>                 URL: https://issues.apache.org/jira/browse/SPARK-14709
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: yuhao yang
>             Fix For: 2.2.0
>
> Provide API for SVM algorithm for DataFrames. I would recommend using
> OWL-QN, rather than wrapping spark.mllib's SGD-based implementation.
> The API should mimic existing spark.ml.classification APIs.
[jira] [Commented] (SPARK-19391) Tweedie GLM API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-19391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15844180#comment-15844180 ]

Apache Spark commented on SPARK-19391:
--------------------------------------

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16729

> Tweedie GLM API in SparkR
> -------------------------
>
>                 Key: SPARK-19391
>                 URL: https://issues.apache.org/jira/browse/SPARK-19391
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Wayne Zhang
>
> Port Tweedie GLM to SparkR
> https://github.com/apache/spark/pull/16344
[jira] [Assigned] (SPARK-19391) Tweedie GLM API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-19391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19391:
------------------------------------

    Assignee: (was: Apache Spark)

> Tweedie GLM API in SparkR
> -------------------------
>
>                 Key: SPARK-19391
>                 URL: https://issues.apache.org/jira/browse/SPARK-19391
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Wayne Zhang
>
> Port Tweedie GLM to SparkR
> https://github.com/apache/spark/pull/16344
[jira] [Assigned] (SPARK-19391) Tweedie GLM API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-19391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19391:
------------------------------------

    Assignee: Apache Spark

> Tweedie GLM API in SparkR
> -------------------------
>
>                 Key: SPARK-19391
>                 URL: https://issues.apache.org/jira/browse/SPARK-19391
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Wayne Zhang
>            Assignee: Apache Spark
>
> Port Tweedie GLM to SparkR
> https://github.com/apache/spark/pull/16344
[jira] [Commented] (SPARK-19359) partition path created by Hive should be deleted after rename a partition with upper-case
[ https://issues.apache.org/jira/browse/SPARK-19359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15844173#comment-15844173 ]

Apache Spark commented on SPARK-19359:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16728

> partition path created by Hive should be deleted after rename a partition
> with upper-case
> --------------------------------------------------------------------------
>
>                 Key: SPARK-19359
>                 URL: https://issues.apache.org/jira/browse/SPARK-19359
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Song Jun
>            Assignee: Song Jun
>            Priority: Minor
>             Fix For: 2.2.0
>
> The Hive metastore is not case preserving and keeps partition columns with
> lower-case names.
> If Spark SQL creates a table with an upper-case partition name through
> HiveExternalCatalog, then when we rename a partition it first calls the
> HiveClient's renamePartition, which creates a new lower-case partition path;
> Spark SQL then renames the lower-case path to the upper-case one.
> If the renamed partition contains more than one level of partitioning, e.g.
> A=1/B=2, Hive's renamePartition changes it to a=1/b=2 and Spark SQL renames
> it to A=1/B=2, but the a=1 directory still exists in the filesystem; we
> should also delete it.
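A sketch of the rename sequence the description walks through, with hypothetical table and partition names (the actual repro in PR 16728 may differ):

{code}
import org.apache.spark.sql.SparkSession

// Hive support is required for the metastore rename path.
val spark = SparkSession.builder().appName("partition-rename-sketch")
  .master("local[*]").enableHiveSupport().getOrCreate()

// Upper-case partition columns; the Hive metastore stores them lower-case.
spark.sql("CREATE TABLE t (c INT) PARTITIONED BY (A INT, B INT)")
spark.sql("INSERT INTO t PARTITION (A = 1, B = 2) VALUES (1)")

// Hive's renamePartition first writes a lower-case path; Spark then renames
// it to the upper-case spelling, but an orphan lower-case directory can be
// left behind on the filesystem, which is the leak this ticket fixes.
spark.sql("ALTER TABLE t PARTITION (A = 1, B = 2) RENAME TO PARTITION (A = 10, B = 20)")
{code}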
[jira] [Created] (SPARK-19391) Tweedie GLM API in SparkR
Wayne Zhang created SPARK-19391:
--------------------------------

             Summary: Tweedie GLM API in SparkR
                 Key: SPARK-19391
                 URL: https://issues.apache.org/jira/browse/SPARK-19391
             Project: Spark
          Issue Type: Improvement
          Components: SparkR
            Reporter: Wayne Zhang

Port Tweedie GLM to SparkR
https://github.com/apache/spark/pull/16344
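The SparkR port wraps the Scala-side Tweedie support added in PR 16344 (linked above). A minimal Scala sketch of that underlying API, assuming the `setVariancePower` and `setLinkPower` setters from that PR; the data is illustrative:

{code}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tweedie-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy non-negative responses, the typical Tweedie use case (e.g. claim amounts).
val training = Seq(
  (0.0, Vectors.dense(1.0, 0.0)),
  (1.2, Vectors.dense(0.0, 1.0)),
  (3.5, Vectors.dense(1.0, 2.0))
).toDF("label", "features")

val glr = new GeneralizedLinearRegression()
  .setFamily("tweedie")
  .setVariancePower(1.5) // between 1 (Poisson) and 2 (Gamma): compound Poisson-Gamma
  .setLinkPower(0.0)     // 0.0 corresponds to the log link

val model = glr.fit(training)
println(model.coefficients)
{code}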
[jira] [Commented] (SPARK-11075) Spark SQL Thrift Server authentication issue on kerberized yarn cluster
[ https://issues.apache.org/jira/browse/SPARK-11075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15844070#comment-15844070 ]

Himangshu Borah commented on SPARK-11075:
-----------------------------------------

This issue is not resolved; I found the same in Spark 1.6.2. In a Kerberos environment where the spark-thrift and HiveServer2 processes run as one user ("hive" in my case), any command executed through the thrift server is executed as that user. But we are trying to impersonate another user, "Buser", because only "Buser" has access to the table used in the query.

How I am using it:

beeline> !connect jdbc:hive2://:/default;principal=hive/something@something.com;hive.server2.proxy.user=Buser;

and then executing a SELECT on an existing table. The table location has permissions like:

Buser:hdfs:drwx------ (700, owner only)

Response:

Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table example_table.
org.apache.hadoop.security.AccessControlException: Permission denied: user=hive, access=EXECUTE, inode="/apps/hive/warehouse/some.db/example_table":Buser:hdfs:drwx------
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)

The same query executes fine through the Hive thrift server. The Spark thrift server does not respect the property hive.server2.proxy.user=Buser and tries to execute the query as the user owning the spark-thrift process.

> Spark SQL Thrift Server authentication issue on kerberized yarn cluster
> ------------------------------------------------------------------------
>
>                 Key: SPARK-11075
>                 URL: https://issues.apache.org/jira/browse/SPARK-11075
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.0, 1.5.1
>         Environment: hive-1.2.1
>                      hadoop-2.6.0 config kerbers
>            Reporter: Xiaoyu Wang
>
> Use proxy user connect to the thrift server by beeline but got permission
> exception:
> 1. Start the hive 1.2.1 metastore with user hive
> {code}
> $kinit -kt /tmp/hive.keytab hive/xxx
> $nohup ./hive --service metastore 2>&1 >> ../logs/metastore.log &
> {code}
> 2. Start the spark thrift server with user hive
> {code}
> $kinit -kt /tmp/hive.keytab hive/xxx
> $./start-thriftserver.sh --master yarn
> {code}
> 3. Connect to the thrift server with proxy user hive01
> {code}
> $kinit hive01
> beeline command: !connect jdbc:hive2://xxx:1/default;principal=hive/x...@hadoop.com;kerberosAuthType=kerberos;hive.server2.proxy.user=hive01
> {code}
> 4. Create table and insert data
> {code}
> create table test(name string);
> insert overwrite table test select * from sometable;
> {code}
> The insert sql got exception:
> {noformat}
> Error: org.apache.hadoop.security.AccessControlException: Permission denied:
> user=hive01, access=WRITE,
> inode="/user/hive/warehouse/test/.hive-staging_hive_2015-10-10_09-17-15_972_3267668540808140587-2/-ext-1/_temporary/0/task_201510100917_0003_m_00":hive:hadoop:drwxr-xr-x
>     at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>     at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>     at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238)
>     at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:182)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6512)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInternal(FSNamesystem.java:3805)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameToInt(FSNamesystem.java:3775)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.renameTo(FSNamesystem.java:3739)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rename(NameNodeRpcServer.java:754)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.rename(ClientNamenodeProtocolServerSideTranslatorPB.java:565)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>     at
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15844068#comment-15844068 ]

Himangshu Borah commented on SPARK-5159:
----------------------------------------

This issue is not resolved; I found the same in Spark 1.6.2. In a Kerberos environment where the spark-thrift and HiveServer2 processes run as one user ("hive" in my case), any command executed through the thrift server is executed as that user. But we are trying to impersonate another user, "Buser", because only "Buser" has access to the table used in the query.

How I am using it:

beeline> !connect jdbc:hive2://:/default;principal=hive/something@something.com;hive.server2.proxy.user=Buser;

and then executing a SELECT on an existing table. The table location has permissions like:

Buser:hdfs:drwx------ (700, owner only)

Response:

Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table example_table.
org.apache.hadoop.security.AccessControlException: Permission denied: user=hive, access=EXECUTE, inode="/apps/hive/warehouse/some.db/example_table":Buser:hdfs:drwx------
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)

The same query executes fine through the Hive thrift server. The Spark thrift server does not respect the property hive.server2.proxy.user=Buser and tries to execute the query as the user owning the spark-thrift process.

> Thrift server does not respect hive.server2.enable.doAs=true
> -------------------------------------------------------------
>
>                 Key: SPARK-5159
>                 URL: https://issues.apache.org/jira/browse/SPARK-5159
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Andrew Ray
>         Attachments: spark_thrift_server_log.txt
>
> I'm currently testing the spark sql thrift server on a kerberos secured
> cluster in YARN mode. Currently any user can access any table regardless of
> HDFS permissions as all data is read as the hive user. In HiveServer2 the
> property hive.server2.enable.doAs=true causes all access to be done as the
> submitting user. We should do the same.
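The same connection can be exercised programmatically with the standard Hive JDBC driver; in this Scala sketch the host, principal, and user names are placeholders:

{code}
import java.sql.DriverManager

// The proxy-user session parameter that HiveServer2 honors and that the
// Spark thrift server, per the reports above, ignores.
val url = "jdbc:hive2://thrift-host:10000/default;" +
  "principal=hive/thrift-host@EXAMPLE.COM;" +
  "hive.server2.proxy.user=Buser"

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(url)
try {
  val rs = conn.createStatement().executeQuery("SELECT count(*) FROM example_table")
  while (rs.next()) println(rs.getLong(1))
} finally {
  conn.close()
}
{code}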
[jira] [Resolved] (SPARK-18781) Allow MatrixFactorizationModel.predict to skip user/product approximation count
[ https://issues.apache.org/jira/browse/SPARK-18781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-18781.
-------------------------------
    Resolution: Won't Fix

> Allow MatrixFactorizationModel.predict to skip user/product approximation
> count
> --------------------------------------------------------------------------
>
>                 Key: SPARK-18781
>                 URL: https://issues.apache.org/jira/browse/SPARK-18781
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Eyal Allweil
>            Priority: Minor
>
> When [MatrixFactorizationModel.predict|https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.html#predict(org.apache.spark.rdd.RDD)]
> is used, it first calculates an approximate count of the users and products
> in order to determine the most efficient way to proceed. In many cases the
> answer to this question is fixed (typically there are more users than
> products by an order of magnitude), so this check is unnecessary. Adding a
> parameter to this predict method to allow choosing the implementation (and
> skipping the check) would be nice.
> It would be especially nice in development cycles, when you are repeatedly
> tweaking your model and the pairs you're predicting for, and this
> approximate count represents a meaningful portion of the time you wait for
> results.
> I can provide a pull request with this ability added that preserves the
> existing behavior.
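For context, the bulk `predict` over (user, product) pairs is the call that performs the approximate distinct counts internally. A small Scala sketch of the existing API (the proposed skip flag was never added; data and ALS parameters are illustrative):

{code}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("als-predict-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))
val model = ALS.train(ratings, 2, 5) // rank = 2, iterations = 5

// predict(RDD[(Int, Int)]) internally approximates the distinct user and
// product counts to pick a join strategy; that is the step the ticket
// wanted a flag to bypass when the answer is known in advance.
val predictions = model.predict(sc.parallelize(Seq((1, 2), (2, 2))))
predictions.collect().foreach(println)
{code}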
[jira] [Resolved] (SPARK-14623) add label binarizer
[ https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-14623.
-------------------------------
    Resolution: Won't Fix

> add label binarizer
> -------------------
>
>                 Key: SPARK-14623
>                 URL: https://issues.apache.org/jira/browse/SPARK-14623
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: hujiayin
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It relates to https://issues.apache.org/jira/browse/SPARK-7445
> Map the labels to 0/1.
> For example,
> Input: "yellow,green,red,green,0"
> The labels: "0, green, red, yellow"
> Output:
> 0, 0, 0, 1
> 0, 1, 0, 0
> 0, 0, 1, 0
> 0, 1, 0, 0
> 1, 0, 0, 0
> Refer to
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html
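The requested mapping is already expressible with existing spark.ml stages; a sketch combining StringIndexer and OneHotEncoder (column names are illustrative):

{code}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("label-binarizer-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("yellow", "green", "red", "green", "0").toDF("label")

// Index each label, then one-hot encode the index; keeping the last
// category yields a full 0/1 indicator row per label, as in sklearn's
// LabelBinarizer.
val indexed = new StringIndexer()
  .setInputCol("label").setOutputCol("labelIndex")
  .fit(df).transform(df)

val encoded = new OneHotEncoder()
  .setInputCol("labelIndex").setOutputCol("labelVector")
  .setDropLast(false)
  .transform(indexed)

encoded.show(false)
{code}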
[jira] [Updated] (SPARK-19384) forget unpersist input dataset in IsotonicRegression
[ https://issues.apache.org/jira/browse/SPARK-19384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-19384:
------------------------------
    Assignee: zhengruifeng

> forget unpersist input dataset in IsotonicRegression
> -----------------------------------------------------
>
>                 Key: SPARK-19384
>                 URL: https://issues.apache.org/jira/browse/SPARK-19384
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Trivial
>             Fix For: 2.2.0
>
> forget unpersist input dataset in IsotonicRegression
[jira] [Resolved] (SPARK-19384) forget unpersist input dataset in IsotonicRegression
[ https://issues.apache.org/jira/browse/SPARK-19384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-19384.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 16718
[https://github.com/apache/spark/pull/16718]

> forget unpersist input dataset in IsotonicRegression
> -----------------------------------------------------
>
>                 Key: SPARK-19384
>                 URL: https://issues.apache.org/jira/browse/SPARK-19384
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: zhengruifeng
>            Priority: Trivial
>             Fix For: 2.2.0
>
> forget unpersist input dataset in IsotonicRegression
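The fix follows the persist-then-unpersist pattern common to spark.ml estimators. A generic Scala sketch of that pattern; the helper name is hypothetical, not Spark API:

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper illustrating the pattern PR 16718 restores in
// IsotonicRegression: persist the extracted input only if the caller
// has not, and always release it after training.
def withCaching[T, R](input: RDD[T])(train: RDD[T] => R): R = {
  val handlePersistence = input.getStorageLevel == StorageLevel.NONE
  if (handlePersistence) input.persist(StorageLevel.MEMORY_AND_DISK)
  try train(input)
  finally if (handlePersistence) input.unpersist()
}
{code}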
[jira] [Updated] (SPARK-19364) Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are enabled and an exception is thrown
[ https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liwei Lin updated SPARK-19364:
------------------------------
    Component/s: DStreams
                 (was: Spark Core)

> Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are
> enabled and an exception is thrown
> ------------------------------------------------------------------------
>
>                 Key: SPARK-19364
>                 URL: https://issues.apache.org/jira/browse/SPARK-19364
>             Project: Spark
>          Issue Type: Bug
>          Components: DStreams
>    Affects Versions: 2.0.2
>         Environment: ubuntu unix
>                      spark 2.0.2
>                      application is java
>            Reporter: Andrew Milkowski
>            Priority: Blocker
>
> --- update --- We found that the situation below occurs when we encounter
> "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException:
> Can't update checkpoint - instance doesn't hold the lease for this shard"
> https://github.com/awslabs/amazon-kinesis-client/issues/108
> We use an s3 directory (and dynamodb) to store checkpoints, but when this
> occurs, blocks should not get stuck; they should continue to be evicted
> gracefully from memory. Obviously the kinesis library race condition is a
> problem unto itself...
> --- exception leading to a block not being freed up ---
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/mnt/yarn/usercache/hadoop/filecache/24/__spark_libs__7928020266533182031.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 17/01/26 13:52:00 ERROR KinesisRecordProcessor: ShutdownException: Caught shutdown exception, skipping checkpoint.
> com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException:
> Can't update checkpoint - instance doesn't hold the lease for this shard
>     at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
>     at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
>     at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
>     at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>     at scala.util.Try$.apply(Try.scala:192)
>     at org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:144)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
>     at scala.Option.foreach(Option.scala:257)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
>     at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
>     at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
>     at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
>     at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
> Running standard kinesis stream ingestion with a java spark app and creating
> a dstream: after running for some time, some stream blocks seem to persist
> forever and are never cleaned up, which eventually leads to memory depletion
> on workers.
> We even tried cleaning RDDs with the following:
> cleaner = ssc.sparkContext().sc().cleaner().get();
> filtered.foreachRDD(new VoidFunction() {
>     @Override
>     public void call(JavaRDD rdd) throws Exception {
[jira] [Commented] (SPARK-14098) Generate code that get a float/double value in each column from CachedBatch when DataFrame.cache() is called
[ https://issues.apache.org/jira/browse/SPARK-14098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843996#comment-15843996 ]

Shuai Lin commented on SPARK-14098:
-----------------------------------

[~kiszk] It seems the title/description of this ticket is not on par with what
is done in https://github.com/apache/spark/pull/15219 . Should we update the
title/description here?

> Generate code that get a float/double value in each column from CachedBatch
> when DataFrame.cache() is called
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-14098
>                 URL: https://issues.apache.org/jira/browse/SPARK-14098
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Kazuaki Ishizaki
>
> When DataFrame.cache() is called, data is stored as column-oriented storage
> in CachedBatch. The current Catalyst generates a Java program to get a value
> of a column from an InternalRow that is translated from CachedBatch. This
> issue generates Java code to get a value of a column directly from
> CachedBatch. While a column for a cache may be compressed, this issue handles
> float and double types, which are never compressed.
> Other primitive types, whose columns may be compressed, will be addressed in
> another entry.
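For orientation, the code path under discussion is exercised by any scan over a cached DataFrame; a small Scala sketch with arbitrary column names and sizes:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cached-batch-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// cache() stores rows column-oriented in CachedBatch; the ticket (and
// PR 15219) concerns the generated code that reads values back out.
val df = spark.range(0, 1000).selectExpr("CAST(id AS DOUBLE) AS d")
df.cache()
df.count()                              // materializes the in-memory columnar cache
println(df.filter($"d" > 500).count())  // this scan reads from CachedBatch
{code}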
[jira] [Commented] (SPARK-19336) LinearSVC Python API
[ https://issues.apache.org/jira/browse/SPARK-19336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843983#comment-15843983 ]

Apache Spark commented on SPARK-19336:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/16727

> LinearSVC Python API
> --------------------
>
>                 Key: SPARK-19336
>                 URL: https://issues.apache.org/jira/browse/SPARK-19336
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>            Reporter: Joseph K. Bradley
>            Assignee: Miao Wang
>             Fix For: 2.2.0
>
> Create a Python wrapper for spark.ml.classification.LinearSVC
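The wrapper targets the Scala estimator tracked under SPARK-14709 earlier in this digest. A minimal Scala sketch of that estimator; data and hyperparameters are illustrative:

{code}
import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("linear-svc-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1)),
  (0.0, Vectors.dense(2.0, 1.0)),
  (1.0, Vectors.dense(0.1, 1.2)),
  (0.0, Vectors.dense(2.1, 0.9))
).toDF("label", "features")

// spark.ml.classification.LinearSVC: the class the Python API wraps.
val svc = new LinearSVC().setMaxIter(10).setRegParam(0.1)
val model = svc.fit(training)
println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")
{code}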