[jira] [Resolved] (SPARK-9081) fillna/dropna should also fill/drop NaN values in addition to null values
[ https://issues.apache.org/jira/browse/SPARK-9081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-9081.
    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 7523 [https://github.com/apache/spark/pull/7523]

fillna/dropna should also fill/drop NaN values in addition to null values
    Key: SPARK-9081
    URL: https://issues.apache.org/jira/browse/SPARK-9081
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin
    Priority: Blocker
    Fix For: 1.5.0
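A minimal sketch of the behavior this change targets, using the {{df.na}} (DataFrameNaFunctions) entry point; the column names are illustrative:

{code}
// after SPARK-9081, na.fill/na.drop treat NaN like null for double columns
val df = sqlContext.createDataFrame(Seq(
  ("a", Double.NaN), ("b", 1.0))).toDF("key", "value")

df.na.fill(0.0).show()  // NaN in "value" becomes 0.0
df.na.drop().show()     // the row containing NaN is dropped
{code}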
[jira] [Updated] (SPARK-8915) Add @since tags to mllib.classification
[ https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-8915:
    Assignee: Xiangrui Meng

Add @since tags to mllib.classification
    Key: SPARK-8915
    URL: https://issues.apache.org/jira/browse/SPARK-8915
    Project: Spark
    Issue Type: Sub-task
    Components: Documentation, MLlib
    Reporter: Xiangrui Meng
    Assignee: Xiangrui Meng
    Priority: Minor
    Labels: starter
    Fix For: 1.5.0
    Original Estimate: 1h
    Remaining Estimate: 1h
[jira] [Commented] (SPARK-8641) Native Spark Window Functions
[ https://issues.apache.org/jira/browse/SPARK-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635205#comment-14635205 ]

Herman van Hovell commented on SPARK-8641:

We need to wait for the new UDAF interface to stabilize. Special attention needs to be paid to the following aspects:
* Hive UDAFs
* Differences in processing an AlgebraicAggregate, an AggregateFunction2, and (potentially) an AggregateFunction
* Common aggregate processing functionality.

Native Spark Window Functions
    Key: SPARK-8641
    URL: https://issues.apache.org/jira/browse/SPARK-8641
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Affects Versions: 1.5.0
    Reporter: Herman van Hovell

The current Window implementation uses Hive UDAFs for all aggregation operations. In this ticket we will move this functionality to native Spark expressions. The rationale is that although Hive UDAFs are very well written, they remain opaque in processing and memory management, which makes them hard to optimize. This ticket and its PR will build on the work being done in SPARK-4366.
[jira] [Updated] (SPARK-9019) spark-submit fails on yarn with kerberos enabled
[ https://issues.apache.org/jira/browse/SPARK-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bolke de Bruin updated SPARK-9019:
    Attachment: debug-log-spark-1.5-fail
                spark-submit-log-1.5.0-fail

spark-submit fails on yarn with kerberos enabled
    Key: SPARK-9019
    URL: https://issues.apache.org/jira/browse/SPARK-9019
    Project: Spark
    Issue Type: Bug
    Components: Spark Submit
    Affects Versions: 1.5.0
    Environment: Hadoop 2.6 with YARN and kerberos enabled
    Reporter: Bolke de Bruin
    Labels: kerberos, spark-submit, yarn
    Attachments: debug-log-spark-1.5-fail, spark-submit-log-1.5.0-fail

It is not possible to run jobs using spark-submit on yarn with a kerberized cluster. Command line:

{noformat}
/usr/hdp/2.2.0.0-2041/spark-1.5.0/bin/spark-submit --principal sparkjob --keytab sparkjob.keytab --num-executors 3 --executor-cores 5 --executor-memory 5G --master yarn-cluster /tmp/get_peers.py
{noformat}

Fails with:

{noformat}
15/07/13 22:48:31 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/13 22:48:31 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:58380
15/07/13 22:48:31 INFO util.Utils: Successfully started service 'SparkUI' on port 58380.
15/07/13 22:48:31 INFO ui.SparkUI: Started SparkUI at http://10.111.114.9:58380
15/07/13 22:48:31 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
15/07/13 22:48:31 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/07/13 22:48:32 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43470.
15/07/13 22:48:32 INFO netty.NettyBlockTransferService: Server created on 43470
15/07/13 22:48:32 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/07/13 22:48:32 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.111.114.9:43470 with 265.1 MB RAM, BlockManagerId(driver, 10.111.114.9, 43470)
15/07/13 22:48:32 INFO storage.BlockManagerMaster: Registered BlockManager
15/07/13 22:48:32 INFO impl.TimelineClientImpl: Timeline service address: http://lxhnl002.ad.ing.net:8188/ws/v1/timeline/
15/07/13 22:48:33 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
15/07/13 22:48:33 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
15/07/13 22:48:33 INFO retry.RetryInvocationHandler: Exception while invoking getClusterNodes of class ApplicationClientProtocolPBClientImpl over rm2 after 1 fail over attempts. Trying to fail over after sleeping for 32582ms.
java.net.ConnectException: Call From lxhnl006.ad.ing.net/10.111.114.9 to lxhnl013.ad.ing.net:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
	at org.apache.hadoop.ipc.Client.call(Client.java:1472)
	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
	at com.sun.proxy.$Proxy24.getClusterNodes(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:262)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy25.getClusterNodes(Unknown Source)
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getNodeReports(YarnClientImpl.java:475)
	at org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend$$anonfun$getDriverLogUrls$1.apply(YarnClusterSchedulerBackend.scala:92)
	at
{noformat}
[jira] [Resolved] (SPARK-9193) Avoid assigning tasks to executors under killing
[ https://issues.apache.org/jira/browse/SPARK-9193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Imran Rashid resolved SPARK-9193.
    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 7528 [https://github.com/apache/spark/pull/7528]

Avoid assigning tasks to executors under killing
    Key: SPARK-9193
    URL: https://issues.apache.org/jira/browse/SPARK-9193
    Project: Spark
    Issue Type: Bug
    Components: Scheduler
    Affects Versions: 1.4.0, 1.4.1
    Reporter: Jie Huang
    Assignee: Jie Huang
    Fix For: 1.5.0

Currently, when executors are killed by dynamic allocation, tasks are sometimes mis-assigned to those lost executors. Such mis-assignment causes task failures, or even job failure if the same task fails 4 times. The root cause is that killExecutors does not remove the executors under killing right away; it relies on a later OnDisassociated event to refresh the active worker list, and the delay depends on cluster status (from several milliseconds to sub-minute). Tasks scheduled during that window can be assigned to executors that are still listed as active but are being killed, and those tasks then fail due to executor loss. The better approach is to exclude executors under killing in makeOffers(), so no tasks are allocated to executors about to be lost.
[jira] [Commented] (SPARK-9220) Streaming K-means implementation exception while processing windowed stream
[ https://issues.apache.org/jira/browse/SPARK-9220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635240#comment-14635240 ]

Iaroslav Zeigerman commented on SPARK-9220:

Looks like the issue reproduces only when the training and test data streams are linked to the same directory. Can someone confirm whether this causes the issue?

Streaming K-means implementation exception while processing windowed stream
    Key: SPARK-9220
    URL: https://issues.apache.org/jira/browse/SPARK-9220
    Project: Spark
    Issue Type: Bug
    Components: MLlib, Streaming
    Affects Versions: 1.4.1
    Reporter: Iaroslav Zeigerman

Spark throws an exception when the Streaming K-means algorithm trains on a windowed stream. The stream looks like the following: {{val trainingSet = ssc.textFileStream(TrainingDataSet).window(Seconds(30))...}} The exception occurs when there is no new data in the stream:

{noformat}
15/07/21 17:36:08 ERROR JobScheduler: Error running job streaming job 1437489368000 ms.0
java.lang.ArrayIndexOutOfBoundsException: 13
	at org.apache.spark.mllib.clustering.StreamingKMeansModel$$anonfun$update$1.apply(StreamingKMeans.scala:105)
	at org.apache.spark.mllib.clustering.StreamingKMeansModel$$anonfun$update$1.apply(StreamingKMeans.scala:102)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at org.apache.spark.mllib.clustering.StreamingKMeansModel.update(StreamingKMeans.scala:102)
	at org.apache.spark.mllib.clustering.StreamingKMeans$$anonfun$trainOn$1.apply(StreamingKMeans.scala:235)
	at org.apache.spark.mllib.clustering.StreamingKMeans$$anonfun$trainOn$1.apply(StreamingKMeans.scala:234)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
	at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
	at scala.util.Try$.apply(Try.scala:161)
	at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

When new data arrives, the algorithm works as expected.
[jira] [Resolved] (SPARK-9168) Add nanvl expression
[ https://issues.apache.org/jira/browse/SPARK-9168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-9168.
    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 7523 [https://github.com/apache/spark/pull/7523]

Add nanvl expression
    Key: SPARK-9168
    URL: https://issues.apache.org/jira/browse/SPARK-9168
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Yijie Shen
    Fix For: 1.5.0

Similar to Oracle's nanvl: nanvl(v1, v2) — if v1 is NaN, returns v2; otherwise, returns v1.
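A quick illustration of the semantics (a minimal sketch, assuming the expression is also exposed as {{nanvl}} in {{org.apache.spark.sql.functions}}):

{code}
import org.apache.spark.sql.functions._

val df = sqlContext.createDataFrame(Seq((Double.NaN, 1.0), (2.0, 3.0))).toDF("a", "b")
// nanvl(a, b): returns b where a is NaN, otherwise a
df.select(nanvl(col("a"), col("b"))).show()  // rows: 1.0, then 2.0
{code}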
[jira] [Created] (SPARK-9221) Support IntervalType in Range Frame
Herman van Hovell created SPARK-9221:
    Summary: Support IntervalType in Range Frame
    Key: SPARK-9221
    URL: https://issues.apache.org/jira/browse/SPARK-9221
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Affects Versions: 1.4.0
    Reporter: Herman van Hovell

Support the IntervalType in window range frames, as mentioned in the conclusion of the Databricks blog [post|https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html] on window functions. This actually requires us to support Literals instead of Integer constants in range frames. The following things will have to be modified:
* org.apache.spark.sql.hive.HiveQl
* org.apache.spark.sql.catalyst.expressions.SpecifiedWindowFrame
* org.apache.spark.sql.execution.Window
* org.apache.spark.sql.expressions.Window
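A hypothetical query this feature would enable; the syntax is illustrative only, since the exact grammar depends on the HiveQl changes above, and the {{events}} table is a placeholder:

{code}
// hypothetical once IntervalType literals are allowed in range frames
sqlContext.sql("""
  SELECT key, ts, value,
         avg(value) OVER (PARTITION BY key ORDER BY ts
                          RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND CURRENT ROW)
  FROM events
""")
{code}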
[jira] [Updated] (SPARK-8915) Add @since tags to mllib.classification
[ https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-8915:
    Assignee: Patrick Baier (was: Xiangrui Meng)

Add @since tags to mllib.classification
    Key: SPARK-8915
    URL: https://issues.apache.org/jira/browse/SPARK-8915
    Project: Spark
    Issue Type: Sub-task
    Components: Documentation, MLlib
    Reporter: Xiangrui Meng
    Assignee: Patrick Baier
    Priority: Minor
    Labels: starter
    Fix For: 1.5.0
    Original Estimate: 1h
    Remaining Estimate: 1h
[jira] [Updated] (SPARK-8915) Add @since tags to mllib.classification
[ https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-8915:
    Shepherd: DB Tsai

Add @since tags to mllib.classification
    Key: SPARK-8915
    URL: https://issues.apache.org/jira/browse/SPARK-8915
    Project: Spark
    Issue Type: Sub-task
    Components: Documentation, MLlib
    Reporter: Xiangrui Meng
    Assignee: Patrick Baier
    Priority: Minor
    Labels: starter
    Fix For: 1.5.0
    Original Estimate: 1h
    Remaining Estimate: 1h
[jira] [Updated] (SPARK-8922) Add @since tags to mllib.evaluation
[ https://issues.apache.org/jira/browse/SPARK-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-8922:
    Shepherd: Shuo Xiang

Add @since tags to mllib.evaluation
    Key: SPARK-8922
    URL: https://issues.apache.org/jira/browse/SPARK-8922
    Project: Spark
    Issue Type: Sub-task
    Components: Documentation, MLlib
    Reporter: Xiangrui Meng
    Priority: Minor
    Labels: starter
    Original Estimate: 1h
    Remaining Estimate: 1h
[jira] [Commented] (SPARK-9121) Get rid of the warnings about `no visible global function definition` in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635300#comment-14635300 ]

Shivaram Venkataraman commented on SPARK-9121:

Yeah, we can add `install-dev.sh` in Jenkins before dev/lint-r. One unfortunate thing is that we typically do a lint check before we run the rest of the Jenkins tests (build, unit tests, etc.), so it would be good not to end up with the order reversed, I guess.

Get rid of the warnings about `no visible global function definition` in SparkR
    Key: SPARK-9121
    URL: https://issues.apache.org/jira/browse/SPARK-9121
    Project: Spark
    Issue Type: Sub-task
    Components: SparkR
    Reporter: Yu Ishikawa

We have a lot of warnings about {{no visible global function definition}} in SparkR, so we should get rid of them.

{noformat}
R/utils.R:513:5: warning: no visible global function definition for ‘processClosure’
    processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv)
    ^~
{noformat}
[jira] [Updated] (SPARK-9210) checkValidAggregateExpression() throws exceptions with bad error messages
[ https://issues.apache.org/jira/browse/SPARK-9210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simeon Simeonov updated SPARK-9210:
    Description: When a result column in {{SELECT ... GROUP BY}} is neither one of the {{GROUP BY}} expressions nor uses an aggregation function, {{org.apache.spark.sql.catalyst.analysis.CheckAnalysis}} throws {{org.apache.spark.sql.AnalysisException}} with the message: expression '_column expression_' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() if you don't care which value you get. The remedy suggestion in the exception message is incorrect: the function name is {{first_value}}, not {{first}}.

    was: When a result column in {{SELECT ... GROUP BY}} is neither one of the {{GROUP BY}} expressions nor uses an aggregation function, {{org.apache.spark.sql.catalyst.analysis.CheckAnalysis}} throws {{org.apache.spark.sql.AnalysisException}} with the message: expression '_column expression_' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() if you don't care which value you get. The remedy suggestion in the exception message incorrect: the function name is {{first_value}}, not {{first}}.

checkValidAggregateExpression() throws exceptions with bad error messages
    Key: SPARK-9210
    URL: https://issues.apache.org/jira/browse/SPARK-9210
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.4.1
    Environment: N/A
    Reporter: Simeon Simeonov
    Priority: Trivial

When a result column in {{SELECT ... GROUP BY}} is neither one of the {{GROUP BY}} expressions nor uses an aggregation function, {{org.apache.spark.sql.catalyst.analysis.CheckAnalysis}} throws {{org.apache.spark.sql.AnalysisException}} with the message: expression '_column expression_' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() if you don't care which value you get. The remedy suggestion in the exception message is incorrect: the function name is {{first_value}}, not {{first}}.
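For context, a minimal query that exercises this error path (the table and column names are illustrative):

{code}
// "value" is neither in the GROUP BY nor wrapped in an aggregate function,
// so CheckAnalysis raises the AnalysisException quoted above
sqlContext.sql("SELECT key, value FROM src GROUP BY key")
{code}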
[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column
[ https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636180#comment-14636180 ]

Joseph Batchik commented on SPARK-8668:

Does this look like what you were thinking? https://github.com/JDrit/spark/commit/7fcf18a11427709d403418da8d444b434c63

expr function to convert SQL expression into a Column
    Key: SPARK-8668
    URL: https://issues.apache.org/jira/browse/SPARK-8668
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin

selectExpr uses the expression parser to parse string expressions. It would be great to create an expr function in functions.scala/functions.py that converts a string into an expression (or a list of expressions separated by commas).
[jira] [Created] (SPARK-9241) Supporting multiple DISTINCT columns
Yin Huai created SPARK-9241:
    Summary: Supporting multiple DISTINCT columns
    Key: SPARK-9241
    URL: https://issues.apache.org/jira/browse/SPARK-9241
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Yin Huai
    Priority: Critical

Right now the new aggregation code path only supports a single distinct column (you can use it in multiple aggregate functions in the query). We need to support multiple distinct columns by generating a different plan for handling multiple distinct columns (without changing the aggregate functions).
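To make the limitation concrete (the table and column names are illustrative):

{code}
// supported today: one distinct column, even across multiple aggregates
sqlContext.sql("SELECT count(DISTINCT a), sum(DISTINCT a) FROM t")
// needs this ticket: DISTINCT over different columns in one query
sqlContext.sql("SELECT count(DISTINCT a), count(DISTINCT b) FROM t")
{code}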
[jira] [Created] (SPARK-9242) Audit both built-in aggregate function and UDAF interface before 1.5.0 release
Yin Huai created SPARK-9242:
    Summary: Audit both built-in aggregate function and UDAF interface before 1.5.0 release
    Key: SPARK-9242
    URL: https://issues.apache.org/jira/browse/SPARK-9242
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Yin Huai
    Priority: Blocker
[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column
[ https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636324#comment-14636324 ]

Reynold Xin commented on SPARK-8668:

Yes — the only thing is that you cannot split blindly by comma, since commas can appear inside quotes. I think it is OK for the first cut not to support lists of expressions separated by commas.

expr function to convert SQL expression into a Column
    Key: SPARK-8668
    URL: https://issues.apache.org/jira/browse/SPARK-8668
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin

selectExpr uses the expression parser to parse string expressions. It would be great to create an expr function in functions.scala/functions.py that converts a string into an expression (or a list of expressions separated by commas).
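A sketch of the proposed function, assuming it lands as {{expr}} in functions.scala as the ticket suggests (the name comes from the proposal, not a shipped API):

{code}
import org.apache.spark.sql.functions.expr

// equivalent to df.selectExpr("abs(value)") for a single expression
df.select(expr("abs(value)"))
{code}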
[jira] [Created] (SPARK-9237) Added Top N Column Values for DataFrames
Ted Malaska created SPARK-9237:
    Summary: Added Top N Column Values for DataFrames
    Key: SPARK-9237
    URL: https://issues.apache.org/jira/browse/SPARK-9237
    Project: Spark
    Issue Type: Improvement
    Reporter: Ted Malaska
    Priority: Minor

This JIRA is to add a very common data quality check to DataFrames. A quick outline of this functionality can be seen in the following blog post: http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

There are two parts to this JIRA:
1. How to implement the Top N count. I will start with the implementation from the blog post, as in the sketch below.
2. Where to add the function: either straight off DataFrame, in DataFrame describe, or in DataFrameStatFunctions. I will start by putting it into DataFrameStatFunctions.

Please let me know if you have any input. Thanks
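For concreteness, a minimal sketch of a top-N count for a single column against the DataFrame API; the helper name is hypothetical, not the final API:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// hypothetical helper: the N most frequent values of one column
def topNValues(df: DataFrame, column: String, n: Int): DataFrame =
  df.groupBy(column).count().orderBy(desc("count")).limit(n)
{code}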
[jira] [Assigned] (SPARK-3056) Sort-based Aggregation
[ https://issues.apache.org/jira/browse/SPARK-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-3056:
    Assignee: Apache Spark

Sort-based Aggregation
    Key: SPARK-3056
    URL: https://issues.apache.org/jira/browse/SPARK-3056
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Cheng Hao
    Assignee: Apache Spark

Currently, SparkSQL only supports hash-based aggregation, which may cause OOM if there are too many identical keys in the input tuples.
[jira] [Commented] (SPARK-3947) Support UDAF
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636199#comment-14636199 ]

Apache Spark commented on SPARK-3947:

User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/7458

Support UDAF
    Key: SPARK-3947
    URL: https://issues.apache.org/jira/browse/SPARK-3947
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Pei-Lun Lee
    Assignee: Yin Huai

Right now only Hive UDAFs are supported. It would be nice to have UDAFs, similar to UDFs, registered through SQLContext.registerFunction.
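For readers tracking the PR: a hedged sketch of what a registerable UDAF might look like under the interface being developed there (method and class names could still change before release):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// a simple "sum of doubles" aggregate
class DoubleSum extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
  def evaluate(buffer: Row): Double = buffer.getDouble(0)
}

// registration, per the ticket's request
sqlContext.udf.register("doubleSum", new DoubleSum)
{code}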
[jira] [Commented] (SPARK-3056) Sort-based Aggregation
[ https://issues.apache.org/jira/browse/SPARK-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636200#comment-14636200 ]

Apache Spark commented on SPARK-3056:

User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/7458

Sort-based Aggregation
    Key: SPARK-3056
    URL: https://issues.apache.org/jira/browse/SPARK-3056
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Cheng Hao

Currently, SparkSQL only supports hash-based aggregation, which may cause OOM if there are too many identical keys in the input tuples.
[jira] [Commented] (SPARK-4367) Partial aggregation support the DISTINCT aggregation
[ https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636198#comment-14636198 ]

Apache Spark commented on SPARK-4367:

User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/7458

Partial aggregation support the DISTINCT aggregation
    Key: SPARK-4367
    URL: https://issues.apache.org/jira/browse/SPARK-4367
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Cheng Hao

Most aggregate functions (e.g. average) over distinct values require all of the records in the same group to be shuffled to a single node. However, as part of the optimization, those records can be partially aggregated before shuffling, which probably reduces the shuffle overhead significantly.
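The kind of query this optimization targets (the table name is illustrative):

{code}
// without partial aggregation, all rows of each group are shuffled to one
// node just to deduplicate "value" before averaging
sqlContext.sql("SELECT key, avg(DISTINCT value) FROM t GROUP BY key")
{code}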
[jira] [Assigned] (SPARK-4367) Partial aggregation support the DISTINCT aggregation
[ https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-4367:
    Assignee: Apache Spark

Partial aggregation support the DISTINCT aggregation
    Key: SPARK-4367
    URL: https://issues.apache.org/jira/browse/SPARK-4367
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Cheng Hao
    Assignee: Apache Spark

Most aggregate functions (e.g. average) over distinct values require all of the records in the same group to be shuffled to a single node. However, as part of the optimization, those records can be partially aggregated before shuffling, which probably reduces the shuffle overhead significantly.
[jira] [Commented] (SPARK-4233) Simplify the Aggregation Function implementation
[ https://issues.apache.org/jira/browse/SPARK-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636197#comment-14636197 ]

Apache Spark commented on SPARK-4233:

User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/7458

Simplify the Aggregation Function implementation
    Key: SPARK-4233
    URL: https://issues.apache.org/jira/browse/SPARK-4233
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Cheng Hao

Currently, the UDAF implementation is quite complicated, and we have to provide both distinct and non-distinct versions.
[jira] [Updated] (SPARK-9243) Update crosstab doc for pairs that have no occurrences
[ https://issues.apache.org/jira/browse/SPARK-9243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-9243:
    Component/s: Documentation

Update crosstab doc for pairs that have no occurrences
    Key: SPARK-9243
    URL: https://issues.apache.org/jira/browse/SPARK-9243
    Project: Spark
    Issue Type: Improvement
    Components: Documentation, PySpark, SparkR, SQL
    Affects Versions: 1.5.0
    Reporter: Xiangrui Meng

The crosstab value for pairs that have no occurrences was changed from null to 0 in SPARK-7982. We should update the doc in Scala, Python, and SparkR.
[jira] [Created] (SPARK-9243) Update crosstab doc for pairs that have no occurrences
Xiangrui Meng created SPARK-9243:
    Summary: Update crosstab doc for pairs that have no occurrences
    Key: SPARK-9243
    URL: https://issues.apache.org/jira/browse/SPARK-9243
    Project: Spark
    Issue Type: Improvement
    Components: PySpark, SparkR, SQL
    Affects Versions: 1.5.0
    Reporter: Xiangrui Meng

The crosstab value for pairs that have no occurrences was changed from null to 0 in SPARK-7982. We should update the doc in Scala, Python, and SparkR.
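The behavior change to document, as a minimal sketch (column names are illustrative):

{code}
val df = sqlContext.createDataFrame(Seq(("x", 1), ("y", 2))).toDF("a", "b")
// after SPARK-7982, the cells for ("x", 2) and ("y", 1), which never
// co-occur, contain 0 rather than null
df.stat.crosstab("a", "b").show()
{code}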
[jira] [Resolved] (SPARK-8915) Add @since tags to mllib.classification
[ https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-8915.
    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 7371 [https://github.com/apache/spark/pull/7371]

Add @since tags to mllib.classification
    Key: SPARK-8915
    URL: https://issues.apache.org/jira/browse/SPARK-8915
    Project: Spark
    Issue Type: Sub-task
    Components: Documentation, MLlib
    Reporter: Xiangrui Meng
    Priority: Minor
    Labels: starter
    Fix For: 1.5.0
    Original Estimate: 1h
    Remaining Estimate: 1h
[jira] [Closed] (SPARK-9036) SparkListenerExecutorMetricsUpdate messages not included in JsonProtocol
[ https://issues.apache.org/jira/browse/SPARK-9036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-9036.
    Resolution: Fixed
    Fix Version/s: 1.5.0
    Target Version/s: 1.5.0

SparkListenerExecutorMetricsUpdate messages not included in JsonProtocol
    Key: SPARK-9036
    URL: https://issues.apache.org/jira/browse/SPARK-9036
    Project: Spark
    Issue Type: Improvement
    Components: Spark Core
    Affects Versions: 1.4.0, 1.4.1
    Reporter: Ryan Williams
    Priority: Minor
    Fix For: 1.5.0

The JsonProtocol added in SPARK-3454 [doesn't include|https://github.com/apache/spark/blob/v1.4.1-rc4/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L95-L96] code for ser/de of [{{SparkListenerExecutorMetricsUpdate}}|https://github.com/apache/spark/blob/v1.4.1-rc4/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L107-L110] messages. The comment notes that they are not used, which presumably refers to the fact that the [{{EventLoggingListener}} doesn't write these events|https://github.com/apache/spark/blob/v1.4.1-rc4/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L200-L201]. However, individual listeners can and should make that determination for themselves; I have recently written custom listeners that would like to consume metrics-update messages as JSON, so it would be nice to round out the JsonProtocol implementation by supporting them.
[jira] [Closed] (SPARK-5423) ExternalAppendOnlyMap won't delete temp spilled file if some exception happens during using it
[ https://issues.apache.org/jira/browse/SPARK-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-5423.
    Resolution: Fixed
    Fix Version/s: 1.5.0
    Target Version/s: 1.5.0

ExternalAppendOnlyMap won't delete temp spilled file if some exception happens during using it
    Key: SPARK-5423
    URL: https://issues.apache.org/jira/browse/SPARK-5423
    Project: Spark
    Issue Type: Improvement
    Components: Shuffle
    Affects Versions: 1.0.0
    Reporter: Shixiong Zhu
    Assignee: Shixiong Zhu
    Fix For: 1.5.0

ExternalAppendOnlyMap won't delete the temp spilled file if some exception happens while using it. There is already a TODO in the comment:

{code}
// TODO: Ensure this gets called even if the iterator isn't drained.
private def cleanup() {
  batchIndex = batchOffsets.length  // Prevent reading any other batch
  val ds = deserializeStream
  deserializeStream = null
  fileStream = null
  ds.close()
  file.delete()
}
{code}
[jira] [Resolved] (SPARK-9154) Implement code generation for StringFormat
[ https://issues.apache.org/jira/browse/SPARK-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-9154.
    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 7546 [https://github.com/apache/spark/pull/7546]

Implement code generation for StringFormat
    Key: SPARK-9154
    URL: https://issues.apache.org/jira/browse/SPARK-9154
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Tarek Auel
    Fix For: 1.5.0
[jira] [Commented] (SPARK-8231) complex function: array_contains
[ https://issues.apache.org/jira/browse/SPARK-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635425#comment-14635425 ]

Pedro Rodriguez commented on SPARK-8231:

I can give this one a shot since I already worked on size, which is somewhat similar.

complex function: array_contains
    Key: SPARK-8231
    URL: https://issues.apache.org/jira/browse/SPARK-8231
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Cheng Hao

array_contains(Array<T>, value): returns TRUE if the array contains value.
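The expected semantics, mirroring Hive's built-in of the same name (a sketch only, since the Spark implementation is what this sub-task adds):

{code}
sqlContext.sql("SELECT array_contains(array(1, 2, 3), 2)")  // true
sqlContext.sql("SELECT array_contains(array(1, 2, 3), 5)")  // false
{code}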
[jira] [Updated] (SPARK-9122) spark.mllib regression should support batch predict
[ https://issues.apache.org/jira/browse/SPARK-9122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-9122:
    Shepherd: Joseph K. Bradley
    Assignee: Yanbo Liang
    Remaining Estimate: 72h
    Original Estimate: 72h

spark.mllib regression should support batch predict
    Key: SPARK-9122
    URL: https://issues.apache.org/jira/browse/SPARK-9122
    Project: Spark
    Issue Type: Improvement
    Components: MLlib, PySpark
    Reporter: Joseph K. Bradley
    Assignee: Yanbo Liang
    Labels: starter
    Original Estimate: 72h
    Remaining Estimate: 72h

Currently, in spark.mllib, generalized linear regression models like LinearRegressionModel, RidgeRegressionModel and LassoModel support predict() via LinearRegressionModelBase.predict, which only takes single rows (feature vectors). It should support batch prediction, taking an RDD. (See other classes which do this already, such as NaiveBayesModel.)
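A hedged sketch of the requested batch API in terms of the existing single-vector predict ({{lrModel}} and {{data}} are hypothetical placeholders for a trained model and a feature RDD):

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// what NaiveBayesModel already offers and the regression models should too:
// map the single-vector predict over an RDD of feature vectors
val predictions: RDD[Double] = data.map((v: Vector) => lrModel.predict(v))
{code}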
[jira] [Updated] (SPARK-8481) GaussianMixtureModel predict accepting single vector
[ https://issues.apache.org/jira/browse/SPARK-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-8481:
    Assignee: Dariusz Kobylarz

GaussianMixtureModel predict accepting single vector
    Key: SPARK-8481
    URL: https://issues.apache.org/jira/browse/SPARK-8481
    Project: Spark
    Issue Type: Improvement
    Components: MLlib
    Reporter: Dariusz Kobylarz
    Assignee: Dariusz Kobylarz
    Priority: Minor
    Labels: GaussianMixtureModel, MLlib
    Fix For: 1.5.0
    Original Estimate: 24h
    Remaining Estimate: 24h

GaussianMixtureModel lacks a method to predict a cluster for a single input vector where no spark context would be involved, i.e.

{code}
/** Maps given point to its cluster index. */
def predict(point: Vector): Int
{code}
[jira] [Commented] (SPARK-3157) Avoid duplicated stats in DecisionTree extractLeftRightNodeAggregates
[ https://issues.apache.org/jira/browse/SPARK-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635469#comment-14635469 ]

Joseph K. Bradley commented on SPARK-3157:

Good point, I'll close this. Thanks!

Avoid duplicated stats in DecisionTree extractLeftRightNodeAggregates
    Key: SPARK-3157
    URL: https://issues.apache.org/jira/browse/SPARK-3157
    Project: Spark
    Issue Type: Improvement
    Components: MLlib
    Reporter: Joseph K. Bradley
    Priority: Minor

Improvement: computation, memory usage. For ordered features, extractLeftRightNodeAggregates() computes pairs of cumulative sums. However, these sums are redundant, since they are simply cumulative sums accumulating from the left and right ends, respectively; only one sum needs to be computed. For unordered features, the left and right aggregates are essentially the same data, copied from the original aggregates but shifted by one index; avoid copying the data.
[jira] [Assigned] (SPARK-9224) OnlineLDAOptimizer Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9224:
    Assignee: (was: Apache Spark)

OnlineLDAOptimizer Performance Improvements
    Key: SPARK-9224
    URL: https://issues.apache.org/jira/browse/SPARK-9224
    Project: Spark
    Issue Type: Bug
    Components: MLlib
    Reporter: Feynman Liang
    Priority: Critical

OnlineLDAOptimizer's current implementation can be improved by using in-place updating (instead of reassignment to vars), reducing the number of transpositions, and using an outer product (instead of looping) to collect stats.
[jira] [Assigned] (SPARK-9224) OnlineLDAOptimizer Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9224:
    Assignee: Apache Spark

OnlineLDAOptimizer Performance Improvements
    Key: SPARK-9224
    URL: https://issues.apache.org/jira/browse/SPARK-9224
    Project: Spark
    Issue Type: Bug
    Components: MLlib
    Reporter: Feynman Liang
    Assignee: Apache Spark
    Priority: Critical

OnlineLDAOptimizer's current implementation can be improved by using in-place updating (instead of reassignment to vars), reducing the number of transpositions, and using an outer product (instead of looping) to collect stats.
[jira] [Commented] (SPARK-9154) Implement code generation for StringFormat
[ https://issues.apache.org/jira/browse/SPARK-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635510#comment-14635510 ]

Apache Spark commented on SPARK-9154:

User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/7570

Implement code generation for StringFormat
    Key: SPARK-9154
    URL: https://issues.apache.org/jira/browse/SPARK-9154
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Tarek Auel
    Fix For: 1.5.0
[jira] [Closed] (SPARK-9128) Get outerclasses and objects at the same time in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-9128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-9128.
    Resolution: Fixed
    Fix Version/s: 1.5.0
    Target Version/s: 1.5.0

Get outerclasses and objects at the same time in ClosureCleaner
    Key: SPARK-9128
    URL: https://issues.apache.org/jira/browse/SPARK-9128
    Project: Spark
    Issue Type: Improvement
    Components: Spark Core
    Affects Versions: 1.0.0
    Reporter: Liang-Chi Hsieh
    Assignee: Liang-Chi Hsieh
    Fix For: 1.5.0

Currently, in ClosureCleaner, the outerclasses and objects are retrieved using two different methods. However, the logic of the two methods is the same, and we can get both the outerclasses and objects with a single method call.
[jira] [Updated] (SPARK-7171) Allow for more flexible use of metric sources
[ https://issues.apache.org/jira/browse/SPARK-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or updated SPARK-7171:
    Assignee: Jacek Lewandowski

Allow for more flexible use of metric sources
    Key: SPARK-7171
    URL: https://issues.apache.org/jira/browse/SPARK-7171
    Project: Spark
    Issue Type: Improvement
    Components: Spark Core
    Affects Versions: 1.3.1
    Reporter: Jacek Lewandowski
    Assignee: Jacek Lewandowski
    Priority: Minor
    Fix For: 1.5.0

With the current API, the user is allowed to add a custom metric source by providing its class in the metrics configuration. Metrics themselves are provided by Codahale, and multiple metrics can therefore be registered in a single source. Basically, we can break the available metrics into two types: push and pull. By push metrics I mean that some execution code updates the metric by itself, either periodically or every n events; pull metrics, on the other hand, include some function which pulls the data from the execution environment when triggered.

h5. Problem
The metric source is instantiated and registered during initialisation, and the user then has no way to access the instantiated object. It is also almost impossible to access the execution environment of the current task. Therefore, a user who wanted to provide his own {{RDD}} implementation along with a dedicated metrics source would find it very difficult to do so in a safe, concise and elegant way.

h5. Proposed solution
At least for the push metrics, it would be nice to be able to retrieve the metrics source of a particular type, or with a particular id, from {{TaskContext}}. That would allow custom tasks to update various metrics and would greatly improve the usability of metrics. This could be achieved quite easily: since {{TaskContext}} is created by {{Executor}}, which has access to the metrics system, the executor could inject a method to retrieve the particular metrics source. This solution wouldn't change the current API, but just introduce one more method in {{TaskContext}}.
[jira] [Closed] (SPARK-7171) Allow for more flexible use of metric sources
[ https://issues.apache.org/jira/browse/SPARK-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-7171.
    Resolution: Fixed
    Fix Version/s: 1.5.0
    Target Version/s: 1.5.0

Allow for more flexible use of metric sources
    Key: SPARK-7171
    URL: https://issues.apache.org/jira/browse/SPARK-7171
    Project: Spark
    Issue Type: Improvement
    Components: Spark Core
    Affects Versions: 1.3.1
    Reporter: Jacek Lewandowski
    Assignee: Jacek Lewandowski
    Priority: Minor
    Fix For: 1.5.0

With the current API, the user is allowed to add a custom metric source by providing its class in the metrics configuration. Metrics themselves are provided by Codahale, and multiple metrics can therefore be registered in a single source. Basically, we can break the available metrics into two types: push and pull. By push metrics I mean that some execution code updates the metric by itself, either periodically or every n events; pull metrics, on the other hand, include some function which pulls the data from the execution environment when triggered.

h5. Problem
The metric source is instantiated and registered during initialisation, and the user then has no way to access the instantiated object. It is also almost impossible to access the execution environment of the current task. Therefore, a user who wanted to provide his own {{RDD}} implementation along with a dedicated metrics source would find it very difficult to do so in a safe, concise and elegant way.

h5. Proposed solution
At least for the push metrics, it would be nice to be able to retrieve the metrics source of a particular type, or with a particular id, from {{TaskContext}}. That would allow custom tasks to update various metrics and would greatly improve the usability of metrics. This could be achieved quite easily: since {{TaskContext}} is created by {{Executor}}, which has access to the metrics system, the executor could inject a method to retrieve the particular metrics source. This solution wouldn't change the current API, but just introduce one more method in {{TaskContext}}.
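To make the proposal concrete, a purely hypothetical sketch ({{getMetricsSource}} is the method the ticket asks for, not an existing API):

{code}
import org.apache.spark.TaskContext

// inside a task: look up a registered push-style source and update a metric
val source = TaskContext.get().getMetricsSource("my-rdd-source")  // hypothetical method
source.metricRegistry.counter("records.processed").inc()
{code}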
[jira] [Resolved] (SPARK-5989) Model import/export for LDAModel
[ https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-5989.
    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 6948 [https://github.com/apache/spark/pull/6948]

Model import/export for LDAModel
    Key: SPARK-5989
    URL: https://issues.apache.org/jira/browse/SPARK-5989
    Project: Spark
    Issue Type: Sub-task
    Components: MLlib
    Affects Versions: 1.3.0
    Reporter: Joseph K. Bradley
    Assignee: Manoj Kumar
    Fix For: 1.5.0

Add save/load for LDAModel and its local and distributed variants.
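The save/load pattern this adds, assuming the merged PR follows the Saveable/Loader convention used by other mllib models (the path is a placeholder):

{code}
import org.apache.spark.mllib.clustering.DistributedLDAModel

ldaModel.save(sc, "/models/lda")  // ldaModel: a trained distributed LDA model
val restored = DistributedLDAModel.load(sc, "/models/lda")
{code}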
[jira] [Commented] (SPARK-9183) NPE / confusing error message when looking up missing function in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635460#comment-14635460 ]

Reynold Xin commented on SPARK-9183:

Actually even that error message is bad — we should throw our own analysis exception here, not let Hive throw it.

NPE / confusing error message when looking up missing function in Spark SQL
    Key: SPARK-9183
    URL: https://issues.apache.org/jira/browse/SPARK-9183
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.4.1, 1.5.0
    Reporter: Josh Rosen
    Priority: Blocker

Try running the following query in Spark Shell with Hive enabled:

{code}
sqlContext.sql("select substr(abc, 0, len(ab) - 1)")
{code}

This query is malformed since there's no {{len}} UDF. Unfortunately, though, this gives a really confusing error as of Spark 1.4:

{code}
: java.lang.NullPointerException
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:643)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652)
	at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
	at org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:380)
	at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
	at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44)
	at org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:380)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	[...]
{code}

In Spark 1.3, on the other hand, this gives a helpful message:

{code}
: java.lang.RuntimeException: Couldn't find function len
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$1.apply(hiveUdfs.scala:55)
	at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$1.apply(hiveUdfs.scala:55)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
	at org.apache.spark.sql.hive.HiveContext$$anon$4.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:267)
	at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:43)
	at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:43)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:43)
	at org.apache.spark.sql.hive.HiveContext$$anon$4.lookupFunction(HiveContext.scala:267)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:431)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:429)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
{code}
[jira] [Commented] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code
[ https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635480#comment-14635480 ]

Robert Beauchemin commented on SPARK-9078:

That was quick. Not sure that I have all the pieces in place for building right now, is it required? ;-) I was just browsing the source code to figure out what would be required to add and use a new fully supported JDBC-based data source (how all the pieces work) and came across the hardcoded SQL statement.

Use of non-standard LIMIT keyword in JDBC tableExists code
    Key: SPARK-9078
    URL: https://issues.apache.org/jira/browse/SPARK-9078
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 1.3.1, 1.4.0
    Reporter: Robert Beauchemin
    Priority: Minor

tableExists in spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses non-standard SQL (specifically, the LIMIT keyword) to determine whether a table exists in a JDBC data source. This will cause an exception in many/most JDBC databases that don't support the LIMIT keyword. See http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql

To check for table existence or an exception, it could be recrafted around "select 1 from $table where 0 = 1", which isn't quite the same (it returns an empty result set rather than the value '1') but would support more data sources, and also supports empty tables. Arguably ugly, and it possibly scans every row on sources that don't support constant folding, but better than failing on JDBC sources that don't support LIMIT. Perhaps "supports LIMIT" could be a field in the JdbcDialect class for dialects that support the keyword to override. The ANSI standard is (OFFSET and) FETCH.

The standard way to check for table existence would be to use information_schema.tables, which is a SQL standard but may not work for JDBC data sources that support SQL but not the information_schema. The JDBC DatabaseMetaData interface provides getSchemas(), which allows checking for the information_schema in drivers that support it.
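A sketch of the portable existence probe suggested in the description, in plain JDBC ({{conn}} and {{table}} are caller-supplied):

{code}
import java.sql.Connection

// returns an empty result set if the table exists; throws if it does not
def tableExists(conn: Connection, table: String): Boolean =
  try {
    val stmt = conn.prepareStatement(s"SELECT 1 FROM $table WHERE 0 = 1")
    try { stmt.executeQuery(); true } finally { stmt.close() }
  } catch {
    case _: java.sql.SQLException => false
  }
{code}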
[jira] [Assigned] (SPARK-9024) Unsafe HashJoin
[ https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reassigned SPARK-9024:
    Assignee: Davies Liu

Unsafe HashJoin
    Key: SPARK-9024
    URL: https://issues.apache.org/jira/browse/SPARK-9024
    Project: Spark
    Issue Type: New Feature
    Components: SQL
    Reporter: Reynold Xin
    Assignee: Davies Liu

Create a version of BroadcastJoin that accepts UnsafeRow as inputs, and outputs UnsafeRow as outputs.
[jira] [Commented] (SPARK-7105) Support model save/load in Python's GaussianMixture
[ https://issues.apache.org/jira/browse/SPARK-7105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635424#comment-14635424 ] Manoj Kumar commented on SPARK-7105: Hi, Are you still working on this? Support model save/load in Python's GaussianMixture --- Key: SPARK-7105 URL: https://issues.apache.org/jira/browse/SPARK-7105 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Joseph K. Bradley Assignee: Yu Ishikawa Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9223) Support model save/load in Python's LDA
Manoj Kumar created SPARK-9223: -- Summary: Support model save/load in Python's LDA Key: SPARK-9223 URL: https://issues.apache.org/jira/browse/SPARK-9223 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code
[ https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635459#comment-14635459 ] Robert Beauchemin commented on SPARK-9078: -- Great, I didn't realize that JdbcDialects.registerDialect was a public API; passing it through to the jdbc data source would do it. Cheers, and thanks, Bob Use of non-standard LIMIT keyword in JDBC tableExists code -- Key: SPARK-9078 URL: https://issues.apache.org/jira/browse/SPARK-9078 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Robert Beauchemin Priority: Minor tableExists in spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses non-standard SQL (specifically, the LIMIT keyword) to determine whether a table exists in a JDBC data source. This will cause an exception in many/most JDBC databases that don't support the LIMIT keyword. See http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql To check for table existence or an exception, it could be recrafted around {{select 1 from $table where 0 = 1}}, which isn't the same (it returns an empty resultset rather than the value '1') but would support more data sources and also support empty tables. It's arguably ugly and possibly queries every row on sources that don't support constant folding, but that is better than failing on JDBC sources that don't support LIMIT. Perhaps 'supports LIMIT' could be a field in the JdbcDialect class for dialects that support the keyword to override. The ANSI standard is (OFFSET and) FETCH. The standard way to check for table existence would be to use information_schema.tables, which is part of the SQL standard but may not work for JDBC data sources that support SQL but not information_schema. The JDBC DatabaseMetaData interface provides getSchemas(), which allows checking for the information_schema in drivers that support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9224) OnlineLDAOptimizer Performance Improvements
Feynman Liang created SPARK-9224: Summary: OnlineLDAOptimizer Performance Improvements Key: SPARK-9224 URL: https://issues.apache.org/jira/browse/SPARK-9224 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Priority: Critical OnlineLDAOptimizer's current implementation can be improved by using in-place updates (instead of reassignment to vars), reducing the number of transpositions, and using an outer product (instead of looping) to collect stats. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
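To make the first and last points concrete, a hedged Breeze sketch (illustrative shapes, not the actual OnlineLDAOptimizer code): the reassignment form allocates a fresh k x v matrix for the running statistics on every document, while the in-place outer-product form accumulates into the existing buffer.
{code}
import breeze.linalg.{DenseMatrix, DenseVector}

val k = 10    // number of topics (illustrative)
val v = 1000  // vocabulary size (illustrative)
var stats = DenseMatrix.zeros[Double](k, v)

val gamma = DenseVector.rand(k)  // per-document topic weights
val cts = DenseVector.rand(v)    // word counts for one document

// Reassignment: builds stats + outer(gamma, cts) as a brand-new k x v matrix.
stats = stats + gamma * cts.t

// In-place outer-product update: accumulates into the existing buffer.
stats :+= gamma * cts.t
{code}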
[jira] [Created] (SPARK-9225) LDASuite needs unit tests for empty documents
Feynman Liang created SPARK-9225: Summary: LDASuite needs unit tests for empty documents Key: SPARK-9225 URL: https://issues.apache.org/jira/browse/SPARK-9225 Project: Spark Issue Type: Test Components: MLlib Reporter: Feynman Liang Priority: Minor We need to add a unit test to {{LDASuite}} which checks that empty documents are handled appropriately without crashing. This would require defining an empty corpus within {{LDASuite}} and adding tests for the available LDA optimizers (currently EM and Online). Note that only {{SparseVector}}s can be empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
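A hedged sketch of what such a test could look like, assuming the spark.mllib LDA API and a SparkContext {{sc}} from the test fixture:
{code}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val vocabSize = 10

// A corpus containing an empty document; only SparseVectors can be empty.
val corpus = sc.parallelize(Seq[(Long, Vector)](
  (0L, Vectors.sparse(vocabSize, Array.empty[Int], Array.empty[Double])),
  (1L, Vectors.dense(Array.fill(vocabSize)(1.0)))
))

// Both optimizers should handle the empty document without crashing.
for (optimizer <- Seq("em", "online")) {
  val model = new LDA().setK(2).setOptimizer(optimizer).setMaxIterations(5).run(corpus)
  assert(model.vocabSize == vocabSize)
}
{code}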
[jira] [Updated] (SPARK-9224) OnlineLDAOptimizer Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9224: - Priority: Major (was: Critical) Issue Type: Improvement (was: Bug) OnlineLDAOptimizer Performance Improvements --- Key: SPARK-9224 URL: https://issues.apache.org/jira/browse/SPARK-9224 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Feynman Liang OnlineLDAOptimizer's current implementation can be improved by using in-place updates (instead of reassignment to vars), reducing the number of transpositions, and using an outer product (instead of looping) to collect stats. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks
[ https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4598. Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 1.5.0 Target Version/s: 1.5.0 Paginate stage page to avoid OOM with 100,000 tasks - Key: SPARK-4598 URL: https://issues.apache.org/jira/browse/SPARK-4598 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.2.0 Reporter: meiyoula Assignee: Shixiong Zhu Fix For: 1.5.0 On the HistoryServer stage page, clicking the task link in the Description column triggers a GC error. The detailed error message is: 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-352] | Error for /history/application_1416206401491_0010/stages/stage/ | org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590) java.lang.OutOfMemoryError: GC overhead limit exceeded 2014-11-17 16:36:30,851 | WARN | [qtp1083955615-364] | handle failed | org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697) java.lang.OutOfMemoryError: GC overhead limit exceeded -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9165) Implement code generation for CreateArray, CreateStruct, and CreateNamedStruct
[ https://issues.apache.org/jira/browse/SPARK-9165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-9165: Shepherd: Michael Armbrust Implement code generation for CreateArray, CreateStruct, and CreateNamedStruct -- Key: SPARK-9165 URL: https://issues.apache.org/jira/browse/SPARK-9165 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Yijie Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9222) Make class instantiation variables in DistributedLDAModel [private] clustering
Manoj Kumar created SPARK-9222: -- Summary: Make class instantiation variables in DistributedLDAModel [private] clustering Key: SPARK-9222 URL: https://issues.apache.org/jira/browse/SPARK-9222 Project: Spark Issue Type: Test Components: MLlib Reporter: Manoj Kumar Priority: Minor This would enable testing the various class variables, like docConcentration, topicConcentration, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code
[ https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635461#comment-14635461 ] Reynold Xin commented on SPARK-9078: Want to submit a pull request? :) Use of non-standard LIMIT keyword in JDBC tableExists code -- Key: SPARK-9078 URL: https://issues.apache.org/jira/browse/SPARK-9078 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.4.0 Reporter: Robert Beauchemin Priority: Minor tableExists in spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses non-standard SQL (specifically, the LIMIT keyword) to determine whether a table exists in a JDBC data source. This will cause an exception in many/most JDBC databases that don't support the LIMIT keyword. See http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql To check for table existence or an exception, it could be recrafted around {{select 1 from $table where 0 = 1}}, which isn't the same (it returns an empty resultset rather than the value '1') but would support more data sources and also support empty tables. It's arguably ugly and possibly queries every row on sources that don't support constant folding, but that is better than failing on JDBC sources that don't support LIMIT. Perhaps 'supports LIMIT' could be a field in the JdbcDialect class for dialects that support the keyword to override. The ANSI standard is (OFFSET and) FETCH. The standard way to check for table existence would be to use information_schema.tables, which is part of the SQL standard but may not work for JDBC data sources that support SQL but not information_schema. The JDBC DatabaseMetaData interface provides getSchemas(), which allows checking for the information_schema in drivers that support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-9154) Implement code generation for StringFormat
[ https://issues.apache.org/jira/browse/SPARK-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-9154: - Reopening since this broke the build. Implement code generation for StringFormat -- Key: SPARK-9154 URL: https://issues.apache.org/jira/browse/SPARK-9154 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Tarek Auel Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9122) spark.mllib regression should support batch predict
[ https://issues.apache.org/jira/browse/SPARK-9122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635454#comment-14635454 ] Joseph K. Bradley commented on SPARK-9122: -- OK thank you! spark.mllib regression should support batch predict --- Key: SPARK-9122 URL: https://issues.apache.org/jira/browse/SPARK-9122 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Joseph K. Bradley Labels: starter Currently, in spark.mllib, generalized linear regression models like LinearRegressionModel, RidgeRegressionModel and LassoModel support predict() via {{LinearRegressionModelBase.predict}}, which only takes single rows (feature vectors). It should support batch prediction, taking an RDD. (See other classes which already do this, such as NaiveBayesModel.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
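A hedged sketch of the requested overload, mirroring the batch-predict pattern of NaiveBayesModel; the trait below is a simplified stand-in, not the actual spark.mllib class hierarchy:
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait RegressionModelLike extends Serializable {
  // Existing single-row prediction.
  def predict(testData: Vector): Double

  // Requested batch overload: map the single-row predict over an RDD,
  // shipping the model to executors once per task rather than per record.
  def predict(testData: RDD[Vector]): RDD[Double] = {
    val model = this
    testData.map(v => model.predict(v))
  }
}
{code}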
[jira] [Resolved] (SPARK-8481) GaussianMixtureModel predict accepting single vector
[ https://issues.apache.org/jira/browse/SPARK-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-8481. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6906 [https://github.com/apache/spark/pull/6906] GaussianMixtureModel predict accepting single vector Key: SPARK-8481 URL: https://issues.apache.org/jira/browse/SPARK-8481 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Dariusz Kobylarz Priority: Minor Labels: GaussianMixtureModel, MLlib Fix For: 1.5.0 Original Estimate: 24h Remaining Estimate: 24h GaussianMixtureModel lacks a method to predict the cluster for a single input vector without involving a Spark context, i.e. {{/** Maps given point to its cluster index. */ def predict(point: Vector): Int}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3157) Avoid duplicated stats in DecisionTree extractLeftRightNodeAggregates
[ https://issues.apache.org/jira/browse/SPARK-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-3157. Resolution: Fixed Assignee: Joseph K. Bradley Fix Version/s: 1.2.0 Avoid duplicated stats in DecisionTree extractLeftRightNodeAggregates - Key: SPARK-3157 URL: https://issues.apache.org/jira/browse/SPARK-3157 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Fix For: 1.2.0 Improvement: computation, memory usage For ordered features, extractLeftRightNodeAggregates() computes pairs of cumulative sums. However, these sums are redundant since they are simply cumulative sums accumulating from the left and right ends, respectively. Only compute one sum. For unordered features, the left and right aggregates are essentially the same data, copied from the original aggregates, but shifted by one index. Avoid copying data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
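To make the redundancy for ordered features concrete, a small standalone example (illustrative numbers, not the DecisionTree code): the right-accumulating sums are fully determined by the total and the left prefix sums, so only one cumulative pass is needed.
{code}
val stats = Array(3.0, 1.0, 4.0, 1.0, 5.0)

// Left cumulative sums: leftSum(i) = stats(0) + ... + stats(i)
val leftSum = stats.scanLeft(0.0)(_ + _).tail
val total = leftSum.last

// Right cumulative sums need no second pass or second array:
// rightSum(i) = stats(i) + ... + stats(last) = total - leftSum(i - 1)
def rightSum(i: Int): Double = if (i == 0) total else total - leftSum(i - 1)

assert(rightSum(2) == 4.0 + 1.0 + 5.0)
{code}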
[jira] [Commented] (SPARK-9224) OnlineLDAOptimizer Performance Improvements
[ https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635496#comment-14635496 ] Apache Spark commented on SPARK-9224: - User 'feynmanliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7454 OnlineLDAOptimizer Performance Improvements --- Key: SPARK-9224 URL: https://issues.apache.org/jira/browse/SPARK-9224 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Priority: Critical OnlineLDAOptimizer's current implementation can be improved by using in-place updates (instead of reassignment to vars), reducing the number of transpositions, and using an outer product (instead of looping) to collect stats. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9238) two extra useless entries for bytesOfCodePointInUTF8
[ https://issues.apache.org/jira/browse/SPARK-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636206#comment-14636206 ] Apache Spark commented on SPARK-9238: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/7582 two extra useless entries for bytesOfCodePointInUTF8 Key: SPARK-9238 URL: https://issues.apache.org/jira/browse/SPARK-9238 Project: Spark Issue Type: Bug Components: SQL Reporter: zhichao-li Priority: Trivial Only a trivial thing; I'm not sure if I understand correctly, but I guess only 2 entries in bytesOfCodePointInUTF8 are needed for the case of 6-byte code points (lead byte 1111110x). Details can be found at https://en.wikipedia.org/wiki/UTF-8 in the Description section. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9238) two extra useless entries for bytesOfCodePointInUTF8
zhichao-li created SPARK-9238: - Summary: two extra useless entries for bytesOfCodePointInUTF8 Key: SPARK-9238 URL: https://issues.apache.org/jira/browse/SPARK-9238 Project: Spark Issue Type: Bug Components: SQL Reporter: zhichao-li Priority: Trivial Only a trivial thing; I'm not sure if I understand correctly, but I guess only 2 entries in bytesOfCodePointInUTF8 are needed for the case of 6-byte code points (lead byte 1111110x). Details can be found at https://en.wikipedia.org/wiki/UTF-8 in the Description section. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
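For context, a hedged sketch of the lead-byte-to-length mapping such a table encodes (a standalone illustration, not Spark's actual array): 6-byte sequences use lead bytes of the form 1111110x, i.e. only 0xFC and 0xFD, which is why two entries suffice.
{code}
// UTF-8 sequence length implied by the lead byte (see the Wikipedia table).
def bytesForLeadByte(b: Int): Int = b match {
  case x if (x & 0x80) == 0x00 => 1 // 0xxxxxxx
  case x if (x & 0xE0) == 0xC0 => 2 // 110xxxxx
  case x if (x & 0xF0) == 0xE0 => 3 // 1110xxxx
  case x if (x & 0xF8) == 0xF0 => 4 // 11110xxx
  case x if (x & 0xFC) == 0xF8 => 5 // 111110xx
  case x if (x & 0xFE) == 0xFC => 6 // 1111110x: only 0xFC and 0xFD
  case _ => 1 // continuation or invalid lead byte; real code must handle this
}

// Exactly two lead-byte values map to 6-byte sequences.
assert((0x00 to 0xFF).count(bytesForLeadByte(_) == 6) == 2)
{code}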
[jira] [Created] (SPARK-9240) Hybrid aggregate operator
Yin Huai created SPARK-9240: --- Summary: Hybrid aggregate operator Key: SPARK-9240 URL: https://issues.apache.org/jira/browse/SPARK-9240 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker We need a hybrid aggregate operator, which first tries hash-based aggregation and gracefully switches to sort-based aggregation if the hash map's memory footprint exceeds a given threshold (how to track the memory footprint and how to set the threshold?). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
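A heavily simplified, hedged sketch of the control flow being proposed, using local collections in place of the unsafe hash map and spilling machinery (the real operator would track memory through the task memory manager rather than an entry count):
{code}
// Sketch: hash-based aggregation with a sort-based fallback.
def hybridAggregate[K: Ordering, V](
    input: Iterator[(K, V)],
    merge: (V, V) => V,
    maxHashEntries: Int): Iterator[(K, V)] = {
  val hash = scala.collection.mutable.HashMap.empty[K, V]
  var deferred = List.empty[(K, V)]
  for ((k, v) <- input) {
    if (hash.contains(k) || hash.size < maxHashEntries) {
      hash(k) = hash.get(k).map(merge(_, v)).getOrElse(v)
    } else {
      deferred ::= (k -> v) // threshold exceeded: defer to the sort-based path
    }
  }
  // Sort-based path: order the deferred records together with the partial
  // aggregates from the hash map, then merge the values of equal keys.
  (hash.iterator ++ deferred.iterator).toSeq
    .sortBy(_._1)
    .groupBy(_._1)
    .map { case (k, kvs) => k -> kvs.map(_._2).reduce(merge) }
    .iterator
}
{code}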
[jira] [Updated] (SPARK-9240) Hybrid aggregate operator using unsafe row
[ https://issues.apache.org/jira/browse/SPARK-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-9240: Summary: Hybrid aggregate operator using unsafe row (was: Hybrid aggregate operator) Hybrid aggregate operator using unsafe row -- Key: SPARK-9240 URL: https://issues.apache.org/jira/browse/SPARK-9240 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker We need a hybrid aggregate operator, which first tries hash-based aggregation and gracefully switches to sort-based aggregation if the hash map's memory footprint exceeds a given threshold (how to track the memory footprint and how to set the threshold?). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9244) Increase some default memory limits
Matei Zaharia created SPARK-9244: Summary: Increase some default memory limits Key: SPARK-9244 URL: https://issues.apache.org/jira/browse/SPARK-9244 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Matei Zaharia There are a few memory limits that people hit often and that we could make higher, especially now that memory sizes have grown. - spark.akka.frameSize: This defaults to 10 but is often hit for map output statuses in large shuffles. AFAIK the memory is not fully allocated up-front, so we can just make this larger and still not affect jobs that never send a status that large. - spark.executor.memory: Defaults to 512m, which is really small. We can at least increase it to 1g, though this is something users do need to set on their own. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
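For reference, this is how the two settings are overridden per-application today; the values below are illustrative, while the ticket proposes raising the shipped defaults so users would not need to do this:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.akka.frameSize", "64")   // MB; raised for large map output statuses
  .set("spark.executor.memory", "1g")  // up from the old 512m default

val sc = new SparkContext(conf)
{code}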
[jira] [Commented] (SPARK-9053) Fix spaces around parens, infix operators etc.
[ https://issues.apache.org/jira/browse/SPARK-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636060#comment-14636060 ] Shivaram Venkataraman commented on SPARK-9053: -- Yeah - there are a bunch of real issues to be fixed first, and we can discuss the ignore rule after that. Also, I don't think we should ignore all warnings of this form -- just, say, on the `^` operator, or we can mark out portions of the code that need to be ignored, etc. Fix spaces around parens, infix operators etc. -- Key: SPARK-9053 URL: https://issues.apache.org/jira/browse/SPARK-9053 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman We have a number of style errors which look like {code} Place a space before left parenthesis ... Put spaces around all infix operators. {code} However, some of the warnings are spurious (for example, the space around the infix operator in {code} expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"], sqrt(4^2 + 8^2)) {code}). We should add an ignore rule for these spurious examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8641) Native Spark Window Functions
[ https://issues.apache.org/jira/browse/SPARK-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-8641: - Description: *Rationale* The window operator currently uses Hive UDAFs for all aggregation operations. This is fine in terms of performance and functionality. However, they limit extensibility, and they are quite opaque in terms of processing and memory usage. The latter blocks advanced optimizations such as code generation and tungsten-style (advanced) memory management. *Requirements* We want to address this by replacing the Hive UDAFs with native Spark SQL UDAFs. A redesign of the Spark UDAFs is currently underway, see SPARK-4366. The new window UDAFs should use this new standard, in order to make them as future-proof as possible. Although we are replacing the standard Hive UDAFs, other existing Hive UDAFs should still be supported. The new window UDAFs should, at least, cover all existing Hive standard window UDAFs: # FIRST_VALUE # LAST_VALUE # LEAD # LAG # ROW_NUMBER # RANK # DENSE_RANK # PERCENT_RANK # NTILE # CUME_DIST All these functions imply a row order; this means that in order to use these functions properly, an ORDER BY clause must be defined. The first and last value UDAFs are already present in Spark SQL. The only thing which needs to be added is skip-NULL functionality. LEAD and LAG are not aggregates. These expressions return the value of an expression a number of rows before (LAG) or ahead (LEAD) of the current row. These expressions put a constraint on the window frame in which they are executed: this can only be a row frame with equal offsets. The ROW_NUMBER() function can be seen as a count in a running row frame (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). RANK(), DENSE_RANK(), PERCENT_RANK(), NTILE(..) and CUME_DIST() are dependent on the actual values in the ORDER BY clause. The ORDER BY expression(s) must be made available before these functions are evaluated. All these functions will have a fixed frame, but this will be dependent on the implementation (probably a running row frame). PERCENT_RANK(), NTILE(..) and CUME_DIST() are also dependent on the size of the partition being evaluated. The partition size must either be made available during evaluation (this is perfectly feasible in the current implementation) or the expression must be divided over two window functions and a merging expression; for instance, PERCENT_RANK() would look like this: {noformat} (RANK() OVER (PARTITION BY x ORDER BY y) - 1) / (COUNT(*) OVER (PARTITION BY x) - 1) {noformat} *Design* The old WindowFunction interface will be replaced by the following (initial/very early) design (including sub-classes): {noformat}
/**
 * A window function is a function that can only be evaluated in the context of a window operator.
 */
trait WindowFunction { self: Expression =>
  /**
   * Define the frame in which the window operator must be executed.
   */
  def frame: WindowFrame = UnspecifiedFrame
}

/**
 * Base class for LEAD/LAG offset window functions.
 *
 * These are ordinary expressions; the idea is that the Window operator will process these in a
 * separate (specialized) window frame.
 */
abstract class OffsetWindowFunction(val child: Expression, val offset: Int, val default: Expression) {
  override def deterministic: Boolean = false
  ...
}

case class Lead(child: Expression, offset: Int, default: Expression)
    extends OffsetWindowFunction(child, offset, default) {
  override val frame = SpecifiedWindowFrame(RowFrame, ValueFollowing(offset), ValueFollowing(offset))
  ...
}

case class Lag(child: Expression, offset: Int, default: Expression)
    extends OffsetWindowFunction(child, offset, default) {
  override val frame = SpecifiedWindowFrame(RowFrame, ValuePreceding(offset), ValuePreceding(offset))
  ...
}

case class RowNumber() extends AlgebraicAggregate with WindowFunction {
  override def deterministic: Boolean = false
  override val frame = SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow)
  ...
}

abstract class RankLike(val order: Seq[Expression] = Nil) extends AlgebraicAggregate with WindowFunction {
  override def deterministic: Boolean = true
  // This can be injected by either the Planner or the Window operator.
  def withOrderSpec(orderSpec: Seq[Expression]): AggregateWindowFunction
  // This will be injected by the Window operator.
  // Only needed by: PERCENT_RANK(), NTILE(..) and CUME_DIST(). Maybe put this in a subclass.
  def withPartitionSize(size: MutableLiteral): AggregateWindowFunction
  // We can do this as long as partition size is available before execution...
  override val frame = SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow)
  ...
}

case class Rank(order: Seq[Expression] = Nil) extends RankLike(order) { ... }
case class DenseRank(order: Seq[Expression] = Nil) extends RankLike(order) { ... }
case class
[jira] [Updated] (SPARK-7368) add QR decomposition for RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7368: - Shepherd: Xiangrui Meng add QR decomposition for RowMatrix -- Key: SPARK-7368 URL: https://issues.apache.org/jira/browse/SPARK-7368 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Original Estimate: 48h Remaining Estimate: 48h Add QR decomposition for RowMatrix. There's a great distributed algorithm for QR decomposition, which I'm currently referring to. Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International Conference on Big Data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7368) add QR decomposition for RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7368: - Assignee: yuhao yang add QR decomposition for RowMatrix -- Key: SPARK-7368 URL: https://issues.apache.org/jira/browse/SPARK-7368 Project: Spark Issue Type: Improvement Components: MLlib Reporter: yuhao yang Assignee: yuhao yang Original Estimate: 48h Remaining Estimate: 48h Add QR decomposition for RowMatrix. There's a great distributed algorithm for QR decomposition, which I'm currently referring to. Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International Conference on Big Data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7368) add QR decomposition for RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7368: - Issue Type: New Feature (was: Improvement) add QR decomposition for RowMatrix -- Key: SPARK-7368 URL: https://issues.apache.org/jira/browse/SPARK-7368 Project: Spark Issue Type: New Feature Components: MLlib Reporter: yuhao yang Assignee: yuhao yang Original Estimate: 48h Remaining Estimate: 48h Add QR decomposition for RowMatrix. There's a great distributed algorithm for QR decomposition, which I'm currently referring to. Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International Conference on Big Data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
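The cited paper's tall-and-skinny QR (TSQR) idea, in a hedged Breeze sketch rather than the eventual RowMatrix implementation: factor each partition's block locally, then repeatedly stack and re-factor the small R factors. The final R equals the R of the full matrix, and Q can then be recovered as A * inv(R) in the paper's indirect variant.
{code}
import breeze.linalg.{DenseMatrix, qr}
import org.apache.spark.rdd.RDD

// Tall-and-skinny QR: compute R without materializing the full matrix.
// Assumes every block has the same (small) number of columns.
def tsqrR(blocks: RDD[DenseMatrix[Double]]): DenseMatrix[Double] = {
  blocks
    .map(block => qr.reduced(block).r) // local QR per partition
    .reduce { (a, b) =>
      // Stack two R factors and re-factor; keep only the new R.
      qr.reduced(DenseMatrix.vertcat(a, b)).r
    }
}
{code}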
[jira] [Created] (SPARK-9239) HiveUDAF support for AggregateFunction2
Yin Huai created SPARK-9239: --- Summary: HiveUDAF support for AggregateFunction2 Key: SPARK-9239 URL: https://issues.apache.org/jira/browse/SPARK-9239 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker We need to build a wrapper for Hive UDAFs on top of AggregateFunction2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
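Very roughly, such a wrapper would adapt Hive's buffer lifecycle (new buffer / iterate / merge / terminate) to the new interface's initialize / update / merge / eval callbacks. A schematic sketch with illustrative, simplified interfaces on both sides; the real AggregateFunction2 and Hive evaluator APIs differ in detail (e.g. they work with ObjectInspectors and mutable rows):
{code}
// Both traits below are simplified stand-ins for this sketch.
trait HiveUDAFEvaluatorLike {
  type AggBuffer
  def getNewAggregationBuffer(): AggBuffer
  def iterate(buf: AggBuffer, args: Seq[Any]): Unit
  def merge(buf: AggBuffer, partial: AggBuffer): Unit
  def terminate(buf: AggBuffer): Any
}

abstract class AggregateFunction2Like {
  def initialize(): Any
  def update(buffer: Any, input: Seq[Any]): Any
  def merge(buffer1: Any, buffer2: Any): Any
  def eval(buffer: Any): Any
}

// The wrapper forwards each Spark-side callback to the Hive evaluator.
class HiveUDAFWrapper(val evaluator: HiveUDAFEvaluatorLike) extends AggregateFunction2Like {
  private def buf(b: Any) = b.asInstanceOf[evaluator.AggBuffer]
  override def initialize(): Any = evaluator.getNewAggregationBuffer()
  override def update(buffer: Any, input: Seq[Any]): Any = {
    evaluator.iterate(buf(buffer), input); buffer
  }
  override def merge(b1: Any, b2: Any): Any = {
    evaluator.merge(buf(b1), buf(b2)); b1
  }
  override def eval(buffer: Any): Any = evaluator.terminate(buf(buffer))
}
{code}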
[jira] [Commented] (SPARK-9053) Fix spaces around parens, infix operators etc.
[ https://issues.apache.org/jira/browse/SPARK-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636321#comment-14636321 ] Apache Spark commented on SPARK-9053: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/7584 Fix spaces around parens, infix operators etc. -- Key: SPARK-9053 URL: https://issues.apache.org/jira/browse/SPARK-9053 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman We have a number of style errors which look like {code} Place a space before left parenthesis ... Put spaces around all infix operators. {code} However, some of the warnings are spurious (for example, the space around the infix operator in {code} expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"], sqrt(4^2 + 8^2)) {code}). We should add an ignore rule for these spurious examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9053) Fix spaces around parens, infix operators etc.
[ https://issues.apache.org/jira/browse/SPARK-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9053: --- Assignee: (was: Apache Spark) Fix spaces around parens, infix operators etc. -- Key: SPARK-9053 URL: https://issues.apache.org/jira/browse/SPARK-9053 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman We have a number of style errors which look like {code} Place a space before left parenthesis ... Put spaces around all infix operators. {code} However, some of the warnings are spurious (for example, the space around the infix operator in {code} expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"], sqrt(4^2 + 8^2)) {code}). We should add an ignore rule for these spurious examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9121) Get rid of the warnings about `no visible global function definition` in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-9121. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7567 [https://github.com/apache/spark/pull/7567] Get rid of the warnings about `no visible global function definition` in SparkR --- Key: SPARK-9121 URL: https://issues.apache.org/jira/browse/SPARK-9121 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Fix For: 1.5.0 We have a lot of warnings about {{no visible global function definition}} in SparkR. So we should get rid of them. {noformat} R/utils.R:513:5: warning: no visible global function definition for ‘processClosure’ processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv) ^~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9121) Get rid of the warnings about `no visible global function definition` in SparkR
[ https://issues.apache.org/jira/browse/SPARK-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-9121: - Assignee: Yu Ishikawa Get rid of the warnings about `no visible global function definition` in SparkR --- Key: SPARK-9121 URL: https://issues.apache.org/jira/browse/SPARK-9121 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Yu Ishikawa Assignee: Yu Ishikawa Fix For: 1.5.0 We have a lot of warnings about {{no visible global function definition}} in SparkR. So we should get rid of them. {noformat} R/utils.R:513:5: warning: no visible global function definition for ‘processClosure’ processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv) ^~ {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9236) Left Outer Join with empty JavaPairRDD returns empty RDD
Vitalii Slobodianyk created SPARK-9236: -- Summary: Left Outer Join with empty JavaPairRDD returns empty RDD Key: SPARK-9236 URL: https://issues.apache.org/jira/browse/SPARK-9236 Project: Spark Issue Type: Bug Affects Versions: 1.4.1, 1.3.1 Reporter: Vitalii Slobodianyk When the *left outer join* is performed on a non-empty {{JavaPairRDD}} with a {{JavaPairRDD}} which was created with the {{emptyRDD()}} method, the resulting RDD is empty. In the following unit test the last assert fails.
{code}
import static org.assertj.core.api.Assertions.assertThat;

import java.util.Collections;

import lombok.val;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.Test;

import scala.Tuple2;

public class SparkTest {

  @Test
  public void joinEmptyRDDTest() {
    val sparkConf = new SparkConf().setAppName("test").setMaster("local");
    try (val sparkContext = new JavaSparkContext(sparkConf)) {
      val oneRdd = sparkContext.parallelize(Collections.singletonList("one"));
      val twoRdd = sparkContext.parallelize(Collections.singletonList("two"));
      val threeRdd = sparkContext.emptyRDD();
      val onePair = oneRdd.mapToPair(t -> new Tuple2<Integer, String>(1, t));
      val twoPair = twoRdd.groupBy(t -> 1);
      val threePair = threeRdd.groupBy(t -> 1);
      assertThat(onePair.leftOuterJoin(twoPair).collect()).isNotEmpty();
      assertThat(onePair.leftOuterJoin(threePair).collect()).isNotEmpty();
    }
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9238) two extra useless entries for bytesOfCodePointInUTF8
[ https://issues.apache.org/jira/browse/SPARK-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9238: --- Assignee: (was: Apache Spark) two extra useless entries for bytesOfCodePointInUTF8 Key: SPARK-9238 URL: https://issues.apache.org/jira/browse/SPARK-9238 Project: Spark Issue Type: Bug Components: SQL Reporter: zhichao-li Priority: Trivial Only a trivial thing; I'm not sure if I understand correctly, but I guess only 2 entries in bytesOfCodePointInUTF8 are needed for the case of 6-byte code points (lead byte 1111110x). Details can be found at https://en.wikipedia.org/wiki/UTF-8 in the Description section. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9238) two extra useless entries for bytesOfCodePointInUTF8
[ https://issues.apache.org/jira/browse/SPARK-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9238: --- Assignee: Apache Spark two extra useless entries for bytesOfCodePointInUTF8 Key: SPARK-9238 URL: https://issues.apache.org/jira/browse/SPARK-9238 Project: Spark Issue Type: Bug Components: SQL Reporter: zhichao-li Assignee: Apache Spark Priority: Trivial Only a trivial thing; I'm not sure if I understand correctly, but I guess only 2 entries in bytesOfCodePointInUTF8 are needed for the case of 6-byte code points (lead byte 1111110x). Details can be found at https://en.wikipedia.org/wiki/UTF-8 in the Description section. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8232) complex function: sort_array
[ https://issues.apache.org/jira/browse/SPARK-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636146#comment-14636146 ] Apache Spark commented on SPARK-8232: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/7581 complex function: sort_array Key: SPARK-8232 URL: https://issues.apache.org/jira/browse/SPARK-8232 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao sort_array(Array<T>): Sorts the input array in ascending order according to the natural ordering of the array elements and returns it -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
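Expected behavior in a hedged example, assuming the function also gets exposed through {{org.apache.spark.sql.functions}} once implemented ({{sqlContext}} and the DataFrame are illustrative):
{code}
import org.apache.spark.sql.functions.sort_array

val df = sqlContext.createDataFrame(Seq((1, Seq(3, 1, 2)))).toDF("id", "xs")

// Natural ascending order of the element type: [3, 1, 2] => [1, 2, 3]
df.select(sort_array(df("xs"))).show()
{code}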
[jira] [Assigned] (SPARK-8232) complex function: sort_array
[ https://issues.apache.org/jira/browse/SPARK-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8232: --- Assignee: Apache Spark (was: Cheng Hao) complex function: sort_array Key: SPARK-8232 URL: https://issues.apache.org/jira/browse/SPARK-8232 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark sort_array(Array<T>): Sorts the input array in ascending order according to the natural ordering of the array elements and returns it -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8232) complex function: sort_array
[ https://issues.apache.org/jira/browse/SPARK-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8232: --- Assignee: Cheng Hao (was: Apache Spark) complex function: sort_array Key: SPARK-8232 URL: https://issues.apache.org/jira/browse/SPARK-8232 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao sort_array(Array<T>): Sorts the input array in ascending order according to the natural ordering of the array elements and returns it -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4366) Aggregation Improvement
[ https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-4366: Priority: Critical (was: Major) Aggregation Improvement --- Key: SPARK-4366 URL: https://issues.apache.org/jira/browse/SPARK-4366 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Critical Attachments: aggregatefunction_v1.pdf This improvement actually includes a couple of sub-tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9053) Fix spaces around parens, infix operators etc.
[ https://issues.apache.org/jira/browse/SPARK-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9053: --- Assignee: Apache Spark Fix spaces around parens, infix operators etc. -- Key: SPARK-9053 URL: https://issues.apache.org/jira/browse/SPARK-9053 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Apache Spark We have a number of style errors which look like {code} Place a space before left parenthesis ... Put spaces around all infix operators. {code} However, some of the warnings are spurious (for example, the space around the infix operator in {code} expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"], sqrt(4^2 + 8^2)) {code}). We should add an ignore rule for these spurious examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9210) checkValidAggregateExpression() throws exceptions with bad error messages
[ https://issues.apache.org/jira/browse/SPARK-9210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636115#comment-14636115 ] Simeon Simeonov commented on SPARK-9210: Standalone test demonstrating the problem; spark-shell output: https://gist.github.com/ssimeonov/72c8a9b01f99e35ba470 checkValidAggregateExpression() throws exceptions with bad error messages - Key: SPARK-9210 URL: https://issues.apache.org/jira/browse/SPARK-9210 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.1 Environment: N/A Reporter: Simeon Simeonov Priority: Trivial When a result column in {{SELECT ... GROUP BY}} is neither one of the {{GROUP BY}} expressions nor uses an aggregation function, {{org.apache.spark.sql.catalyst.analysis.CheckAnalysis}} throws {{org.apache.spark.sql.AnalysisException}} with the message expression '_column expression_' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() if you don't care which value you get. The remedy suggestion in the exception message is incorrect: the function name is {{first_value}}, not {{first}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
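The mismatch is easy to reproduce with a hedged snippet (table and column names invented): 'b' below is neither grouped nor aggregated, so analysis fails and the message suggests wrapping in first(), while the SQL function registered at the time was first_value.
{code}
// Throws AnalysisException: expression 'b' is neither present in the group by,
// nor is it an aggregate function. Add to group by or wrap in first() ...
sqlContext.sql("SELECT a, b FROM t GROUP BY a")
{code}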
[jira] [Assigned] (SPARK-4367) Partial aggregation support the DISTINCT aggregation
[ https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4367: --- Assignee: (was: Apache Spark) Partial aggregation support the DISTINCT aggregation Key: SPARK-4367 URL: https://issues.apache.org/jira/browse/SPARK-4367 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Most aggregate functions (e.g. average) with distinct values require all of the records in the same group to be shuffled to a single node. However, as part of the optimization, those records can be partially aggregated before shuffling, which probably reduces the overhead of shuffling significantly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3056) Sort-based Aggregation
[ https://issues.apache.org/jira/browse/SPARK-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-3056: --- Assignee: (was: Apache Spark) Sort-based Aggregation -- Key: SPARK-3056 URL: https://issues.apache.org/jira/browse/SPARK-3056 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Cheng Hao Currently, Spark SQL only supports hash-based aggregation, which may cause OOM if there are too many distinct keys in the input tuples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9227) Add option to set logging level in Spark Context Constructor
[ https://issues.apache.org/jira/browse/SPARK-9227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635655#comment-14635655 ] Sean Owen commented on SPARK-9227: -- Why? There are already logging-framework APIs for this, both config-driven and programmatic. Add option to set logging level in Spark Context Constructor Key: SPARK-9227 URL: https://issues.apache.org/jira/browse/SPARK-9227 Project: Spark Issue Type: Wish Reporter: Auberon López Priority: Minor It would be nice to be able to set the logging level in the constructor of a Spark Context. This provides a cleaner interface than needing to call setLoggingLevel after the context is already created. It would be especially helpful in a REPL environment where logging can clutter up the terminal and make it confusing for the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9227) Add option to set logging level in Spark Context Constructor
Auberon López created SPARK-9227: Summary: Add option to set logging level in Spark Context Constructor Key: SPARK-9227 URL: https://issues.apache.org/jira/browse/SPARK-9227 Project: Spark Issue Type: Wish Reporter: Auberon López Priority: Minor It would be nice to be able to set the logging level in the constructor of a Spark Context. This provides a cleaner interface than needing to call setLoggingLevel after the context is already created. It would be especially helpful in a REPL environment where logging can clutter up the terminal and make it confusing for the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9131) UDFs change data values
[ https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635690#comment-14635690 ] Reynold Xin commented on SPARK-9131: I see. Even if we have a relatively large dataset, as long as we can reproduce it, it'd be great to have. UDFs change data values --- Key: SPARK-9131 URL: https://issues.apache.org/jira/browse/SPARK-9131 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.4.0, 1.4.1 Environment: Pyspark 1.4 and 1.4.1 Reporter: Luis Guerra Priority: Critical I am having some trouble when using a custom udf in dataframes with pyspark 1.4. I have rewritten the udf to simplify the problem, and it gets even weirder. The udfs I am using do absolutely nothing: they just receive some value and output the same value with the same format. I show you my code below:
{code}
c = a.join(b, a['ID'] == b['ID_new'], 'inner')
c.filter(c['ID'] == '62698917').show()

udf_A = UserDefinedFunction(lambda x: x, DateType())
udf_B = UserDefinedFunction(lambda x: x, DateType())
udf_C = UserDefinedFunction(lambda x: x, DateType())

d = c.select(c['ID'], c['t1'].alias('ta'),
             udf_A(vinc_muestra['t2']).alias('tb'),
             udf_B(vinc_muestra['t1']).alias('tc'),
             udf_C(vinc_muestra['t2']).alias('td'))
d.filter(d['ID'] == '62698917').show()
{code}
I am showing here the results from the outputs:
{code}
+--------+--------+----------+----------+
|      ID|  ID_new|        t1|        t2|
+--------+--------+----------+----------+
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-20|2013-02-20|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-28|2014-02-28|
|62698917|62698917|2012-02-20|2013-02-20|
+--------+--------+----------+----------+

+--------+----------+----------+----------+----------+
|      ID|        ta|        tb|        tc|        td|
+--------+----------+----------+----------+----------+
|62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
|62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
|62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
|62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
|62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
|62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
+--------+----------+----------+----------+----------+
{code}
The problem here is that the values in columns 'tb', 'tc' and 'td' in dataframe 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', even though my udfs do nothing. It seems as if the values were somehow taken from other records (or just invented). Results differ between executions (apparently at random). Thanks in advance -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9229) pyspark yarn-cluster PYSPARK_PYTHON not set
Eric Kimbrel created SPARK-9229: --- Summary: pyspark yarn-cluster PYSPARK_PYTHON not set Key: SPARK-9229 URL: https://issues.apache.org/jira/browse/SPARK-9229 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Environment: centos Reporter: Eric Kimbrel PYSPARK_PYTHON is set in spark-env.sh to use an alternative python installation. Use spark-submit to run a pyspark job in yarn with cluster deploy mode. PYSPARK_PYTHON is not set in the cluster environment, and the system default python is used instead of the intended original. test code: (simple.py)
{code}
from pyspark import SparkConf, SparkContext
import sys, os

conf = SparkConf()
sc = SparkContext(conf=conf)
out = [('PYTHON VERSION', str(sys.version))]
out.extend(zip(os.environ.keys(), os.environ.values()))
rdd = sc.parallelize(out)
rdd.coalesce(1).saveAsTextFile("hdfs://namenode/tmp/env")
{code}
submit command: spark-submit --master yarn --deploy-mode cluster --num-executors 1 simple.py I've also tried setting PYSPARK_PYTHON on the command line with no effect. It seems like there is no way to specify an alternative python executable in yarn-cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9229) pyspark yarn-cluster PYSPARK_PYTHON not set
[ https://issues.apache.org/jira/browse/SPARK-9229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Kimbrel updated SPARK-9229: Environment: centos Cloudera 5.4.1 based off Apache Hadoop 2.6.0, using spark 1.5.0 built for hadoop 2.6.0 from github master branch on 7.20.2015 (was: centos ) pyspark yarn-cluster PYSPARK_PYTHON not set Key: SPARK-9229 URL: https://issues.apache.org/jira/browse/SPARK-9229 Project: Spark Issue Type: Bug Affects Versions: 1.5.0 Environment: centos Cloudera 5.4.1 based off Apache Hadoop 2.6.0, using spark 1.5.0 built for hadoop 2.6.0 from github master branch on 7.20.2015 Reporter: Eric Kimbrel PYSPARK_PYTHON is set in spark-env.sh to use an alternative python installation. Use spark-submit to run a pyspark job in yarn with cluster deploy mode. PYSPARK_PYTHON is not set in the cluster environment, and the system default python is used instead of the intended original. test code: (simple.py)
{code}
from pyspark import SparkConf, SparkContext
import sys, os

conf = SparkConf()
sc = SparkContext(conf=conf)
out = [('PYTHON VERSION', str(sys.version))]
out.extend(zip(os.environ.keys(), os.environ.values()))
rdd = sc.parallelize(out)
rdd.coalesce(1).saveAsTextFile("hdfs://namenode/tmp/env")
{code}
submit command: spark-submit --master yarn --deploy-mode cluster --num-executors 1 simple.py I've also tried setting PYSPARK_PYTHON on the command line with no effect. It seems like there is no way to specify an alternative python executable in yarn-cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8357) Memory leakage on unsafe aggregation path with empty input
[ https://issues.apache.org/jira/browse/SPARK-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-8357. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7560 [https://github.com/apache/spark/pull/7560] Memory leakage on unsafe aggregation path with empty input -- Key: SPARK-8357 URL: https://issues.apache.org/jira/browse/SPARK-8357 Project: Spark Issue Type: Bug Components: SQL Reporter: Navis Assignee: Navis Priority: Critical Fix For: 1.5.0 Currently, the unsafe-based hash map is released on the 'next' call, but if the input is empty, it would never be called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9226) Change default log level to WARN in python REPL
Auberon López created SPARK-9226: Summary: Change default log level to WARN in python REPL Key: SPARK-9226 URL: https://issues.apache.org/jira/browse/SPARK-9226 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Auberon López Priority: Minor Fix For: 1.5.0 SPARK-7261 provides separate logging properties to be used in the Scala REPL, by default changing the logging level to WARN instead of INFO. The same improvement can be implemented for the Python REPL, which will make using PySpark interactively a cleaner experience that is closer to parity with the Scala shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9154) Implement code generation for StringFormat
[ https://issues.apache.org/jira/browse/SPARK-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635618#comment-14635618 ] Apache Spark commented on SPARK-9154: - User 'tarekauel' has created a pull request for this issue: https://github.com/apache/spark/pull/7571 Implement code generation for StringFormat -- Key: SPARK-9154 URL: https://issues.apache.org/jira/browse/SPARK-9154 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Tarek Auel Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9227) Add option to set logging level in Spark Context Constructor
[ https://issues.apache.org/jira/browse/SPARK-9227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635665#comment-14635665 ] Marcelo Vanzin commented on SPARK-9227: --- A programmatic API for this is overkill. If you want to do something, I'd suggest making the log level in the default log4j config a variable, so that you can override it by setting a system property. No need for API changes to make that work. Add option to set logging level in Spark Context Constructor Key: SPARK-9227 URL: https://issues.apache.org/jira/browse/SPARK-9227 Project: Spark Issue Type: Wish Reporter: Auberon López Priority: Minor It would be nice to be able to set the logging level in the constructor of a Spark Context. This provides a cleaner interface than needing to call setLoggingLevel after the context is already created. It would be especially helpful in a REPL environment where logging can clutter up the terminal and make it confusing for the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9228) Adjust Spark SQL Configs
Michael Armbrust created SPARK-9228: --- Summary: Adjust Spark SQL Configs Key: SPARK-9228 URL: https://issues.apache.org/jira/browse/SPARK-9228 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Before QA, let's flip on features and consolidate unsafe and codegen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
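For reference, a sketch of flipping the two flags in question; the conf key names are assumptions from the 1.5 development line, and consolidation would presumably fold them under a single switch:
{code}
// sqlContext is the usual SQLContext handle; key names are assumptions,
// not confirmed by this ticket
sqlContext.setConf("spark.sql.codegen", "true")        // expression code generation
sqlContext.setConf("spark.sql.unsafe.enabled", "true") // unsafe row / memory path
{code}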
[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames
[ https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635597#comment-14635597 ] Michael Armbrust commented on SPARK-8007: - I'm going to propose that we don't change the analyzer, but instead just use functions for all the cases that were specified. This is nice because we can never be ambiguous with a user column. Support resolving virtual columns in DataFrames --- Key: SPARK-8007 URL: https://issues.apache.org/jira/browse/SPARK-8007 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Joseph Batchik Create the infrastructure so we can resolve df(SPARK__PARTITION__ID) to the SparkPartitionID expression. A cool use case is to understand physical data skew: {code} df.groupBy(SPARK__PARTITION__ID).count() {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
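A sketch of the function-based approach Michael describes: rather than teaching the analyzer to resolve a magic column name, expose the expression through an ordinary function so it can never collide with a user column. The helper name is hypothetical, SparkPartitionID is the expression named in the issue, and the sketch assumes the expression is visible to the caller:
{code}
import org.apache.spark.sql.Column
import org.apache.spark.sql.execution.expressions.SparkPartitionID

// hypothetical helper: wrap the expression in a Column instead of
// resolving a virtual column name in the analyzer
def sparkPartitionId(): Column = new Column(SparkPartitionID)

// the physical-skew check from the description, function style
val df = sqlContext.range(0, 1000).toDF("id")
df.groupBy(sparkPartitionId()).count().show()
{code}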
[jira] [Updated] (SPARK-9213) Improve regular expression performance (via joni)
[ https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-9213: --- Description: I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala was: I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Improve regular expression performance (via joni) - Key: SPARK-9213 URL: https://issues.apache.org/jira/browse/SPARK-9213 Project: Spark Issue Type: Umbrella Components: SQL Reporter: Reynold Xin I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons: 1. Java regex in general is slow. 2. We have to convert everything from UTF8 encoded bytes into Java String, and then run regex on it, and then convert it back. There are libraries in Java that provide regex support directly on UTF8 encoded bytes. One prominent example is joni, used in JRuby. Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
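For a feel of the byte-level API, a minimal joni sketch (the pattern and input are arbitrary; the point is that matching runs directly over UTF-8 bytes, with no String round trip):
{code}
import org.jcodings.specific.UTF8Encoding
import org.joni.{Option => JOption, Regex}

val pattern = "fo.*ar".getBytes("UTF-8")
val regex = new Regex(pattern, 0, pattern.length, JOption.NONE, UTF8Encoding.INSTANCE)

val input = "foobar".getBytes("UTF-8") // e.g. the bytes backing a UTF8String
val matcher = regex.matcher(input)
// byte offset of the first match, or -1 if there is none
val pos = matcher.search(0, input.length, JOption.DEFAULT)
{code}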
[jira] [Commented] (SPARK-8103) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures
[ https://issues.apache.org/jira/browse/SPARK-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635663#comment-14635663 ] Apache Spark commented on SPARK-8103: - User 'markhamstra' has created a pull request for this issue: https://github.com/apache/spark/pull/7572 DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures --- Key: SPARK-8103 URL: https://issues.apache.org/jira/browse/SPARK-8103 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.4.0 Reporter: Imran Rashid Assignee: Imran Rashid Fix For: 1.5.0 When there is a fetch failure, {{DAGScheduler}} is supposed to fail the stage, retry the necessary portions of the preceding shuffle stage which generated the shuffle data, and eventually rerun the stage. We generally expect to get multiple fetch failures together, but only want to re-start the stage once. The code already makes an attempt to address this https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1108 .
{code}
// It is likely that we receive multiple FetchFailed for a single stage (because we have
// multiple tasks running concurrently on different executors). In that case, it is possible
// the fetch failure has already been handled by the scheduler.
if (runningStages.contains(failedStage)) {
{code}
However, this logic is flawed because the stage may have been **resubmitted** by the time we get these fetch failures. In that case, {{runningStages.contains(failedStage)}} will be true, but we've already handled these failures. This results in multiple concurrent non-zombie attempts for one stage. In addition to being very confusing and a waste of resources, this can also lead to later stages being submitted before the previous stage has registered its map output. This happens because (a) when one attempt finishes all its tasks, it may not register its map output because the stage still has pending tasks from other attempts https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1046
{code}
if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) {
{code}
and (b) {{submitStage}} thinks the following stage is ready to go, because {{getMissingParentStages}} thinks the stage is complete as long as it has all of its map outputs: https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L397
{code}
if (!mapStage.isAvailable) { missing += mapStage }
{code}
So the following stage is submitted repeatedly, but it is doomed to fail because its shuffle output has never been registered with the map output tracker. Here's an example failure in this case:
{noformat}
WARN TaskSetManager: Lost task 5.0 in stage 3.2 (TID 294, 192.168.1.104): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=5, message= org.apache.spark.shuffle.MetadataFetchFailedException: Missing output locations for shuffle ...
{noformat}
Note that this is a subset of the problems originally described in SPARK-7308, limited to just the issues affecting the DAGScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
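To make the de-duplication idea concrete, a self-contained toy model (all names hypothetical; this is not the actual patch): track the latest attempt per stage and ignore FetchFailed events from older attempts, so a stage that has already been resubmitted is not failed again by stragglers:
{code}
import scala.collection.mutable

case class FetchFailure(stageId: Int, stageAttemptId: Int)

class ToyScheduler {
  private val latestAttempt = mutable.Map.empty[Int, Int].withDefaultValue(0)
  private val running = mutable.Set.empty[Int]

  def handleFetchFailure(f: FetchFailure): Unit = {
    // stale: the stage was already resubmitted under a newer attempt id
    val stale = f.stageAttemptId != latestAttempt(f.stageId)
    if (stale || !running.contains(f.stageId)) return
    running -= f.stageId
    latestAttempt(f.stageId) += 1 // the resubmission runs as a new attempt
    // ... resubmit the parent map stage, then re-run this stage ...
  }
}
{code}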
[jira] [Created] (SPARK-9233) Enable code-gen in window function unit tests
Yin Huai created SPARK-9233: --- Summary: Enable code-gen in window function unit tests Key: SPARK-9233 URL: https://issues.apache.org/jira/browse/SPARK-9233 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Right now, our {{HiveWindowFunctionQuerySuite.scala}} sets code-gen to false. Since code-gen is now enabled by default, we need to enable it for the tests in this file and fix any bugs we find. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9222) Make class instantiation variables in DistributedLDAModel private[clustering]
[ https://issues.apache.org/jira/browse/SPARK-9222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635729#comment-14635729 ] Apache Spark commented on SPARK-9222: - User 'MechCoder' has created a pull request for this issue: https://github.com/apache/spark/pull/7573 Make class instantiation variables in DistributedLDAModel private[clustering] -- Key: SPARK-9222 URL: https://issues.apache.org/jira/browse/SPARK-9222 Project: Spark Issue Type: Test Components: MLlib Reporter: Manoj Kumar Priority: Minor This would enable testing the various class variables, such as docConcentration and topicConcentration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
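A sketch of the visibility change (the constructor parameter list is abridged and partly an assumption), making the fields readable from unit tests in the same package:
{code}
package org.apache.spark.mllib.clustering

import org.apache.spark.mllib.linalg.Vector

// package-private constructor and members: tests under
// org.apache.spark.mllib.clustering can read these directly
class DistributedLDAModel private[clustering] (
    private[clustering] val docConcentration: Vector,
    private[clustering] val topicConcentration: Double)
{code}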