[jira] [Resolved] (SPARK-9081) fillna/dropna should also fill/drop NaN values in addition to null values

2015-07-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9081.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7523
[https://github.com/apache/spark/pull/7523]

 fillna/dropna should also fill/drop NaN values in addition to null values
 -

 Key: SPARK-9081
 URL: https://issues.apache.org/jira/browse/SPARK-9081
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker
 Fix For: 1.5.0









[jira] [Updated] (SPARK-8915) Add @since tags to mllib.classification

2015-07-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8915:
-
Assignee: Xiangrui Meng

 Add @since tags to mllib.classification
 ---

 Key: SPARK-8915
 URL: https://issues.apache.org/jira/browse/SPARK-8915
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
  Labels: starter
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h








[jira] [Commented] (SPARK-8641) Native Spark Window Functions

2015-07-21 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635205#comment-14635205
 ] 

Herman van Hovell commented on SPARK-8641:
--

We need to wait for the new UDAF interface to stabilize. Special attention 
needs to be paid to the following aspects:
* Hive UDAFs
* Differences in processing an AlgebraicAggregate, an AggregateFunction2, and 
(potentially) an AggregateFunction
* Common aggregate processing functionality.

 Native Spark Window Functions
 -

 Key: SPARK-8641
 URL: https://issues.apache.org/jira/browse/SPARK-8641
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Herman van Hovell

 The current Window implementation uses Hive UDAFs for all aggregation 
 operations. In this ticket we will move this functionality to Native Spark 
 Expressions. The rationale for this is that although Hive UDAFs are very well 
 written, they remain opaque in processing and memory management; this makes 
 them hard to optimize.
 This ticket and its PR will build on the work being done in SPARK-4366.
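
 A hedged usage sketch for context: the user-facing window API introduced in 1.4 stays the same, and this ticket only swaps the aggregation machinery underneath. Column names below are illustrative.
 {code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// The visible API is unchanged; only the implementation of the aggregates
// backing .over(...) would move from Hive UDAFs to native expressions.
def deptAverages(df: DataFrame): DataFrame = {
  val w = Window.partitionBy("department").orderBy("salary")
  df.select(col("name"), col("salary"), avg(col("salary")).over(w).as("dept_avg"))
}
 {code}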






[jira] [Updated] (SPARK-9019) spark-submit fails on yarn with kerberos enabled

2015-07-21 Thread Bolke de Bruin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bolke de Bruin updated SPARK-9019:
--
Attachment: debug-log-spark-1.5-fail
spark-submit-log-1.5.0-fail

 spark-submit fails on yarn with kerberos enabled
 

 Key: SPARK-9019
 URL: https://issues.apache.org/jira/browse/SPARK-9019
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.5.0
 Environment: Hadoop 2.6 with YARN and kerberos enabled
Reporter: Bolke de Bruin
  Labels: kerberos, spark-submit, yarn
 Attachments: debug-log-spark-1.5-fail, spark-submit-log-1.5.0-fail


 It is not possible to run jobs using spark-submit on yarn with a kerberized 
 cluster. 
 Commandline:
 /usr/hdp/2.2.0.0-2041/spark-1.5.0/bin/spark-submit --principal sparkjob 
 --keytab sparkjob.keytab --num-executors 3 --executor-cores 5 
 --executor-memory 5G --master yarn-cluster /tmp/get_peers.py 
 Fails with:
 15/07/13 22:48:31 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/07/13 22:48:31 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:58380
 15/07/13 22:48:31 INFO util.Utils: Successfully started service 'SparkUI' on 
 port 58380.
 15/07/13 22:48:31 INFO ui.SparkUI: Started SparkUI at 
 http://10.111.114.9:58380
 15/07/13 22:48:31 INFO cluster.YarnClusterScheduler: Created 
 YarnClusterScheduler
 15/07/13 22:48:31 WARN metrics.MetricsSystem: Using default name DAGScheduler 
 for source because spark.app.id is not set.
 15/07/13 22:48:32 INFO util.Utils: Successfully started service 
 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43470.
 15/07/13 22:48:32 INFO netty.NettyBlockTransferService: Server created on 
 43470
 15/07/13 22:48:32 INFO storage.BlockManagerMaster: Trying to register 
 BlockManager
 15/07/13 22:48:32 INFO storage.BlockManagerMasterEndpoint: Registering block 
 manager 10.111.114.9:43470 with 265.1 MB RAM, BlockManagerId(driver, 
 10.111.114.9, 43470)
 15/07/13 22:48:32 INFO storage.BlockManagerMaster: Registered BlockManager
 15/07/13 22:48:32 INFO impl.TimelineClientImpl: Timeline service address: 
 http://lxhnl002.ad.ing.net:8188/ws/v1/timeline/
 15/07/13 22:48:33 WARN ipc.Client: Exception encountered while connecting to 
 the server : org.apache.hadoop.security.AccessControlException: Client cannot 
 authenticate via:[TOKEN, KERBEROS]
 15/07/13 22:48:33 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
 to rm2
 15/07/13 22:48:33 INFO retry.RetryInvocationHandler: Exception while invoking 
 getClusterNodes of class ApplicationClientProtocolPBClientImpl over rm2 after 
 1 fail over attempts. Trying to fail over after sleeping for 32582ms.
 java.net.ConnectException: Call From lxhnl006.ad.ing.net/10.111.114.9 to 
 lxhnl013.ad.ing.net:8032 failed on connection exception: 
 java.net.ConnectException: Connection refused; For more details see:  
 http://wiki.apache.org/hadoop/ConnectionRefused
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
   at org.apache.hadoop.ipc.Client.call(Client.java:1472)
   at org.apache.hadoop.ipc.Client.call(Client.java:1399)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
   at com.sun.proxy.$Proxy24.getClusterNodes(Unknown Source)
   at 
 org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:262)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
   at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
   at com.sun.proxy.$Proxy25.getClusterNodes(Unknown Source)
   at 
 org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getNodeReports(YarnClientImpl.java:475)
   at 
 org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend$$anonfun$getDriverLogUrls$1.apply(YarnClusterSchedulerBackend.scala:92)
   at 
 

[jira] [Resolved] (SPARK-9193) Avoid assigning tasks to executors under killing

2015-07-21 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-9193.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7528
[https://github.com/apache/spark/pull/7528]

 Avoid assigning tasks to executors under killing
 

 Key: SPARK-9193
 URL: https://issues.apache.org/jira/browse/SPARK-9193
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.4.0, 1.4.1
Reporter: Jie Huang
Assignee: Jie Huang
 Fix For: 1.5.0


 Currently, when some executors are killed by dynamic allocation, tasks are 
 sometimes mis-assigned to those soon-to-be-lost executors. Such mis-assignment 
 causes task failures, or even a job failure if the error repeats four times.
 The root cause is that killExecutors doesn't remove the executors under 
 killing right away; it relies on the later OnDisassociated event to refresh the 
 active executor list. The delay really depends on your cluster status (from 
 several milliseconds to sub-minute). Tasks scheduled during that window get 
 assigned to executors that are still listed as active but are being killed, and 
 they then fail due to executor loss. A better way is to exclude the executors 
 under killing in makeOffers(), so that no more tasks are allocated to executors 
 that are about to be lost.
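
 A minimal sketch of the proposed filtering, using hypothetical simplified state (the real code lives in CoarseGrainedSchedulerBackend): executors that are pending removal are skipped before resource offers are built.
 {code}
case class WorkerOffer(executorId: String, host: String, cores: Int)

// activeExecutors: executor id -> (host, free cores); both maps are stand-ins
// for the scheduler backend's real bookkeeping.
def makeOffers(activeExecutors: Map[String, (String, Int)],
               executorsPendingToRemove: Set[String]): Seq[WorkerOffer] =
  activeExecutors
    .filterKeys(id => !executorsPendingToRemove.contains(id)) // skip executors under killing
    .map { case (id, (host, cores)) => WorkerOffer(id, host, cores) }
    .toSeq
 {code}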






[jira] [Commented] (SPARK-9220) Streaming K-means implementation exception while processing windowed stream

2015-07-21 Thread Iaroslav Zeigerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635240#comment-14635240
 ] 

Iaroslav Zeigerman commented on SPARK-9220:
---

Looks like the issue reproduces only when the training and test data streams are 
linked to the same directory. Can someone confirm whether this causes the issue?

 Streaming K-means implementation exception while processing windowed stream
 ---

 Key: SPARK-9220
 URL: https://issues.apache.org/jira/browse/SPARK-9220
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Streaming
Affects Versions: 1.4.1
Reporter: Iaroslav Zeigerman

 Spark throws an exception when the Streaming K-means algorithm trains on a 
 windowed stream. The stream looks like the following:
 {{val trainingSet = 
 ssc.textFileStream(TrainingDataSet).window(Seconds(30))...}}
 The exception occurs when there is no new data in the stream. Here is the 
 exception:
 15/07/21 17:36:08 ERROR JobScheduler: Error running job streaming job 
 1437489368000 ms.0
 java.lang.ArrayIndexOutOfBoundsException: 13
   at 
 org.apache.spark.mllib.clustering.StreamingKMeansModel$$anonfun$update$1.apply(StreamingKMeans.scala:105)
   at 
 org.apache.spark.mllib.clustering.StreamingKMeansModel$$anonfun$update$1.apply(StreamingKMeans.scala:102)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at 
 org.apache.spark.mllib.clustering.StreamingKMeansModel.update(StreamingKMeans.scala:102)
   at 
 org.apache.spark.mllib.clustering.StreamingKMeans$$anonfun$trainOn$1.apply(StreamingKMeans.scala:235)
   at 
 org.apache.spark.mllib.clustering.StreamingKMeans$$anonfun$trainOn$1.apply(StreamingKMeans.scala:234)
   at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
   at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
   at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
   at 
 org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
   at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
   at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
   at 
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
   at scala.util.Try$.apply(Try.scala:161)
   at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
   at 
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
   at 
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
   at 
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
   at 
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 When the new data arrives the algorithm works as expected.
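
 For reference, a hedged sketch of the setup the comment above suggests checking: point the training and test streams at different directories (paths and parameters here are hypothetical).
 {code}
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.{Seconds, StreamingContext}

def setup(ssc: StreamingContext): Unit = {
  // Training and test data come from separate directories.
  val trainingData = ssc.textFileStream("hdfs:///data/train")
                        .window(Seconds(30))
                        .map(Vectors.parse)
  val testData     = ssc.textFileStream("hdfs:///data/test")
                        .map(LabeledPoint.parse)

  val model = new StreamingKMeans()
    .setK(5)
    .setDecayFactor(1.0)
    .setRandomCenters(10, 0.0)   // 10-dimensional random initial centers

  model.trainOn(trainingData)
  model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
}
 {code}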






[jira] [Resolved] (SPARK-9168) Add nanvl expression

2015-07-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-9168.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7523
[https://github.com/apache/spark/pull/7523]

 Add nanvl expression
 

 Key: SPARK-9168
 URL: https://issues.apache.org/jira/browse/SPARK-9168
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Yijie Shen
 Fix For: 1.5.0


 Similar to Oracle's nanvl:
 nanvl(v1, v2)
 if v1 is NaN, returns v2; otherwise, returns v1.
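
 A hedged usage sketch, assuming the expression is exposed in org.apache.spark.sql.functions as of 1.5 (column name is hypothetical): NaN readings fall back to 0.0.
 {code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, nanvl}

// Replace NaN values in the "reading" column with 0.0, keeping other values.
def cleanReadings(df: DataFrame): DataFrame =
  df.select(nanvl(col("reading"), lit(0.0)).as("reading"))
 {code}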






[jira] [Created] (SPARK-9221) Support IntervalType in Range Frame

2015-07-21 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-9221:


 Summary: Support IntervalType in Range Frame
 Key: SPARK-9221
 URL: https://issues.apache.org/jira/browse/SPARK-9221
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.4.0
Reporter: Herman van Hovell


Support the IntervalType in window range frames, as mentioned in the conclusion 
of the Databricks blog 
[post|https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html]
 on window functions.

This actually requires us to support Literals instead of Integer constants in 
Range Frames. The following things will have to be modified:
* org.apache.spark.sql.hive.HiveQl
* org.apache.spark.sql.catalyst.expressions.SpecifiedWindowFrame
* org.apache.spark.sql.execution.Window
* org.apache.spark.sql.expressions.Window
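
A purely hypothetical sketch of what the change would enable once interval literals are accepted as range-frame boundaries (table, columns, and syntax are not final):
{code}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Proposed usage: a range frame bounded by a time interval instead of a row count.
def weeklyRevenue(sqlContext: SQLContext): DataFrame =
  sqlContext.sql(
    """SELECT site, ts, revenue,
      |       sum(revenue) OVER (PARTITION BY site ORDER BY ts
      |                          RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW)
      |         AS weekly_revenue
      |FROM sales""".stripMargin)
{code}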






[jira] [Updated] (SPARK-8915) Add @since tags to mllib.classification

2015-07-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8915:
-
Assignee: Patrick Baier  (was: Xiangrui Meng)

 Add @since tags to mllib.classification
 ---

 Key: SPARK-8915
 URL: https://issues.apache.org/jira/browse/SPARK-8915
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Patrick Baier
Priority: Minor
  Labels: starter
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h








[jira] [Updated] (SPARK-8915) Add @since tags to mllib.classification

2015-07-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8915:
-
Shepherd: DB Tsai

 Add @since tags to mllib.classification
 ---

 Key: SPARK-8915
 URL: https://issues.apache.org/jira/browse/SPARK-8915
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Patrick Baier
Priority: Minor
  Labels: starter
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h








[jira] [Updated] (SPARK-8922) Add @since tags to mllib.evaluation

2015-07-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8922:
-
Shepherd: Shuo Xiang

 Add @since tags to mllib.evaluation
 ---

 Key: SPARK-8922
 URL: https://issues.apache.org/jira/browse/SPARK-8922
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
   Original Estimate: 1h
  Remaining Estimate: 1h








[jira] [Commented] (SPARK-9121) Get rid of the warnings about `no visible global function definition` in SparkR

2015-07-21 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635300#comment-14635300
 ] 

Shivaram Venkataraman commented on SPARK-9121:
--

Yeah, we can add `install-dev.sh` in Jenkins before dev/lint-r. One unfortunate 
thing is that we typically do a lint check before we run the rest of the 
Jenkins tests (build, unit tests, etc.), so it would be good not to have that 
order reversed, I guess.

 Get rid of the warnings about `no visible global function definition` in 
 SparkR
 ---

 Key: SPARK-9121
 URL: https://issues.apache.org/jira/browse/SPARK-9121
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa

 We have a lot of warnings about {{no visible global function definition}} in 
 SparkR. So we should get rid of them.
 {noformat}
 R/utils.R:513:5: warning: no visible global function definition for 
 ‘processClosure’
 processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv)
 ^~
 {noformat}






[jira] [Updated] (SPARK-9210) checkValidAggregateExpression() throws exceptions with bad error messages

2015-07-21 Thread Simeon Simeonov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simeon Simeonov updated SPARK-9210:
---
Description: 
When a result column in {{SELECT ... GROUP BY}} is neither one of the {{GROUP 
BY}} expressions nor uses an aggregation function, 
{{org.apache.spark.sql.catalyst.analysis.CheckAnalysis}} throws 
{{org.apache.spark.sql.AnalysisException}} with the message expression 
'_column expression_' is neither present in the group by, nor is it an 
aggregate function. Add to group by or wrap in first() if you don't care which 
value you get.

The remedy suggestion in the exception message is incorrect: the function name 
is {{first_value}}, not {{first}}.

  was:
When a result column in {{SELECT ... GROUP BY}} is neither one of the {{GROUP 
BY}} expressions nor uses an aggregation function, 
{{org.apache.spark.sql.catalyst.analysis.CheckAnalysis}} throws 
{{org.apache.spark.sql.AnalysisException}} with the message expression 
'_column expression_' is neither present in the group by, nor is it an 
aggregate function. Add to group by or wrap in first() if you don't care which 
value you get.

The remedy suggestion in the exception message incorrect: the function name is 
{{first_value}}, not {{first}}.


 checkValidAggregateExpression() throws exceptions with bad error messages
 -

 Key: SPARK-9210
 URL: https://issues.apache.org/jira/browse/SPARK-9210
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: N/A
Reporter: Simeon Simeonov
Priority: Trivial

 When a result column in {{SELECT ... GROUP BY}} is neither one of the {{GROUP 
 BY}} expressions nor uses an aggregation function, 
 {{org.apache.spark.sql.catalyst.analysis.CheckAnalysis}} throws 
 {{org.apache.spark.sql.AnalysisException}} with the message expression 
 '_column expression_' is neither present in the group by, nor is it an 
 aggregate function. Add to group by or wrap in first() if you don't care 
 which value you get.
 The remedy suggestion in the exception message is incorrect: the function 
 name is {{first_value}}, not {{first}}.
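
 For reference, a minimal query (hypothetical table and columns) that triggers the message in question: "name" is neither grouped on nor wrapped in an aggregate function.
 {code}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Analysis fails with the "neither present in the group by, nor is it an
// aggregate function" message because of the bare "name" column.
def badGroupBy(sqlContext: SQLContext): DataFrame =
  sqlContext.sql("SELECT dept, name, count(*) FROM employees GROUP BY dept")
 {code}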






[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column

2015-07-21 Thread Joseph Batchik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636180#comment-14636180
 ] 

Joseph Batchik commented on SPARK-8668:
---

Does this look like what you were thinking?

https://github.com/JDrit/spark/commit/7fcf18a11427709d403418da8d444b434c63

 expr function to convert SQL expression into a Column
 -

 Key: SPARK-8668
 URL: https://issues.apache.org/jira/browse/SPARK-8668
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 selectExpr uses the expression parser to parse string expressions. It would be 
 great to create an expr function in functions.scala/functions.py that 
 converts a string into an expression (or a list of expressions separated by 
 commas).
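
 A hedged usage sketch, assuming the function lands in functions.scala as proposed (column names are hypothetical): each string is parsed into a Column.
 {code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// Each string expression is parsed once and used like any other Column.
def withParsedColumns(df: DataFrame): DataFrame =
  df.select(expr("price * quantity"), expr("upper(name)"))
 {code}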






[jira] [Created] (SPARK-9241) Supporting multiple DISTINCT columns

2015-07-21 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9241:
---

 Summary: Supporting multiple DISTINCT columns
 Key: SPARK-9241
 URL: https://issues.apache.org/jira/browse/SPARK-9241
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Critical


Right now the new aggregation code path only supports a single distinct column 
(you can use it in multiple aggregate functions in the query). We need to 
support multiple distinct columns by generating a different plan for handling 
multiple distinct columns (without changing aggregate functions).
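
A hedged illustration of the limitation (the "visits" table and its columns are hypothetical): a single distinct column may appear in several aggregates, but two different distinct columns in one query need the new plan this ticket proposes.
{code}
import org.apache.spark.sql.SQLContext

def examples(sqlContext: SQLContext): Unit = {
  // Supported today: one distinct column, used by multiple aggregate functions.
  sqlContext.sql("SELECT count(DISTINCT user_id), avg(DISTINCT user_id) FROM visits")
  // Needs this ticket: two different distinct columns in the same query.
  sqlContext.sql("SELECT count(DISTINCT user_id), count(DISTINCT page_id) FROM visits")
}
{code}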






[jira] [Created] (SPARK-9242) Audit both built-in aggregate function and UDAF interface before 1.5.0 release

2015-07-21 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9242:
---

 Summary: Audit both built-in aggregate function and UDAF interface 
before 1.5.0 release
 Key: SPARK-9242
 URL: https://issues.apache.org/jira/browse/SPARK-9242
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Blocker









[jira] [Commented] (SPARK-8668) expr function to convert SQL expression into a Column

2015-07-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636324#comment-14636324
 ] 

Reynold Xin commented on SPARK-8668:


Yes - the only thing is that you cannot split blindly on commas, since commas can 
appear inside quotes. I think it is OK for the first cut not to support a list of 
expressions separated by commas.


 expr function to convert SQL expression into a Column
 -

 Key: SPARK-8668
 URL: https://issues.apache.org/jira/browse/SPARK-8668
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 selectExpr uses the expression parser to parse string expressions. It would be 
 great to create an expr function in functions.scala/functions.py that 
 converts a string into an expression (or a list of expressions separated by 
 commas).






[jira] [Created] (SPARK-9237) Added Top N Column Values for DataFrames

2015-07-21 Thread Ted Malaska (JIRA)
Ted Malaska created SPARK-9237:
--

 Summary: Added Top N Column Values for DataFrames
 Key: SPARK-9237
 URL: https://issues.apache.org/jira/browse/SPARK-9237
 Project: Spark
  Issue Type: Improvement
Reporter: Ted Malaska
Priority: Minor


This JIRA is to add a very common data quality check to DataFrames.

A quick outline of this functionality can be seen in the following blog post:
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

There are two parts to this JIRA.
1. How to implement the Top N count. I will start with the implementation in 
the blog; a rough DataFrame-based sketch is shown below.
2. Where to add the function: either straight off DataFrame, in DataFrame 
describe, or in DataFrameStatFunctions. I will start by putting it into 
DataFrameStatFunctions.
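
A rough, hypothetical sketch of one way to express a top-N value count with plain DataFrame operations (not necessarily how the final API will look):
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, desc}

// Most frequent n values of a column, with their counts.
def topNValues(df: DataFrame, column: String, n: Int): DataFrame =
  df.groupBy(col(column))
    .agg(count(col(column)).as("cnt"))
    .orderBy(desc("cnt"))
    .limit(n)
{code}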

Please let me know if you have any input.

Thanks






[jira] [Assigned] (SPARK-3056) Sort-based Aggregation

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3056:
---

Assignee: Apache Spark

 Sort-based Aggregation
 --

 Key: SPARK-3056
 URL: https://issues.apache.org/jira/browse/SPARK-3056
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Assignee: Apache Spark

 Currently, SparkSQL only supports hash-based aggregation, which may cause 
 OOM if there are too many identical keys in the input tuples.






[jira] [Commented] (SPARK-3947) Support UDAF

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636199#comment-14636199
 ] 

Apache Spark commented on SPARK-3947:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/7458

 Support UDAF
 

 Key: SPARK-3947
 URL: https://issues.apache.org/jira/browse/SPARK-3947
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Pei-Lun Lee
Assignee: Yin Huai

 Right now only Hive UDAFs are supported. It would be nice to have UDAF 
 similar to UDF through SQLContext.registerFunction.






[jira] [Commented] (SPARK-3056) Sort-based Aggregation

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636200#comment-14636200
 ] 

Apache Spark commented on SPARK-3056:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/7458

 Sort-based Aggregation
 --

 Key: SPARK-3056
 URL: https://issues.apache.org/jira/browse/SPARK-3056
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Currently, SparkSQL only supports hash-based aggregation, which may cause 
 OOM if there are too many identical keys in the input tuples.






[jira] [Commented] (SPARK-4367) Partial aggregation support the DISTINCT aggregation

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636198#comment-14636198
 ] 

Apache Spark commented on SPARK-4367:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/7458

 Partial aggregation support the DISTINCT aggregation
 

 Key: SPARK-4367
 URL: https://issues.apache.org/jira/browse/SPARK-4367
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Most aggregate functions (e.g. average) over distinct values require 
 all of the records in the same group to be shuffled to a single node. 
 However, as part of the optimization, those records can be partially 
 aggregated before shuffling, which probably reduces the overhead of shuffling 
 significantly. 
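
 A hedged RDD-level sketch of the idea for a COUNT(DISTINCT value)-per-key workload (the data shape is hypothetical): values are de-duplicated map-side into small sets, so far less data crosses the shuffle before the final merge.
 {code}
import org.apache.spark.rdd.RDD

def countDistinctPerKey(pairs: RDD[(String, String)]): RDD[(String, Int)] =
  pairs
    .aggregateByKey(Set.empty[String])(
      (set, v) => set + v,   // partial (map-side) aggregation: de-duplicate locally
      (a, b) => a ++ b)      // merge partial sets after the shuffle
    .mapValues(_.size)
 {code}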






[jira] [Assigned] (SPARK-4367) Partial aggregation support the DISTINCT aggregation

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4367:
---

Assignee: Apache Spark

 Partial aggregation support the DISTINCT aggregation
 

 Key: SPARK-4367
 URL: https://issues.apache.org/jira/browse/SPARK-4367
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Assignee: Apache Spark

 Most aggregate functions (e.g. average) over distinct values require 
 all of the records in the same group to be shuffled to a single node. 
 However, as part of the optimization, those records can be partially 
 aggregated before shuffling, which probably reduces the overhead of shuffling 
 significantly. 






[jira] [Commented] (SPARK-4233) Simplify the Aggregation Function implementation

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636197#comment-14636197
 ] 

Apache Spark commented on SPARK-4233:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/7458

 Simplify the Aggregation Function implementation
 

 Key: SPARK-4233
 URL: https://issues.apache.org/jira/browse/SPARK-4233
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Currently, the UDAF implementation is quite complicated, and we have to 
 provide distinct and non-distinct versions.






[jira] [Updated] (SPARK-9243) Update crosstab doc for pairs that have no occurrences

2015-07-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9243:
-
Component/s: Documentation

 Update crosstab doc for pairs that have no occurrences
 --

 Key: SPARK-9243
 URL: https://issues.apache.org/jira/browse/SPARK-9243
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark, SparkR, SQL
Affects Versions: 1.5.0
Reporter: Xiangrui Meng

 The crosstab value for pairs that have no occurrences was changed from null 
 to 0 in SPARK-7982. We should update the doc in Scala, Python, and SparkR.
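
 A hedged illustration of the documented behavior change (column names are hypothetical): since SPARK-7982, cells for pairs that never co-occur contain 0 rather than null in the resulting contingency table.
 {code}
import org.apache.spark.sql.DataFrame

// Contingency table of two categorical columns; absent pairs now show 0.
def browserByCountry(df: DataFrame): DataFrame =
  df.stat.crosstab("browser", "country")
 {code}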






[jira] [Created] (SPARK-9243) Update crosstab doc for pairs that have no occurrences

2015-07-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-9243:


 Summary: Update crosstab doc for pairs that have no occurrences
 Key: SPARK-9243
 URL: https://issues.apache.org/jira/browse/SPARK-9243
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SparkR, SQL
Affects Versions: 1.5.0
Reporter: Xiangrui Meng


The crosstab value for pairs that have no occurrences was changed from null to 
0 in SPARK-7982. We should update the doc in Scala, Python, and SparkR.






[jira] [Resolved] (SPARK-8915) Add @since tags to mllib.classification

2015-07-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-8915.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7371
[https://github.com/apache/spark/pull/7371]

 Add @since tags to mllib.classification
 ---

 Key: SPARK-8915
 URL: https://issues.apache.org/jira/browse/SPARK-8915
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h








[jira] [Closed] (SPARK-9036) SparkListenerExecutorMetricsUpdate messages not included in JsonProtocol

2015-07-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-9036.

  Resolution: Fixed
   Fix Version/s: 1.5.0
Target Version/s: 1.5.0

 SparkListenerExecutorMetricsUpdate messages not included in JsonProtocol
 

 Key: SPARK-9036
 URL: https://issues.apache.org/jira/browse/SPARK-9036
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0, 1.4.1
Reporter: Ryan Williams
Priority: Minor
 Fix For: 1.5.0


 The JsonProtocol added in SPARK-3454 [doesn't 
 include|https://github.com/apache/spark/blob/v1.4.1-rc4/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L95-L96]
  code for ser/de of 
 [{{SparkListenerExecutorMetricsUpdate}}|https://github.com/apache/spark/blob/v1.4.1-rc4/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L107-L110]
  messages.
 The comment notes that they are not used, which presumably refers to the 
 fact that the [{{EventLoggingListener}} doesn't write these 
 events|https://github.com/apache/spark/blob/v1.4.1-rc4/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L200-L201].
 However, individual listeners can and should make that determination for 
 themselves; I have recently written custom listeners that would like to 
 consume metrics-update messages as JSON, so it would be nice to round out the 
 JsonProtocol implementation by supporting them.
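
 A hedged sketch of the kind of custom listener the reporter describes (the class name is made up): it already receives executor metrics updates directly; JsonProtocol support would add a standard way to serialize the event rather than handling it ad hoc.
 {code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorMetricsUpdate}

class MetricsUpdateLogger extends SparkListener {
  override def onExecutorMetricsUpdate(
      update: SparkListenerExecutorMetricsUpdate): Unit = {
    // A real listener would serialize `update`; here we just log a summary.
    println(s"metrics update from executor ${update.execId}: " +
      s"${update.taskMetrics.size} task(s) reporting")
  }
}
 {code}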






[jira] [Closed] (SPARK-5423) ExternalAppendOnlyMap won't delete temp spilled file if some exception happens during using it

2015-07-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5423.

  Resolution: Fixed
   Fix Version/s: 1.5.0
Target Version/s: 1.5.0

 ExternalAppendOnlyMap won't delete temp spilled file if some exception 
 happens during using it 
 ---

 Key: SPARK-5423
 URL: https://issues.apache.org/jira/browse/SPARK-5423
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 1.0.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
 Fix For: 1.5.0


 ExternalAppendOnlyMap won't delete the temp spilled file if an exception 
 happens while it is being used.
 There is already a TODO in the comment:
 {code}
 // TODO: Ensure this gets called even if the iterator isn't drained.
 private def cleanup() {
   batchIndex = batchOffsets.length  // Prevent reading any other batch
   val ds = deserializeStream
   deserializeStream = null
   fileStream = null
   ds.close()
   file.delete()
 }
 {code}
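
 A hedged sketch of one way to honor that TODO (parameter names are hypothetical): register the cleanup with the running task, so the stream is closed and the spill file deleted even when the iterator is abandoned after an exception.
 {code}
import java.io.{Closeable, File}
import org.apache.spark.TaskContext

def registerSpillCleanup(deserializeStream: Closeable, spillFile: File): Unit =
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener { _ =>
      deserializeStream.close()   // close the deserialization stream
      spillFile.delete()          // remove the temp spilled file
    }
  }
 {code}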






[jira] [Resolved] (SPARK-9154) Implement code generation for StringFormat

2015-07-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-9154.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7546
[https://github.com/apache/spark/pull/7546]

 Implement code generation for StringFormat
 --

 Key: SPARK-9154
 URL: https://issues.apache.org/jira/browse/SPARK-9154
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Tarek Auel
 Fix For: 1.5.0









[jira] [Commented] (SPARK-8231) complex function: array_contains

2015-07-21 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635425#comment-14635425
 ] 

Pedro Rodriguez commented on SPARK-8231:


I can give this one a shot since I already worked on size, which is somewhat 
similar.

 complex function: array_contains
 

 Key: SPARK-8231
 URL: https://issues.apache.org/jira/browse/SPARK-8231
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 array_contains(ArrayT, value)
 Returns TRUE if the array contains value.
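
 A hedged usage sketch of the expression being added, with the behavior described above (exact registration may differ): the query returns true because 2 is in the array.
 {code}
import org.apache.spark.sql.{DataFrame, SQLContext}

def demo(sqlContext: SQLContext): DataFrame =
  sqlContext.sql("SELECT array_contains(array(1, 2, 3), 2)")
 {code}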






[jira] [Updated] (SPARK-9122) spark.mllib regression should support batch predict

2015-07-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9122:
-
  Shepherd: Joseph K. Bradley
  Assignee: Yanbo Liang
Remaining Estimate: 72h
 Original Estimate: 72h

 spark.mllib regression should support batch predict
 ---

 Key: SPARK-9122
 URL: https://issues.apache.org/jira/browse/SPARK-9122
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang
  Labels: starter
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, in spark.mllib, generalized linear regression models like 
 LinearRegressionModel, RidgeRegressionModel and LassoModel support predict() 
 via LinearRegressionModelBase.predict, which only takes single rows (feature 
 vectors).
 It should support batch prediction, taking an RDD.  (See other classes which 
 do this already such as NaiveBayesModel.)
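
 A minimal sketch of the requested batch behavior, expressed generically (names are illustrative): given any single-row predict, batch prediction is a map over an RDD of feature vectors.
 {code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def predictBatch(predictOne: Vector => Double, features: RDD[Vector]): RDD[Double] =
  features.map(predictOne)
 {code}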






[jira] [Updated] (SPARK-8481) GaussianMixtureModel predict accepting single vector

2015-07-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8481:
-
Assignee: Dariusz Kobylarz

 GaussianMixtureModel predict accepting single vector
 

 Key: SPARK-8481
 URL: https://issues.apache.org/jira/browse/SPARK-8481
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Dariusz Kobylarz
Assignee: Dariusz Kobylarz
Priority: Minor
  Labels: GaussianMixtureModel, MLlib
 Fix For: 1.5.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 GaussianMixtureModel lacks a method to predict a cluster for a single input 
 vector where no SparkContext would be involved, i.e.
 /** Maps given point to its cluster index. */
 def predict(point: Vector) : Int
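
 A hedged sketch of the requested local predict, assuming the model's public weights and gaussians members: pick the cluster with the highest weighted density for a single point, with no SparkContext involved.
 {code}
import org.apache.spark.mllib.clustering.GaussianMixtureModel
import org.apache.spark.mllib.linalg.Vector

def predictLocal(model: GaussianMixtureModel, point: Vector): Int = {
  // Weighted density of the point under each mixture component.
  val scores = model.weights.zip(model.gaussians).map {
    case (w, g) => w * g.pdf(point)
  }
  scores.indexOf(scores.max)
}
 {code}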






[jira] [Commented] (SPARK-3157) Avoid duplicated stats in DecisionTree extractLeftRightNodeAggregates

2015-07-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635469#comment-14635469
 ] 

Joseph K. Bradley commented on SPARK-3157:
--

Good point, I'll close this.  Thanks!

 Avoid duplicated stats in DecisionTree extractLeftRightNodeAggregates
 -

 Key: SPARK-3157
 URL: https://issues.apache.org/jira/browse/SPARK-3157
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor

 Improvement: computation, memory usage
 For ordered features, extractLeftRightNodeAggregates() computes pairs of 
 cumulative sums.  However, these sums are redundant since they are simply 
 cumulative sums accumulating from the left and right ends, respectively.  
 Only compute one sum.
 For unordered features, the left and right aggregates are essentially the 
 same data, copied from the original aggregates, but shifted by one index.  
 Avoid copying data.
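
 A tiny numeric sketch of the redundancy for ordered features (the per-bin stats are made up): the "right" cumulative aggregate is just the total minus the "left" one, so only a single prefix-sum pass is needed.
 {code}
val stats    = Array(3.0, 1.0, 4.0, 1.0, 5.0)
val leftCum  = stats.scanLeft(0.0)(_ + _).tail   // cumulative sums from the left
val total    = leftCum.last
val rightCum = leftCum.map(total - _)            // derived, no second accumulation
 {code}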






[jira] [Assigned] (SPARK-9224) OnlineLDAOptimizer Performance Improvements

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9224:
---

Assignee: (was: Apache Spark)

 OnlineLDAOptimizer Performance Improvements
 ---

 Key: SPARK-9224
 URL: https://issues.apache.org/jira/browse/SPARK-9224
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Feynman Liang
Priority: Critical

 OnlineLDAOptimizer's current implementation can be improved by using in-place 
 updates (instead of reassignment to vars), reducing the number of 
 transpositions, and using an outer product (instead of looping) to collect stats.
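
 A hedged Breeze illustration of the "outer product instead of looping" point, with made-up numbers: one document's K topic weights times its V word counts gives its K x V contribution to the sufficient statistics in a single step.
 {code}
import breeze.linalg.{DenseMatrix, DenseVector}

val gamma = DenseVector(0.1, 0.7, 0.2)            // K = 3 topic weights
val cts   = DenseVector(2.0, 0.0, 1.0, 3.0)       // V = 4 word counts
val contribution: DenseMatrix[Double] = gamma * cts.t   // K x V in one operation
 {code}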






[jira] [Assigned] (SPARK-9224) OnlineLDAOptimizer Performance Improvements

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9224:
---

Assignee: Apache Spark

 OnlineLDAOptimizer Performance Improvements
 ---

 Key: SPARK-9224
 URL: https://issues.apache.org/jira/browse/SPARK-9224
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Feynman Liang
Assignee: Apache Spark
Priority: Critical

 OnlineLDAOptimizer's current implementation can be improved by using in-place 
 updates (instead of reassignment to vars), reducing the number of 
 transpositions, and using an outer product (instead of looping) to collect stats.






[jira] [Commented] (SPARK-9154) Implement code generation for StringFormat

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635510#comment-14635510
 ] 

Apache Spark commented on SPARK-9154:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/7570

 Implement code generation for StringFormat
 --

 Key: SPARK-9154
 URL: https://issues.apache.org/jira/browse/SPARK-9154
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Tarek Auel
 Fix For: 1.5.0









[jira] [Closed] (SPARK-9128) Get outerclasses and objects at the same time in ClosureCleaner

2015-07-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-9128.

  Resolution: Fixed
   Fix Version/s: 1.5.0
Target Version/s: 1.5.0

 Get outerclasses and objects at the same time in ClosureCleaner
 ---

 Key: SPARK-9128
 URL: https://issues.apache.org/jira/browse/SPARK-9128
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
 Fix For: 1.5.0


 Currently, in ClosureCleaner, the outer classes and objects are retrieved 
 using two different methods. However, the logic of the two methods is the 
 same, and we can get both the outer classes and objects with only one method 
 call.






[jira] [Updated] (SPARK-7171) Allow for more flexible use of metric sources

2015-07-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7171:
-
Assignee: Jacek Lewandowski

 Allow for more flexible use of metric sources
 -

 Key: SPARK-7171
 URL: https://issues.apache.org/jira/browse/SPARK-7171
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Jacek Lewandowski
Assignee: Jacek Lewandowski
Priority: Minor
 Fix For: 1.5.0


 With the current API, the user is allowed to add a custom metric source by 
 providing its class in the metrics configuration. Metrics themselves are provided 
 by Codahale, and therefore allow registering multiple metrics in a single 
 source. Basically, we can break the available metrics into two types: push and 
 pull. By push metrics I mean that some execution code updates the metric by 
 itself, either periodically or every n events. The pull metrics, on the other 
 hand, include some function which pulls the data from the execution 
 environment when triggered. 
 h5.Problem
 The metric source is instantiated and registered during initialisation. After 
 that, the user has no way to access the instantiated object, and it is also 
 almost impossible to access the execution environment of the current task. 
 Therefore, a user who wants to provide their own {{RDD}} implementation 
 along with a dedicated metrics source would find it very difficult to do 
 this in a safe, concise and elegant way.
 h5.Proposed solution
 At least for the push metrics, it would be nice to be able to retrieve the 
 metrics source of a particular type, or with a particular id, from {{TaskContext}}. 
 It would allow custom tasks to update various metrics and would greatly 
 improve the usability of metrics.
 This could be achieved quite easily: since {{TaskContext}} is created by the 
 {{Executor}}, which has access to the metrics system, the executor could inject 
 a method to retrieve the particular metrics source. 
 This solution wouldn't change the current API, but just introduce one more 
 method in {{TaskContext}}. 
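
 A purely hypothetical sketch of the shape of the proposal; no such lookup exists on TaskContext at the time of this ticket. A task would fetch a registered push-style source by id and update one of its Codahale metrics.
 {code}
import com.codahale.metrics.MetricRegistry

trait MetricsSourceLookup {
  def getMetricsSource(sourceId: String): Option[MetricRegistry]
}

// Inside a task: update a counter on a custom source, if it is registered.
def recordRowsRead(ctx: MetricsSourceLookup, n: Long): Unit =
  ctx.getMetricsSource("myCustomRddSource")
     .foreach(_.counter("rowsRead").inc(n))
 {code}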






[jira] [Closed] (SPARK-7171) Allow for more flexible use of metric sources

2015-07-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-7171.

  Resolution: Fixed
   Fix Version/s: 1.5.0
Target Version/s: 1.5.0

 Allow for more flexible use of metric sources
 -

 Key: SPARK-7171
 URL: https://issues.apache.org/jira/browse/SPARK-7171
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Jacek Lewandowski
Assignee: Jacek Lewandowski
Priority: Minor
 Fix For: 1.5.0


 With the current API, the user is allowed to add a custom metric source by 
 providing its class in the metrics configuration. Metrics themselves are provided 
 by Codahale, and therefore allow registering multiple metrics in a single 
 source. Basically, we can break the available metrics into two types: push and 
 pull. By push metrics I mean that some execution code updates the metric by 
 itself, either periodically or every n events. The pull metrics, on the other 
 hand, include some function which pulls the data from the execution 
 environment when triggered. 
 h5.Problem
 The metric source is instantiated and registered during initialisation. After 
 that, the user has no way to access the instantiated object, and it is also 
 almost impossible to access the execution environment of the current task. 
 Therefore, a user who wants to provide their own {{RDD}} implementation 
 along with a dedicated metrics source would find it very difficult to do 
 this in a safe, concise and elegant way.
 h5.Proposed solution
 At least for the push metrics, it would be nice to be able to retrieve the 
 metrics source of a particular type, or with a particular id, from {{TaskContext}}. 
 It would allow custom tasks to update various metrics and would greatly 
 improve the usability of metrics.
 This could be achieved quite easily: since {{TaskContext}} is created by the 
 {{Executor}}, which has access to the metrics system, the executor could inject 
 a method to retrieve the particular metrics source. 
 This solution wouldn't change the current API, but just introduce one more 
 method in {{TaskContext}}. 






[jira] [Resolved] (SPARK-5989) Model import/export for LDAModel

2015-07-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-5989.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6948
[https://github.com/apache/spark/pull/6948]

 Model import/export for LDAModel
 

 Key: SPARK-5989
 URL: https://issues.apache.org/jira/browse/SPARK-5989
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar
 Fix For: 1.5.0


 Add save/load for LDAModel and its local and distributed variants.






[jira] [Commented] (SPARK-9183) NPE / confusing error message when looking up missing function in Spark SQL

2015-07-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635460#comment-14635460
 ] 

Reynold Xin commented on SPARK-9183:


Actually even that error message is bad - we should throw our own analysis 
exception here, not let Hive throw it.


 NPE / confusing error message when looking up missing function in Spark SQL
 ---

 Key: SPARK-9183
 URL: https://issues.apache.org/jira/browse/SPARK-9183
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1, 1.5.0
Reporter: Josh Rosen
Priority: Blocker

 Try running the following query in Spark Shell with Hive enabled:
 {code}
  sqlContext.sql("select substr('abc', 0, len('ab') - 1)")
 {code}
 This query is malformed since there's no {{len}} UDF.  Unfortunately, though, 
 this gives a really confusing error as of Spark 1.4:
 {code}
 : java.lang.NullPointerException
   at 
 org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:643)
   at 
 org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652)
   at 
 org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
   at 
 org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:380)
   at 
 org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
   at 
 org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
   at scala.Option.getOrElse(Option.scala:120)
   at 
 org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44)
   at 
 org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:380)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 [...]
 {code}
 In Spark 1.3, on the other hand, this gives a helpful message:
 {code}
 : java.lang.RuntimeException: Couldn't find function len
   at scala.sys.package$.error(package.scala:27)
   at 
 org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$1.apply(hiveUdfs.scala:55)
   at 
 org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$1.apply(hiveUdfs.scala:55)
   at scala.Option.getOrElse(Option.scala:120)
   at 
 org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
   at 
 org.apache.spark.sql.hive.HiveContext$$anon$4.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:267)
   at 
 org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:43)
   at 
 org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:43)
   at scala.Option.getOrElse(Option.scala:120)
   at 
 org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:43)
   at 
 org.apache.spark.sql.hive.HiveContext$$anon$4.lookupFunction(HiveContext.scala:267)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:431)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$12$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:429)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
 

[jira] [Commented] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code

2015-07-21 Thread Robert Beauchemin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635480#comment-14635480
 ] 

Robert Beauchemin commented on SPARK-9078:
--

That was quick. Not sure that I have all the pieces in place for building right 
now; is it required? ;-) I was just browsing the source code to figure out what 
would be required to add and use a new, fully supported JDBC-based data source 
(how all the pieces fit together) and came across the hardcoded SQL statement.

 Use of non-standard LIMIT keyword in JDBC tableExists code
 --

 Key: SPARK-9078
 URL: https://issues.apache.org/jira/browse/SPARK-9078
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Robert Beauchemin
Priority: Minor

 tableExists in  
 spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses 
 non-standard SQL (specifically, the LIMIT keyword) to determine whether a 
 table exists in a JDBC data source. This will cause an exception in many/most 
 JDBC databases that don't support the LIMIT keyword. See 
 http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql
 To check for table existence (or an exception), the query could be recrafted around 
 "select 1 from $table where 0 = 1", which isn't quite the same (it returns an 
 empty result set rather than the value '1') but would support more data sources 
 and also support empty tables. It is arguably ugly and possibly scans every row 
 on sources that don't support constant folding, but that is better than failing 
 on JDBC sources that don't support LIMIT. 
 Perhaps "supports LIMIT" could be a field in the JdbcDialect class for databases 
 that support the keyword to override. The ANSI standard is (OFFSET and) FETCH. 
 The standard way to check for table existence would be to use 
 information_schema.tables, which is a SQL standard but may not work for JDBC 
 data sources that support SQL but not the information_schema. The JDBC 
 DatabaseMetaData interface provides getSchemas(), which allows checking for 
 the information_schema in drivers that support it.
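
A hedged sketch of the dialect-hook idea above; tableExistsQuery is an illustrative method name, not an existing JdbcDialects API, and the default query is the "select 1 ... where 0 = 1" probe from the description.

{code}
import java.sql.Connection
import scala.util.Try

trait ExistenceProbe {
  // Dialects that do support LIMIT (MySQL, PostgreSQL, ...) could override this
  // with the cheaper "SELECT 1 FROM $table LIMIT 1" form.
  def tableExistsQuery(table: String): String = s"SELECT 1 FROM $table WHERE 0 = 1"
}

def tableExists(conn: Connection, table: String, dialect: ExistenceProbe): Boolean =
  Try {
    val stmt = conn.prepareStatement(dialect.tableExistsQuery(table))
    try stmt.executeQuery() finally stmt.close()
  }.isSuccess
{code}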



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9024) Unsafe HashJoin

2015-07-21 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-9024:
-

Assignee: Davies Liu

 Unsafe HashJoin
 ---

 Key: SPARK-9024
 URL: https://issues.apache.org/jira/browse/SPARK-9024
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 Create a version of BroadcastJoin that accepts UnsafeRow as input and 
 produces UnsafeRow as output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7105) Support model save/load in Python's GaussianMixture

2015-07-21 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635424#comment-14635424
 ] 

Manoj Kumar commented on SPARK-7105:


Hi, Are you still working on this?

 Support model save/load in Python's GaussianMixture
 ---

 Key: SPARK-7105
 URL: https://issues.apache.org/jira/browse/SPARK-7105
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Yu Ishikawa
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9223) Support model save/load in Python's LDA

2015-07-21 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-9223:
--

 Summary: Support model save/load in Python's LDA
 Key: SPARK-9223
 URL: https://issues.apache.org/jira/browse/SPARK-9223
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code

2015-07-21 Thread Robert Beauchemin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635459#comment-14635459
 ] 

Robert Beauchemin commented on SPARK-9078:
--

Great, I didn't realize that JdbcDialects.registerDialect was a public API; 
passing it through to the JDBC data source would do it.

Cheers, and thanks, Bob

 Use of non-standard LIMIT keyword in JDBC tableExists code
 --

 Key: SPARK-9078
 URL: https://issues.apache.org/jira/browse/SPARK-9078
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Robert Beauchemin
Priority: Minor

 tableExists in  
 spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses 
 non-standard SQL (specifically, the LIMIT keyword) to determine whether a 
 table exists in a JDBC data source. This will cause an exception in many/most 
 JDBC databases that don't support the LIMIT keyword. See 
 http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql
 To check for table existence (or an exception), the query could be recrafted around 
 "select 1 from $table where 0 = 1", which isn't quite the same (it returns an 
 empty result set rather than the value '1') but would support more data sources 
 and also support empty tables. It is arguably ugly and possibly scans every row 
 on sources that don't support constant folding, but that is better than failing 
 on JDBC sources that don't support LIMIT. 
 Perhaps "supports LIMIT" could be a field in the JdbcDialect class for databases 
 that support the keyword to override. The ANSI standard is (OFFSET and) FETCH. 
 The standard way to check for table existence would be to use 
 information_schema.tables, which is a SQL standard but may not work for JDBC 
 data sources that support SQL but not the information_schema. The JDBC 
 DatabaseMetaData interface provides getSchemas(), which allows checking for 
 the information_schema in drivers that support it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9224) OnlineLDAOptimizer Performance Improvements

2015-07-21 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-9224:


 Summary: OnlineLDAOptimizer Performance Improvements
 Key: SPARK-9224
 URL: https://issues.apache.org/jira/browse/SPARK-9224
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Feynman Liang
Priority: Critical


OnlineLDAOptimizer's current implementation can be improved by using in-place 
updates (instead of reassignment to vars), reducing the number of transpositions, 
and using an outer product (instead of looping) to collect stats.
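
A small Breeze sketch of the kind of change described above, with made-up variable names: accumulate the per-document contribution with one outer product and add it into the statistics matrix in place, instead of looping and reassigning vars.

{code}
import breeze.linalg.{DenseMatrix, DenseVector}

// stats: k x vocabSize accumulator; gammad: k-vector for one document;
// cts: counts of the word ids appearing in that document.
def accumulate(stats: DenseMatrix[Double],
               gammad: DenseVector[Double],
               cts: DenseVector[Double],
               ids: Seq[Int]): Unit = {
  val contribution = gammad * cts.t                 // outer product: k x ids.length
  ids.zipWithIndex.foreach { case (wordId, j) =>
    stats(::, wordId) :+= contribution(::, j)       // in-place column update
  }
}
{code}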



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9225) LDASuite needs unit tests for empty documents

2015-07-21 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-9225:


 Summary: LDASuite needs unit tests for empty documents
 Key: SPARK-9225
 URL: https://issues.apache.org/jira/browse/SPARK-9225
 Project: Spark
  Issue Type: Test
  Components: MLlib
Reporter: Feynman Liang
Priority: Minor


 We need to add a unit test to {{LDASuite}} which checks that empty documents are 
 handled appropriately, without crashing. This would require defining an empty 
 corpus within {{LDASuite}} and adding tests for the available LDA optimizers 
 (currently EM and Online). Note that only {{SparseVector}}s can be empty.
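
A minimal sketch of such a test, assuming the usual MLlib test scaffolding (an sc provided by MLlibTestSparkContext); the point is only the empty-corpus construction and the "runs without crashing" shape.

{code}
import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val vocabSize = 10
// An empty document can only be a SparseVector: no indices, no values.
val emptyDoc: Vector = Vectors.sparse(vocabSize, Array.empty[Int], Array.empty[Double])
val corpus = sc.parallelize(Seq(0L -> emptyDoc, 1L -> emptyDoc))

Seq(new EMLDAOptimizer, new OnlineLDAOptimizer).foreach { optimizer =>
  val model = new LDA().setK(2).setOptimizer(optimizer).setMaxIterations(2).run(corpus)
  assert(model.vocabSize == vocabSize)
}
{code}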



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9224) OnlineLDAOptimizer Performance Improvements

2015-07-21 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-9224:
-
  Priority: Major  (was: Critical)
Issue Type: Improvement  (was: Bug)

 OnlineLDAOptimizer Performance Improvements
 ---

 Key: SPARK-9224
 URL: https://issues.apache.org/jira/browse/SPARK-9224
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Feynman Liang

 OnlineLDAOptimizer's current implementation can be improved by using in-place 
 updates (instead of reassignment to vars), reducing the number of 
 transpositions, and using an outer product (instead of looping) to collect stats.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4598) Paginate stage page to avoid OOM with 100,000 tasks

2015-07-21 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4598.

  Resolution: Fixed
Assignee: Shixiong Zhu
   Fix Version/s: 1.5.0
Target Version/s: 1.5.0

 Paginate stage page to avoid OOM with > 100,000 tasks
 -

 Key: SPARK-4598
 URL: https://issues.apache.org/jira/browse/SPARK-4598
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Affects Versions: 1.2.0
Reporter: meiyoula
Assignee: Shixiong Zhu
 Fix For: 1.5.0


 In the HistoryServer stage page, clicking the task href in the Description 
 column triggers a GC error. The detailed error message is:
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-352] | Error for 
 /history/application_1416206401491_0010/stages/stage/ | 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:590)
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 2014-11-17 16:36:30,851 | WARN  | [qtp1083955615-364] | handle failed | 
 org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:697)
 java.lang.OutOfMemoryError: GC overhead limit exceeded



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9165) Implement code generation for CreateArray, CreateStruct, and CreateNamedStruct

2015-07-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-9165:

Shepherd: Michael Armbrust

 Implement code generation for CreateArray, CreateStruct, and CreateNamedStruct
 --

 Key: SPARK-9165
 URL: https://issues.apache.org/jira/browse/SPARK-9165
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Yijie Shen





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9222) Make class instantiation variables in DistributedLDAModel [private] clustering

2015-07-21 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-9222:
--

 Summary: Make class instantiation variables in DistributedLDAModel 
[private] clustering
 Key: SPARK-9222
 URL: https://issues.apache.org/jira/browse/SPARK-9222
 Project: Spark
  Issue Type: Test
  Components: MLlib
Reporter: Manoj Kumar
Priority: Minor


This would enable testing the various class variables, like docConcentration, 
topicConcentration, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code

2015-07-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635461#comment-14635461
 ] 

Reynold Xin commented on SPARK-9078:


Want to submit a pull request? :)

 Use of non-standard LIMIT keyword in JDBC tableExists code
 --

 Key: SPARK-9078
 URL: https://issues.apache.org/jira/browse/SPARK-9078
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Robert Beauchemin
Priority: Minor

 tableExists in  
 spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses 
 non-standard SQL (specifically, the LIMIT keyword) to determine whether a 
 table exists in a JDBC data source. This will cause an exception in many/most 
 JDBC databases that don't support the LIMIT keyword. See 
 http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql
 To check for table existence (or an exception), the query could be recrafted around 
 "select 1 from $table where 0 = 1", which isn't quite the same (it returns an 
 empty result set rather than the value '1') but would support more data sources 
 and also support empty tables. It is arguably ugly and possibly scans every row 
 on sources that don't support constant folding, but that is better than failing 
 on JDBC sources that don't support LIMIT. 
 Perhaps "supports LIMIT" could be a field in the JdbcDialect class for databases 
 that support the keyword to override. The ANSI standard is (OFFSET and) FETCH. 
 The standard way to check for table existence would be to use 
 information_schema.tables, which is a SQL standard but may not work for JDBC 
 data sources that support SQL but not the information_schema. The JDBC 
 DatabaseMetaData interface provides getSchemas(), which allows checking for 
 the information_schema in drivers that support it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-9154) Implement code generation for StringFormat

2015-07-21 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-9154:
-

Reopening since this broke the build.

 Implement code generation for StringFormat
 --

 Key: SPARK-9154
 URL: https://issues.apache.org/jira/browse/SPARK-9154
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Tarek Auel
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9122) spark.mllib regression should support batch predict

2015-07-21 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635454#comment-14635454
 ] 

Joseph K. Bradley commented on SPARK-9122:
--

OK thank you!

 spark.mllib regression should support batch predict
 ---

 Key: SPARK-9122
 URL: https://issues.apache.org/jira/browse/SPARK-9122
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
  Labels: starter

 Currently, in spark.mllib, generalized linear regression models like 
 LinearRegressionModel, RidgeRegressionModel and LassoModel support predict() 
 via LinearRegressionModelBase.predict, which only takes single rows (feature 
 vectors).
 They should support batch prediction, taking an RDD.  (See other classes, such 
 as NaiveBayesModel, which already do this.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8481) GaussianMixtureModel predict accepting single vector

2015-07-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-8481.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6906
[https://github.com/apache/spark/pull/6906]

 GaussianMixtureModel predict accepting single vector
 

 Key: SPARK-8481
 URL: https://issues.apache.org/jira/browse/SPARK-8481
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Dariusz Kobylarz
Priority: Minor
  Labels: GaussianMixtureModel, MLlib
 Fix For: 1.5.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 GaussianMixtureModel lacks a method to predict a cluster for a single input 
 vector where no spark context would be involved, i.e.
 /** Maps given point to its cluster index. */
 def predict(point: Vector) : Int



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3157) Avoid duplicated stats in DecisionTree extractLeftRightNodeAggregates

2015-07-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-3157.

   Resolution: Fixed
 Assignee: Joseph K. Bradley
Fix Version/s: 1.2.0

 Avoid duplicated stats in DecisionTree extractLeftRightNodeAggregates
 -

 Key: SPARK-3157
 URL: https://issues.apache.org/jira/browse/SPARK-3157
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor
 Fix For: 1.2.0


 Improvement: computation, memory usage
 For ordered features, extractLeftRightNodeAggregates() computes pairs of 
 cumulative sums.  However, these sums are redundant since they are simply 
 cumulative sums accumulating from the left and right ends, respectively.  
 Only compute one sum.
 For unordered features, the left and right aggregates are essentially the 
 same data, copied from the original aggregates, but shifted by one index.  
 Avoid copying data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9224) OnlineLDAOptimizer Performance Improvements

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635496#comment-14635496
 ] 

Apache Spark commented on SPARK-9224:
-

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7454

 OnlineLDAOptimizer Performance Improvements
 ---

 Key: SPARK-9224
 URL: https://issues.apache.org/jira/browse/SPARK-9224
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Feynman Liang
Priority: Critical

 OnlineLDAOptimizer's current implementation can be improved by using in-place 
 updates (instead of reassignment to vars), reducing the number of 
 transpositions, and using an outer product (instead of looping) to collect stats.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9238) two extra useless entries for bytesOfCodePointInUTF8

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636206#comment-14636206
 ] 

Apache Spark commented on SPARK-9238:
-

User 'zhichao-li' has created a pull request for this issue:
https://github.com/apache/spark/pull/7582

 two extra useless entries for bytesOfCodePointInUTF8
 

 Key: SPARK-9238
 URL: https://issues.apache.org/jira/browse/SPARK-9238
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: zhichao-li
Priority: Trivial

 Only a trivial thing - not sure if I understand correctly - but I guess only 2 
 entries in bytesOfCodePointInUTF8 are needed for the case of 6-byte code points 
 (leading byte pattern 1111110x).
 Details can be found at https://en.wikipedia.org/wiki/UTF-8 in the "Description" 
 section.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9238) two extra useless entries for bytesOfCodePointInUTF8

2015-07-21 Thread zhichao-li (JIRA)
zhichao-li created SPARK-9238:
-

 Summary: two extra useless entries for bytesOfCodePointInUTF8
 Key: SPARK-9238
 URL: https://issues.apache.org/jira/browse/SPARK-9238
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: zhichao-li
Priority: Trivial


Only a trivial thing - not sure if I understand correctly - but I guess only 2 
entries in bytesOfCodePointInUTF8 are needed for the case of 6-byte code points 
(leading byte pattern 1111110x).
Details can be found at https://en.wikipedia.org/wiki/UTF-8 in the "Description" 
section.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9240) Hybrid aggregate operator

2015-07-21 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9240:
---

 Summary: Hybrid aggregate operator
 Key: SPARK-9240
 URL: https://issues.apache.org/jira/browse/SPARK-9240
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Blocker


We need a hybrid aggregate operator, which first tries hash-based aggregation 
and gracefully switches to sort-based aggregation if the hash map's memory 
footprint exceeds a given threshold (how do we track the memory footprint, and 
how do we set the threshold?).
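
A simplified, illustrative sketch of the hybrid idea (not the actual operator): aggregate into a hash map first and, past a threshold, fall back to sorting everything seen so far plus the remaining input and merging adjacent equal keys. The entry-count limit stands in for a real memory-footprint estimate, and a real operator would spill to disk rather than sort in memory.

{code}
import scala.collection.mutable

def hybridAggregate[K: Ordering, V](
    rows: Iterator[(K, V)],
    merge: (V, V) => V,
    maxEntries: Int): Iterator[(K, V)] = {
  val hashMap = mutable.HashMap.empty[K, V]
  var overflow = false
  while (rows.hasNext && !overflow) {
    val (k, v) = rows.next()
    hashMap.update(k, hashMap.get(k).map(merge(_, v)).getOrElse(v))
    overflow = hashMap.size > maxEntries              // crude footprint proxy
  }
  if (!overflow) hashMap.iterator
  else {
    // Sort-based fallback: sort partial aggregates plus the rest of the input,
    // then merge runs of equal keys.
    val sorted = (hashMap.iterator ++ rows).toSeq.sortBy(_._1)
    sorted.foldLeft(List.empty[(K, V)]) {
      case ((pk, pv) :: tail, (k, v)) if pk == k => (k, merge(pv, v)) :: tail
      case (acc, kv)                             => kv :: acc
    }.reverseIterator
  }
}
{code}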



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9240) Hybrid aggregate operator using unsafe row

2015-07-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9240:

Summary: Hybrid aggregate operator using unsafe row  (was: Hybrid aggregate 
operator)

 Hybrid aggregate operator using unsafe row
 --

 Key: SPARK-9240
 URL: https://issues.apache.org/jira/browse/SPARK-9240
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Blocker

 We need a hybrid aggregate operator, which first tries hash-based aggregation 
 and gracefully switches to sort-based aggregation if the hash map's memory 
 footprint exceeds a given threshold (how do we track the memory footprint, and 
 how do we set the threshold?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9244) Increase some default memory limits

2015-07-21 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-9244:


 Summary: Increase some default memory limits
 Key: SPARK-9244
 URL: https://issues.apache.org/jira/browse/SPARK-9244
 Project: Spark
  Issue Type: Improvement
Reporter: Matei Zaharia
Assignee: Matei Zaharia


There are a few memory limits that people hit often and that we could make 
higher, especially now that memory sizes have grown.

- spark.akka.frameSize: This defaults to 10 (MB) but is often hit for map output 
statuses in large shuffles. AFAIK the memory is not fully allocated up-front, 
so we can just make this larger and still not affect jobs that never send a 
status that large.

- spark.executor.memory: Defaults to 512m, which is really small. We can at 
least increase it to 1g, though this is something users do need to set on their 
own (see the sketch below).
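
For reference, both limits can already be raised per application; a hedged example (the values are illustrative, not recommendations):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("bigger-defaults-example")
  .set("spark.akka.frameSize", "100")    // MB; the default discussed above is 10
  .set("spark.executor.memory", "2g")    // the default discussed above is 512m
val sc = new SparkContext(conf)
{code}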



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9053) Fix spaces around parens, infix operators etc.

2015-07-21 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636060#comment-14636060
 ] 

Shivaram Venkataraman commented on SPARK-9053:
--

Yeah - there are a bunch of real issues to be fixed first, and we can discuss 
the ignore rule after that. Also, I don't think we should ignore all warnings of 
this form -- just, say, on the `^` operator -- or we could mark out portions of 
the code that need to be ignored, etc.

 Fix spaces around parens, infix operators etc.
 --

 Key: SPARK-9053
 URL: https://issues.apache.org/jira/browse/SPARK-9053
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman

 We have a number of style errors which look like 
 {code}
 Place a space before left parenthesis
 ...
 Put spaces around all infix operators.
 {code}
 However some of the warnings are spurious (example space around infix 
 operator in
 {code}
 expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"], 
 sqrt(4^2 + 8^2))
 {code}
 We should add an ignore rule for these spurious examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8641) Native Spark Window Functions

2015-07-21 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-8641:
-
Description: 
*Rationale*
The window operator currently uses Hive UDAFs for all aggregation operations. 
This is fine in terms of performance and functionality. However, they limit 
extensibility, and they are quite opaque in terms of processing and memory 
usage. The latter blocks advanced optimizations such as code generation and 
Tungsten-style (advanced) memory management.

*Requirements*
We want to address this by replacing the Hive UDAFs with native Spark SQL UDAFs. 
A redesign of the Spark UDAFs is currently underway; see SPARK-4366. The new 
window UDAFs should use this new standard, in order to make them as future-proof 
as possible. Although we are replacing the standard Hive UDAFs, other 
existing Hive UDAFs should still be supported.

The new window UDAFs should, at least, cover all existing Hive standard window 
UDAFs:
# FIRST_VALUE
# LAST_VALUE
# LEAD
# LAG
# ROW_NUMBER
# RANK
# DENSE_RANK
# PERCENT_RANK
# NTILE
# CUME_DIST

All these functions imply a row order; this means that, in order to use these 
functions properly, an ORDER BY clause must be defined.

The first and last value UDAFs are already present in Spark SQL. The only thing 
which needs to be added is skip NULL functionality.

LEAD and LAG are not aggregates. These expressions return the value of an 
expression a number of rows before (LAG) or ahead (LEAD) of the current row. 
These expressions put a constraint on the window frame in which they are 
executed: this can only be a row frame with equal offsets.

The ROW_NUMBER() function can be seen as a count in a running row frame (ROWS 
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).

RANK(), DENSE_RANK(), PERCENT_RANK(), NTILE(..) & CUME_DIST() depend on the 
actual values of the ORDER BY expressions. The ORDER BY expression(s) must be 
made available before these functions are evaluated. All these functions will 
have a fixed frame, but this will depend on the implementation (probably a 
running row frame).

PERCENT_RANK(), NTILE(..) & CUME_DIST() also depend on the size of the 
partition being evaluated. The partition size must either be made available 
during evaluation (this is perfectly feasible in the current implementation) or 
the expression must be split over two window expressions and a merging expression; 
for instance, PERCENT_RANK() would look like this:
{noformat}
(RANK() OVER (PARTITION BY x ORDER BY y) - 1) / (COUNT(*) OVER (PARTITION BY x) 
- 1)
{noformat}

*Design*
The old WindowFunction interface will be replaced by the following 
(initial/very early) design (including sub-classes):
{noformat}
/**
 * A window function is a function that can only be evaluated in the context of 
a window operator.
 */
trait WindowFunction {
  self: Expression =>

  /**
   * Define the frame in which the window operator must be executed.
   */
  def frame: WindowFrame = UnspecifiedFrame
}

/**
 * Base class for LEAD/LAG offset window functions.
 *
 * These are ordinary expressions, the idea is that the Window operator will 
process these in a
 * separate (specialized) window frame.
 */
abstract class OffsetWindowFunction(val child: Expression, val offset: Int, val 
default: Expression) {
  override def deterministic: Boolean = false
  ...
}

case class Lead(child: Expression, offset: Int, default: Expression) extends 
OffsetWindowFunction(child, offset, default) {
  override val frame = SpecifiedWindowFrame(RowFrame, ValuePreceding(offset), 
ValuePreceding(offset))

  ...
}

case class Lag(child: Expression, offset: Int, default: Expression) extends 
OffsetWindowFunction(child, offset, default) {
  override val frame = SpecifiedWindowFrame(RowFrame, ValueFollowing(offset), 
ValueFollowing(offset))

  ...
}

case class RowNumber() extends AlgebraicAggregate with WindowFunction {
  override def deterministic: Boolean = false
  override val frame = SpecifiedWindowFrame(RowFrame, UnboundedPreceding, 
CurrentRow)
  ...
}

abstract class RankLike(val order: Seq[Expression] = Nil) extends 
AlgebraicAggregate with WindowFunction {
  override def deterministic: Boolean = true

  // This can be injected by either the Planner or the Window operator.
  def withOrderSpec(orderSpec: Seq[Expression]): AggregateWindowFunction

  // This will be injected by the Window operator.
  // Only needed by: PERCENT_RANK(), NTILE(..) & CUME_DIST(). Maybe put this in a subclass.
  def withPartitionSize(size: MutableLiteral): AggregateWindowFunction

  // We can do this as long as partition size is available before execution...
  override val frame = SpecifiedWindowFrame(RowFrame, UnboundedPreceding, 
CurrentRow)
  ...
}

case class Rank(order: Seq[Expression] = Nil) extends RankLike(order) {
  ...
}

case class DenseRank(order: Seq[Expression] = Nil) extends RankLike(order) {
  ...
}

case class 

[jira] [Updated] (SPARK-7368) add QR decomposition for RowMatrix

2015-07-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7368:
-
Shepherd: Xiangrui Meng

 add QR decomposition for RowMatrix
 --

 Key: SPARK-7368
 URL: https://issues.apache.org/jira/browse/SPARK-7368
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
   Original Estimate: 48h
  Remaining Estimate: 48h

 Add QR decomposition for RowMatrix.
 There's a great distributed algorithm for QR decomposition, which I'm 
 currently referring to.
 Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations 
 for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE 
 International Conference on Big Data
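
For illustration, a rough sketch of the TSQR reduction idea from the cited paper (not the RowMatrix API being proposed): factor each partition block locally, keep only the small R factors, and re-factor stacked pairs of Rs. It assumes Breeze's qr.reduced (thin QR) is available.

{code}
import breeze.linalg.{DenseMatrix, qr}
import org.apache.spark.rdd.RDD

def tallSkinnyR(blocks: RDD[DenseMatrix[Double]]): DenseMatrix[Double] =
  blocks
    .map { block =>
      val qr.QR(_, r) = qr.reduced(block)   // local thin QR, keep the n x n R
      r
    }
    .reduce { (r1, r2) =>
      val qr.QR(_, r) = qr.reduced(DenseMatrix.vertcat(r1, r2))
      r
    }
{code}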



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7368) add QR decomposition for RowMatrix

2015-07-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7368:
-
Assignee: yuhao yang

 add QR decomposition for RowMatrix
 --

 Key: SPARK-7368
 URL: https://issues.apache.org/jira/browse/SPARK-7368
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: yuhao yang
Assignee: yuhao yang
   Original Estimate: 48h
  Remaining Estimate: 48h

 Add QR decomposition for RowMatrix.
 There's a great distributed algorithm for QR decomposition, which I'm 
 currently referring to.
 Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations 
 for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE 
 International Conference on Big Data



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7368) add QR decomposition for RowMatrix

2015-07-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7368:
-
Issue Type: New Feature  (was: Improvement)

 add QR decomposition for RowMatrix
 --

 Key: SPARK-7368
 URL: https://issues.apache.org/jira/browse/SPARK-7368
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: yuhao yang
Assignee: yuhao yang
   Original Estimate: 48h
  Remaining Estimate: 48h

 Add QR decomposition for RowMatrix.
 There's a great distributed algorithm for QR decomposition, which I'm 
 currently referring to.
 Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations 
 for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE 
 International Conference on Big Data



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9239) HiveUDAF support for AggregateFunction2

2015-07-21 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9239:
---

 Summary: HiveUDAF support for AggregateFunction2
 Key: SPARK-9239
 URL: https://issues.apache.org/jira/browse/SPARK-9239
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Blocker


We need to build a wrapper for Hive UDAFs on top of AggregateFunction2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9053) Fix spaces around parens, infix operators etc.

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636321#comment-14636321
 ] 

Apache Spark commented on SPARK-9053:
-

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/7584

 Fix spaces around parens, infix operators etc.
 --

 Key: SPARK-9053
 URL: https://issues.apache.org/jira/browse/SPARK-9053
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman

 We have a number of style errors which look like 
 {code}
 Place a space before left parenthesis
 ...
 Put spaces around all infix operators.
 {code}
 However some of the warnings are spurious (example space around infix 
 operator in
 {code}
 expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"], 
 sqrt(4^2 + 8^2))
 {code}
 We should add an ignore rule for these spurious examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9053) Fix spaces around parens, infix operators etc.

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9053:
---

Assignee: (was: Apache Spark)

 Fix spaces around parens, infix operators etc.
 --

 Key: SPARK-9053
 URL: https://issues.apache.org/jira/browse/SPARK-9053
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman

 We have a number of style errors which look like 
 {code}
 Place a space before left parenthesis
 ...
 Put spaces around all infix operators.
 {code}
 However some of the warnings are spurious (example space around infix 
 operator in
 {code}
 expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"], 
 sqrt(4^2 + 8^2))
 {code}
 We should add an ignore rule for these spurious examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9121) Get rid of the warnings about `no visible global function definition` in SparkR

2015-07-21 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9121.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7567
[https://github.com/apache/spark/pull/7567]

 Get rid of the warnings about `no visible global function definition` in 
 SparkR
 ---

 Key: SPARK-9121
 URL: https://issues.apache.org/jira/browse/SPARK-9121
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa
 Fix For: 1.5.0


 We have a lot of warnings about {{no visible global function definition}} in 
 SparkR. So we should get rid of them.
 {noformat}
 R/utils.R:513:5: warning: no visible global function definition for 
 ‘processClosure’
 processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv)
 ^~
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9121) Get rid of the warnings about `no visible global function definition` in SparkR

2015-07-21 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-9121:
-
Assignee: Yu Ishikawa

 Get rid of the warnings about `no visible global function definition` in 
 SparkR
 ---

 Key: SPARK-9121
 URL: https://issues.apache.org/jira/browse/SPARK-9121
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa
Assignee: Yu Ishikawa
 Fix For: 1.5.0


 We have a lot of warnings about {{no visible global function definition}} in 
 SparkR. So we should get rid of them.
 {noformat}
 R/utils.R:513:5: warning: no visible global function definition for 
 ‘processClosure’
 processClosure(func.body, oldEnv, defVars, checkedFuncs, newEnv)
 ^~
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9236) Left Outer Join with empty JavaPairRDD returns empty RDD

2015-07-21 Thread Vitalii Slobodianyk (JIRA)
Vitalii Slobodianyk created SPARK-9236:
--

 Summary: Left Outer Join with empty JavaPairRDD returns empty RDD
 Key: SPARK-9236
 URL: https://issues.apache.org/jira/browse/SPARK-9236
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.1, 1.3.1
Reporter: Vitalii Slobodianyk


When the *left outer join* is performed on a non-empty {{JavaPairRDD}} with a 
{{JavaPairRDD}} that was created with the {{emptyRDD()}} method, the resulting 
RDD is empty. In the following unit test, the last assert fails.

{code}
import static org.assertj.core.api.Assertions.assertThat;

import java.util.Collections;

import lombok.val;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.Test;

import scala.Tuple2;

public class SparkTest {

  @Test
  public void joinEmptyRDDTest() {
val sparkConf = new SparkConf().setAppName("test").setMaster("local");

try (val sparkContext = new JavaSparkContext(sparkConf)) {
  val oneRdd = sparkContext.parallelize(Collections.singletonList("one"));
  val twoRdd = sparkContext.parallelize(Collections.singletonList("two"));
  val threeRdd = sparkContext.emptyRDD();

  val onePair = oneRdd.mapToPair(t -> new Tuple2<Integer, String>(1, t));
  val twoPair = twoRdd.groupBy(t -> 1);
  val threePair = threeRdd.groupBy(t -> 1);

  assertThat(onePair.leftOuterJoin(twoPair).collect()).isNotEmpty();
  assertThat(onePair.leftOuterJoin(threePair).collect()).isNotEmpty();
}
  }

}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9238) two extra useless entries for bytesOfCodePointInUTF8

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9238:
---

Assignee: (was: Apache Spark)

 two extra useless entries for bytesOfCodePointInUTF8
 

 Key: SPARK-9238
 URL: https://issues.apache.org/jira/browse/SPARK-9238
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: zhichao-li
Priority: Trivial

 Only a trivial thing - not sure if I understand correctly - but I guess only 2 
 entries in bytesOfCodePointInUTF8 are needed for the case of 6-byte code points 
 (leading byte pattern 1111110x).
 Details can be found at https://en.wikipedia.org/wiki/UTF-8 in the "Description" 
 section.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9238) two extra useless entries for bytesOfCodePointInUTF8

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9238:
---

Assignee: Apache Spark

 two extra useless entries for bytesOfCodePointInUTF8
 

 Key: SPARK-9238
 URL: https://issues.apache.org/jira/browse/SPARK-9238
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: zhichao-li
Assignee: Apache Spark
Priority: Trivial

 Only a trivial thing - not sure if I understand correctly - but I guess only 2 
 entries in bytesOfCodePointInUTF8 are needed for the case of 6-byte code points 
 (leading byte pattern 1111110x).
 Details can be found at https://en.wikipedia.org/wiki/UTF-8 in the "Description" 
 section.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8232) complex function: sort_array

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636146#comment-14636146
 ] 

Apache Spark commented on SPARK-8232:
-

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/7581

 complex function: sort_array
 

 Key: SPARK-8232
 URL: https://issues.apache.org/jira/browse/SPARK-8232
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 sort_array(Array<T>)
 Sorts the input array in ascending order according to the natural ordering of 
 the array elements and returns it
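
A hedged usage sketch of the intended semantics (mirroring Hive's sort_array); the exact registration in Spark SQL is still being added, so treat this as illustrative only.

{code}
// From a Spark shell with a SQLContext/HiveContext named sqlContext:
val row = sqlContext.sql("SELECT sort_array(array(3, 1, 2)) AS s").collect().head
// Expected: the single column holds the sorted array [1, 2, 3]
{code}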



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8232) complex function: sort_array

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8232:
---

Assignee: Apache Spark  (was: Cheng Hao)

 complex function: sort_array
 

 Key: SPARK-8232
 URL: https://issues.apache.org/jira/browse/SPARK-8232
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 sort_array(Array<T>)
 Sorts the input array in ascending order according to the natural ordering of 
 the array elements and returns it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8232) complex function: sort_array

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8232:
---

Assignee: Cheng Hao  (was: Apache Spark)

 complex function: sort_array
 

 Key: SPARK-8232
 URL: https://issues.apache.org/jira/browse/SPARK-8232
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 sort_array(Array<T>)
 Sorts the input array in ascending order according to the natural ordering of 
 the array elements and returns it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4366) Aggregation Improvement

2015-07-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-4366:

Priority: Critical  (was: Major)

 Aggregation Improvement
 ---

 Key: SPARK-4366
 URL: https://issues.apache.org/jira/browse/SPARK-4366
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Critical
 Attachments: aggregatefunction_v1.pdf


 This improvement actually includes couple of sub tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9053) Fix spaces around parens, infix operators etc.

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9053:
---

Assignee: Apache Spark

 Fix spaces around parens, infix operators etc.
 --

 Key: SPARK-9053
 URL: https://issues.apache.org/jira/browse/SPARK-9053
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman
Assignee: Apache Spark

 We have a number of style errors which look like 
 {code}
 Place a space before left parenthesis
 ...
 Put spaces around all infix operators.
 {code}
 However some of the warnings are spurious (example space around infix 
 operator in
 {code}
 expect_equal(collect(select(df, hypot(df$a, df$b)))[4, "HYPOT(a, b)"], 
 sqrt(4^2 + 8^2))
 {code}
 We should add an ignore rule for these spurious examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9210) checkValidAggregateExpression() throws exceptions with bad error messages

2015-07-21 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14636115#comment-14636115
 ] 

Simeon Simeonov commented on SPARK-9210:


Standalone test demonstrating the problem  spark-shell output: 
https://gist.github.com/ssimeonov/72c8a9b01f99e35ba470

 checkValidAggregateExpression() throws exceptions with bad error messages
 -

 Key: SPARK-9210
 URL: https://issues.apache.org/jira/browse/SPARK-9210
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: N/A
Reporter: Simeon Simeonov
Priority: Trivial

 When a result column in {{SELECT ... GROUP BY}} is neither one of the {{GROUP 
 BY}} expressions nor uses an aggregation function, 
 {{org.apache.spark.sql.catalyst.analysis.CheckAnalysis}} throws 
 {{org.apache.spark.sql.AnalysisException}} with the message "expression 
 '_column expression_' is neither present in the group by, nor is it an 
 aggregate function. Add to group by or wrap in first() if you don't care 
 which value you get."
 The remedy suggested in the exception message is incorrect: the function name 
 is {{first_value}}, not {{first}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4367) Partial aggregation support the DISTINCT aggregation

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4367:
---

Assignee: (was: Apache Spark)

 Partial aggregation support the DISTINCT aggregation
 

 Key: SPARK-4367
 URL: https://issues.apache.org/jira/browse/SPARK-4367
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Most aggregate functions (e.g. average) over distinct values require all of 
 the records in the same group to be shuffled to a single node. However, as 
 part of the optimization, those records can be partially aggregated before 
 shuffling, which probably reduces the overhead of the shuffle significantly.
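
An illustrative RDD-level sketch (not the SQL planner's actual plan) of something like AVG(DISTINCT value) GROUP BY key: duplicates are eliminated with map-side combining before the wide shuffle, so far fewer records cross the network. The names are made up for the example.

{code}
import org.apache.spark.rdd.RDD

def avgDistinct(pairs: RDD[(String, Double)]): RDD[(String, Double)] = {
  pairs
    .distinct()   // partial step: map-side combine drops duplicate (key, value) pairs
    .mapValues(v => (v, 1L))
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
    .mapValues { case (sum, count) => sum / count }
}
{code}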



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3056) Sort-based Aggregation

2015-07-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3056:
---

Assignee: (was: Apache Spark)

 Sort-based Aggregation
 --

 Key: SPARK-3056
 URL: https://issues.apache.org/jira/browse/SPARK-3056
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Currently, Spark SQL only supports hash-based aggregation, which may cause 
 OOM if there are too many distinct grouping keys in the input tuples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9227) Add option to set logging level in Spark Context Constructor

2015-07-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635655#comment-14635655
 ] 

Sean Owen commented on SPARK-9227:
--

Why? there's already logging framework APIs for this, both config-driven and 
programmatic.

 Add option to set logging level in Spark Context Constructor
 

 Key: SPARK-9227
 URL: https://issues.apache.org/jira/browse/SPARK-9227
 Project: Spark
  Issue Type: Wish
Reporter: Auberon López
Priority: Minor

 It would be nice to be able to set the logging level in the constructor of a 
 SparkContext. This would provide a cleaner interface than needing to call 
 setLogLevel after the context is already created. It would be especially 
 helpful in a REPL environment, where logging can clutter up the terminal and 
 make it confusing for the user. 
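
For context, a short sketch contrasting what exists with the wish: SparkContext.setLogLevel is a real method (available since roughly Spark 1.4), while the constructor-style variant below is hypothetical and only illustrates the requested interface.

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("quiet-repl").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")   // current approach: set the level after construction

// Hypothetical shape of the wish (not an existing API):
// val sc = new SparkContext(conf, logLevel = "WARN")
{code}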



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9227) Add option to set logging level in Spark Context Constructor

2015-07-21 Thread JIRA
Auberon López created SPARK-9227:


 Summary: Add option to set logging level in Spark Context 
Constructor
 Key: SPARK-9227
 URL: https://issues.apache.org/jira/browse/SPARK-9227
 Project: Spark
  Issue Type: Wish
Reporter: Auberon López
Priority: Minor


It would be nice to be able to set the logging level in the constructor of a 
SparkContext. This would provide a cleaner interface than needing to call 
setLogLevel after the context is already created. It would be especially 
helpful in a REPL environment, where logging can clutter up the terminal and 
make it confusing for the user. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9131) UDFs change data values

2015-07-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635690#comment-14635690
 ] 

Reynold Xin commented on SPARK-9131:


I see. Even if we have a relatively large dataset, as long as we can reproduce 
it, it'd be great to have.


 UDFs change data values
 ---

 Key: SPARK-9131
 URL: https://issues.apache.org/jira/browse/SPARK-9131
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0, 1.4.1
 Environment: Pyspark 1.4 and 1.4.1
Reporter: Luis Guerra
Priority: Critical

 I am having some trouble when using a custom UDF in DataFrames with PySpark 1.4.
 I have rewritten the UDF to simplify the problem, and it gets even weirder. 
 The UDFs I am using do absolutely nothing: they just receive some value and 
 output the same value with the same format.
 {code}
 c= a.join(b, a['ID'] == b['ID_new'], 'inner')
 c.filter(c['ID'] == '62698917').show()
 udf_A = UserDefinedFunction(lambda x: x, DateType())
 udf_B = UserDefinedFunction(lambda x: x, DateType())
 udf_C = UserDefinedFunction(lambda x: x, DateType())
 d = c.select(c['ID'], c['t1'].alias('ta'), 
 udf_A(vinc_muestra['t2']).alias('tb'), udf_B(vinc_muestra['t1']).alias('tc'), 
 udf_C(vinc_muestra['t2']).alias('td'))
 d.filter(d['ID'] == '62698917').show()
 {code}
 I am showing here the results from the outputs:
 {code}
 +--------+--------+----------+----------+
 |      ID|  ID_new|        t1|        t2|
 +--------+--------+----------+----------+
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-20|2013-02-20|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-28|2014-02-28|
 |62698917|62698917|2012-02-20|2013-02-20|
 +--------+--------+----------+----------+
 +--------+----------+----------+----------+----------+
 |      ID|        ta|        tb|        tc|        td|
 +--------+----------+----------+----------+----------+
 |62698917|2012-02-28|2007-03-05|2003-03-05|2014-02-28|
 |62698917|2012-02-20|2007-02-15|2002-02-15|2013-02-20|
 |62698917|2012-02-28|2007-03-10|2005-03-10|2014-02-28|
 |62698917|2012-02-20|2007-03-05|2003-03-05|2013-02-20|
 |62698917|2012-02-20|2013-08-02|2013-01-02|2013-02-20|
 |62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
 |62698917|2012-02-28|2007-02-15|2002-02-15|2014-02-28|
 |62698917|2012-02-20|2014-01-02|2013-01-02|2013-02-20|
 +--------+----------+----------+----------+----------+
 {code}
 The problem here is that the values in columns 'tb', 'tc' and 'td' of dataframe 
 'd' are completely different from the values of 't1' and 't2' in dataframe 'c', 
 even though my UDFs do nothing. It looks as if the values were somehow taken 
 from other records (or simply invented). The results also differ between 
 executions (apparently at random).
 Thanks in advance
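
For anyone trying to reproduce this, here is a minimal self-contained PySpark sketch of the reported pattern (identity UDFs of DateType applied after a join). The data and column values are placeholders, and the original snippet mixes the names {{c}} and {{vinc_muestra}}, so this only mirrors the shape of the problem; it is not a confirmed reproduction:

{code}
import datetime
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DateType

sc = SparkContext(appName="identity-udf-check")
sqlContext = SQLContext(sc)

# Placeholder data mirroring the report: an ID plus two date columns.
a = sqlContext.createDataFrame(
    [("62698917", datetime.date(2012, 2, 28), datetime.date(2014, 2, 28)),
     ("62698917", datetime.date(2012, 2, 20), datetime.date(2013, 2, 20))],
    ["ID", "t1", "t2"])
b = sqlContext.createDataFrame([("62698917",)], ["ID_new"])

c = a.join(b, a["ID"] == b["ID_new"], "inner")

# Identity UDFs: they should return exactly the value they receive.
identity = UserDefinedFunction(lambda x: x, DateType())
d = c.select(c["ID"], c["t1"].alias("ta"),
             identity(c["t2"]).alias("tb"),
             identity(c["t1"]).alias("tc"),
             identity(c["t2"]).alias("td"))

# If the UDFs really are no-ops, these two outputs must agree row for row.
c.show()
d.show()
{code}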



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9229) pyspark yarn-cluster PYSPARK_PYTHON not set

2015-07-21 Thread Eric Kimbrel (JIRA)
Eric Kimbrel created SPARK-9229:
---

 Summary: pyspark yarn-cluster  PYSPARK_PYTHON not set
 Key: SPARK-9229
 URL: https://issues.apache.org/jira/browse/SPARK-9229
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
 Environment: centos 
Reporter: Eric Kimbrel


PYSPARK_PYTHON is set in spark-env.sh to use an alternative python installation.

Use spark-submit to run a pyspark job in yarn with cluster deploy mode.

PYSPARK_PYTHON is not set in the cluster environment, and the system default 
python is used instead of the intended interpreter.

test code: (simple.py)

from pyspark import SparkConf, SparkContext
import sys,os
conf = SparkConf()
sc = SparkContext(conf=conf)
out = [('PYTHON VERSION',str(sys.version))]
out.extend( zip( os.environ.keys(),os.environ.values() ) )
rdd = sc.parallelize(out)
rdd.coalesce(1).saveAsTextFile("hdfs://namenode/tmp/env")

submit command:

spark-submit --master yarn --deploy-mode cluster --num-executors 1 simple.py 

I've also tried setting PYSPARK_PYTHON on the command line with no effect.

It seems like there is no way to specify an alternative python executable in 
yarn-cluster mode.
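
A workaround worth trying, though untested here, is to set the interpreter explicitly for both the YARN application master (which hosts the driver in cluster mode) and the executors via configuration; the interpreter path below is only a placeholder:

{code}
# Sketch of a possible workaround (interpreter path is a placeholder):
spark-submit --master yarn --deploy-mode cluster --num-executors 1 \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/python2.7/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=/opt/python2.7/bin/python \
  simple.py
{code}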



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9229) pyspark yarn-cluster PYSPARK_PYTHON not set

2015-07-21 Thread Eric Kimbrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Kimbrel updated SPARK-9229:

Environment: centos   Cloudera 5.4.1 based off Apache Hadoop 2.6.0, using 
spark 1.5.0 built for hadoop 2.6.0 from github master branch on 7.20.2015  
(was: centos )

 pyspark yarn-cluster  PYSPARK_PYTHON not set
 

 Key: SPARK-9229
 URL: https://issues.apache.org/jira/browse/SPARK-9229
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
 Environment: centos   Cloudera 5.4.1 based off Apache Hadoop 2.6.0, 
 using spark 1.5.0 built for hadoop 2.6.0 from github master branch on 
 7.20.2015
Reporter: Eric Kimbrel

 PYSPARK_PYTHON is set in spark-env.sh to use an alternative python 
 installation.
 Use spark-submit to run a pyspark job in yarn with cluster deploy mode.
 PYSPARK_PYTHON is not set in the cluster environment, and the system default 
 python is used instead of the intended interpreter.
 test code: (simple.py)
 from pyspark import SparkConf, SparkContext
 import sys,os
 conf = SparkConf()
 sc = SparkContext(conf=conf)
 out = [('PYTHON VERSION',str(sys.version))]
 out.extend( zip( os.environ.keys(),os.environ.values() ) )
 rdd = sc.parallelize(out)
 rdd.coalesce(1).saveAsTextFile("hdfs://namenode/tmp/env")
 submit command:
 spark-submit --master yarn --deploy-mode cluster --num-executors 1 simple.py 
 I've also tried setting PYSPARK_PYTHON on the command line with no effect.
 It seems like there is no way to specify an alternative python executable in 
 yarn-cluster mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8357) Memory leakage on unsafe aggregation path with empty input

2015-07-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-8357.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7560
[https://github.com/apache/spark/pull/7560]

 Memory leakage on unsafe aggregation path with empty input
 --

 Key: SPARK-8357
 URL: https://issues.apache.org/jira/browse/SPARK-8357
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Navis
Assignee: Navis
Priority: Critical
 Fix For: 1.5.0


 Currently, the unsafe-based hash map is released in the 'next' call, but if the 
 input is empty, 'next' is never called, so the memory is never released.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9226) Change default log level to WARN in python REPL

2015-07-21 Thread Auberon López (JIRA)
Auberon López created SPARK-9226:


 Summary: Change default log level to WARN in python REPL
 Key: SPARK-9226
 URL: https://issues.apache.org/jira/browse/SPARK-9226
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Auberon López
Priority: Minor
 Fix For: 1.5.0


SPARK-7261 provides separate logging properties for the Scala REPL, changing the 
default logging level to WARN instead of INFO. The same improvement can be 
implemented for the Python REPL, which would make using PySpark interactively a 
cleaner experience that is closer to parity with the Scala shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9154) Implement code generation for StringFormat

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635618#comment-14635618
 ] 

Apache Spark commented on SPARK-9154:
-

User 'tarekauel' has created a pull request for this issue:
https://github.com/apache/spark/pull/7571

 Implement code generation for StringFormat
 --

 Key: SPARK-9154
 URL: https://issues.apache.org/jira/browse/SPARK-9154
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Tarek Auel
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9227) Add option to set logging level in Spark Context Constructor

2015-07-21 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635665#comment-14635665
 ] 

Marcelo Vanzin commented on SPARK-9227:
---

A programmatic API for this is overkill.

If you want to do something, I'd suggest making the log level in the default 
log4j config a variable, so that you can override it by setting a system 
property. No need for API changes to make that work. 
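
As a rough illustration of that suggestion (a sketch, not a patch): log4j 1.x resolves ${...} placeholders from system properties first and then from keys defined in the same file, so the shipped default could look something like the following, where the property name spark.root.logger is only a placeholder:

{code}
# conf/log4j.properties (sketch; spark.root.logger is a placeholder name)
# Default level; override per run with e.g. -Dspark.root.logger=WARN
spark.root.logger=INFO
log4j.rootCategory=${spark.root.logger}, console

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
{code}

The level could then be lowered for a given run with something like --driver-java-options "-Dspark.root.logger=WARN" on spark-submit, with no code change.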

 Add option to set logging level in Spark Context Constructor
 

 Key: SPARK-9227
 URL: https://issues.apache.org/jira/browse/SPARK-9227
 Project: Spark
  Issue Type: Wish
Reporter: Auberon López
Priority: Minor

 It would be nice to be able to set the logging level in the constructor of a 
 Spark Context. This provides a cleaner interface than needing to call 
 setLoggingLevel after the context is already created. It would be especially 
 helpful in a REPL environment where logging can clutter up the terminal and 
 make it confusing for the user. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9228) Adjust Spark SQL Configs

2015-07-21 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-9228:
---

 Summary: Adjust Spark SQL Configs
 Key: SPARK-9228
 URL: https://issues.apache.org/jira/browse/SPARK-9228
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker


Before QA, let's flip on the features and consolidate the unsafe and codegen settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8007) Support resolving virtual columns in DataFrames

2015-07-21 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635597#comment-14635597
 ] 

Michael Armbrust commented on SPARK-8007:
-

I'm going to propose that we don't change the analyzer, but instead just use 
functions for all the cases that were specified. This is nice because functions 
can never be ambiguous with a user column.


 Support resolving virtual columns in DataFrames
 ---

 Key: SPARK-8007
 URL: https://issues.apache.org/jira/browse/SPARK-8007
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Joseph Batchik

 Create the infrastructure so we can resolve df("SPARK__PARTITION__ID") to the 
 SparkPartitionID expression.
 A cool use case is to understand physical data skew:
 {code}
 df.groupBy("SPARK__PARTITION__ID").count()
 {code}
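
A sketch of the function-based alternative proposed above; the function name spark_partition_id is an assumption rather than a settled API, and the point is simply that an expression-producing function cannot collide with a user column the way a magic column name resolved by the analyzer can:

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(appName="partition-skew")
sqlContext = SQLContext(sc)
df = sqlContext.range(0, 1000000)

# Count rows per physical partition to eyeball data skew.
# (Assumed function name; it would return a Column wrapping SparkPartitionID.)
df.groupBy(F.spark_partition_id()).count().show()
{code}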



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9213) Improve regular expression performance (via joni)

2015-07-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9213:
---
Description: 
I'm creating an umbrella ticket to improve regular expression performance for 
string expressions. Right now our use of regular expressions is inefficient for 
two reasons:

1. Java regex in general is slow.
2. We have to convert everything from UTF8 encoded bytes into Java String, and 
then run regex on it, and then convert it back.

There are libraries in Java that provide regex support directly on UTF8 encoded 
bytes. One prominent example is joni, used in JRuby.


Note: all regex functions are in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala






  was:
I'm creating an umbrella ticket to improve regular expression performance for 
string expressions. Right now our use of regular expressions is inefficient for 
two reasons:

1. Java regex in general is slow.
2. We have to convert everything from UTF8 encoded bytes into Java String, and 
then run regex on it, and then convert it back.

There are libraries in Java that provide regex support directly on UTF8 encoded 
bytes. One prominent example is joni, used in JRuby.








 Improve regular expression performance (via joni)
 -

 Key: SPARK-9213
 URL: https://issues.apache.org/jira/browse/SPARK-9213
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Reporter: Reynold Xin

 I'm creating an umbrella ticket to improve regular expression performance for 
 string expressions. Right now our use of regular expressions is inefficient 
 for two reasons:
 1. Java regex in general is slow.
 2. We have to convert everything from UTF8 encoded bytes into Java String, 
 and then run regex on it, and then convert it back.
 There are libraries in Java that provide regex support directly on UTF8 
 encoded bytes. One prominent example is joni, used in JRuby.
 Note: all regex functions are in 
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8103) DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635663#comment-14635663
 ] 

Apache Spark commented on SPARK-8103:
-

User 'markhamstra' has created a pull request for this issue:
https://github.com/apache/spark/pull/7572

 DAGScheduler should not launch multiple concurrent attempts for one stage on 
 fetch failures
 ---

 Key: SPARK-8103
 URL: https://issues.apache.org/jira/browse/SPARK-8103
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 1.4.0
Reporter: Imran Rashid
Assignee: Imran Rashid
 Fix For: 1.5.0


 When there is a fetch failure, {{DAGScheduler}} is supposed to fail the 
 stage, retry the necessary portions of the preceding shuffle stage which 
 generated the shuffle data, and eventually rerun the stage.  
 We generally expect to get multiple fetch failures together, but only want to 
 re-start the stage once.  The code already makes an attempt to address this 
 https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1108
  .  
 {code}
// It is likely that we receive multiple FetchFailed for a single 
 stage (because we have
 // multiple tasks running concurrently on different executors). In 
 that case, it is possible
 // the fetch failure has already been handled by the scheduler.
 if (runningStages.contains(failedStage)) {
 {code}
 However, this logic is flawed because the stage may have been **resubmitted** 
 by the time we get these fetch failures.  In that case, 
 {{runningStages.contains(failedStage)}} will be true, but we've already 
 handled these failures.
 This results in multiple concurrent non-zombie attempts for one stage.  In 
 addition to being very confusing, and a waste of resources, this also can 
 lead to later stages being submitted before the previous stage has registered 
 its map output.  This happens because
 (a) when one attempt finishes all its tasks, it may not register its map 
 output because the stage still has pending tasks, from other attempts 
 https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1046
 {code}
 if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) {
 {code}
 and (b) {{submitStage}} thinks the following stage is ready to go, because 
 {{getMissingParentStages}} thinks the stage is complete as long it has all of 
 its map outputs: 
 https://github.com/apache/spark/blob/10ba1880878d0babcdc5c9b688df5458ea131531/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L397
 {code}
 if (!mapStage.isAvailable) {
   missing += mapStage
 }
 {code}
 So the following stage is submitted repeatedly, but it is doomed to fail 
 because its shuffle output has never been registered with the map output 
 tracker.  Here's an example failure in this case:
 {noformat}
 WARN TaskSetManager: Lost task 5.0 in stage 3.2 (TID 294, 192.168.1.104): 
 FetchFailed(null, shuffleId=0, mapId=-1, reduceId=5, message=
 org.apache.spark.shuffle.MetadataFetchFailedException: Missing output 
 locations for shuffle ...
 {noformat}
 Note that this is a subset of the problems originally described in SPARK-7308, 
 limited to just the issues affecting the DAGScheduler.
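
Purely as an illustration of the deduplication that is needed (this is not the actual DAGScheduler change), the idea is to track which attempt of a stage is current and ignore fetch failures reported by tasks from attempts that have already been superseded:

{code}
# Illustrative sketch only, not Spark code: resubmit a failed stage at most
# once per attempt by ignoring fetch failures from superseded attempts.
latest_attempt = {}  # stage id -> attempt id of the most recently submitted attempt

def should_resubmit(stage_id, failing_attempt_id):
    current = latest_attempt.get(stage_id, 0)
    if failing_attempt_id < current:
        # A newer attempt was already submitted; this failure is stale.
        return False
    # First failure seen for the current attempt: resubmit once and move on.
    latest_attempt[stage_id] = current + 1
    return True
{code}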



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9233) Enable code-gen in window function unit tests

2015-07-21 Thread Yin Huai (JIRA)
Yin Huai created SPARK-9233:
---

 Summary: Enable code-gen in window function unit tests
 Key: SPARK-9233
 URL: https://issues.apache.org/jira/browse/SPARK-9233
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai


Right now, our {{HiveWindowFunctionQuerySuite.scala}} sets code-gen to false. 
Since code-gen is enabled by default, we need to enable it for the tests in 
this file and fix any bugs we find.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9222) Make class instantiation variables in DistributedLDAModel [private] clustering

2015-07-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14635729#comment-14635729
 ] 

Apache Spark commented on SPARK-9222:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/7573

 Make class instantiation variables in DistributedLDAModel [private] clustering
 --

 Key: SPARK-9222
 URL: https://issues.apache.org/jira/browse/SPARK-9222
 Project: Spark
  Issue Type: Test
  Components: MLlib
Reporter: Manoj Kumar
Priority: Minor

 This would enable testing the various class variables like docConcentration, 
 topicConcentration, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



  1   2   3   >