[jira] [Commented] (SPARK-2574) Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner

2014-07-19 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067815#comment-14067815
 ] 

Sandeep Singh commented on SPARK-2574:
--

[~sandyr] we can rewrite mergeCombiners as
{{(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) => c1 ++= c2}}
instead of
{{(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) => c1 ++ c2}}
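
For context, a minimal Scala sketch of the combiner functions this refers to (the
surrounding combineByKey wiring is assumed, not copied from Spark): {{++=}} appends
c2's elements into c1 in place and returns c1, while {{++}} allocates a fresh
ArrayBuffer on every merge.

{code}
import scala.collection.mutable.ArrayBuffer

def createCombiner[V](v: V): ArrayBuffer[V] = ArrayBuffer(v)
def mergeValue[V](buf: ArrayBuffer[V], v: V): ArrayBuffer[V] = buf += v
// In-place merge: no new ArrayBuffer is allocated per pair of combiners.
def mergeCombiners[V](c1: ArrayBuffer[V], c2: ArrayBuffer[V]): ArrayBuffer[V] = c1 ++= c2
{code}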

> Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner
> --
>
> Key: SPARK-2574
> URL: https://issues.apache.org/jira/browse/SPARK-2574
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2597) Improve the code related to Table Scan

2014-07-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067814#comment-14067814
 ] 

Yin Huai commented on SPARK-2597:
-

Hive uses HiveInputFormat as a wrapper around different InputFormats. We may want 
to take a similar approach (HiveInputFormat cannot be used directly).

> Improve the code related to Table Scan
> --
>
> Key: SPARK-2597
> URL: https://issues.apache.org/jira/browse/SPARK-2597
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> There are several issues with the current code related to Table Scan.
> 1. HadoopTableReader and HiveTableScan are used together to deal with Hive 
> tables. It is not clear why we do the Hive-specific work in two different 
> places.
> 2. HadoopTableReader creates an RDD for every Hive partition and then unions 
> these RDDs. Is this the right way to handle partitioned tables? 
> 3. Right now, we ship initializeLocalJobConfFunc to every task to set some 
> local properties. Can we avoid it?
> I think it will be good to improve the code related to Table Scan. Also, it 
> is important to make sure we do not introduce performance issues with the 
> proposed changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2597) Improve the code related to Table Scan

2014-07-19 Thread Yin Huai (JIRA)
Yin Huai created SPARK-2597:
---

 Summary: Improve the code related to Table Scan
 Key: SPARK-2597
 URL: https://issues.apache.org/jira/browse/SPARK-2597
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai


There are several issues with the current code related to Table Scan.
1. HadoopTableReader and HiveTableScan are used together to deal with Hive 
tables. It is not clear why we do the Hive-specific work in two different 
places.
2. HadoopTableReader creates an RDD for every Hive partition and then unions 
these RDDs. Is this the right way to handle partitioned tables? 
3. Right now, we ship initializeLocalJobConfFunc to every task to set some 
local properties. Can we avoid it?

I think it will be good to improve the code related to Table Scan. Also, it is 
important to make sure we do not introduce performance issues with the proposed 
changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2524) missing document about spark.deploy.retainedDrivers

2014-07-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2524.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1443
[https://github.com/apache/spark/pull/1443]

> missing document about spark.deploy.retainedDrivers
> ---
>
> Key: SPARK-2524
> URL: https://issues.apache.org/jira/browse/SPARK-2524
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Lianhui Wang
> Fix For: 1.1.0
>
>
> The configuration option spark.deploy.retainedDrivers is undocumented but 
> actually used:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L60



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2524) missing document about spark.deploy.retainedDrivers

2014-07-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2524:
---

Assignee: Lianhui Wang

> missing document about spark.deploy.retainedDrivers
> ---
>
> Key: SPARK-2524
> URL: https://issues.apache.org/jira/browse/SPARK-2524
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Lianhui Wang
>Assignee: Lianhui Wang
> Fix For: 1.1.0
>
>
> The configuration option spark.deploy.retainedDrivers is undocumented but 
> actually used:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L60



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2587) Error message is incorrect in make-distribution.sh

2014-07-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2587.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1489
[https://github.com/apache/spark/pull/1489]

> Error message is incorrect in make-distribution.sh
> --
>
> Key: SPARK-2587
> URL: https://issues.apache.org/jira/browse/SPARK-2587
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Mark Wagner
>Assignee: Mark Wagner
>Priority: Minor
> Fix For: 1.1.0
>
>
> SPARK-2526 removed some options in favor of using Maven profiles, but it now 
> gives incorrect guidance for those that try to use the old --with-hive flag: 
> "--with-hive' is no longer supported, use Maven option -Pyarn"



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2587) Error message is incorrect in make-distribution.sh

2014-07-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2587:
---

Assignee: Mark Wagner

> Error message is incorrect in make-distribution.sh
> --
>
> Key: SPARK-2587
> URL: https://issues.apache.org/jira/browse/SPARK-2587
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Mark Wagner
>Assignee: Mark Wagner
>Priority: Minor
> Fix For: 1.1.0
>
>
> SPARK-2526 removed some options in favor of using Maven profiles, but it now 
> gives incorrect guidance for those that try to use the old --with-hive flag: 
> "--with-hive' is no longer supported, use Maven option -Pyarn"



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.

2014-07-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067765#comment-14067765
 ] 

Apache Spark commented on SPARK-2226:
-

User 'willb' has created a pull request for this issue:
https://github.com/apache/spark/pull/1497

> HAVING should be able to contain aggregate expressions that don't appear in 
> the aggregation list. 
> --
>
> Key: SPARK-2226
> URL: https://issues.apache.org/jira/browse/SPARK-2226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: William Benton
>
> https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q
> This test file contains the following query:
> {code}
> SELECT key FROM src GROUP BY key HAVING max(value) > "val_255";
> {code}
> Once we fixed this issue, we should whitelist having.q.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2596) Populate pull requests on JIRA automatically

2014-07-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2596.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1496
[https://github.com/apache/spark/pull/1496]

> Populate pull requests on JIRA automatically
> 
>
> Key: SPARK-2596
> URL: https://issues.apache.org/jira/browse/SPARK-2596
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
> Fix For: 1.1.0
>
>
> For a bunch of reasons we should automatically populate a JIRA with 
> information about new pull requests when they arrive. I've written a small 
> Python script to do this that we can run from Jenkins every 5 or 10 minutes 
> to keep things in sync.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1682) Add gradient descent w/o sampling and RDA L1 updater

2014-07-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067739#comment-14067739
 ] 

Apache Spark commented on SPARK-1682:
-

User 'dongwang218' has created a pull request for this issue:
https://github.com/apache/spark/pull/643

> Add gradient descent w/o sampling and RDA L1 updater
> 
>
> Key: SPARK-1682
> URL: https://issues.apache.org/jira/browse/SPARK-1682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Dong Wang
>
> The GradientDescent optimizer does sampling before a gradient step. When the 
> input data is already shuffled beforehand, it is possible to scan the data and 
> take a gradient step for each data instance. This could potentially be more 
> efficient.
> Add an enhanced RDA L1 updater, which can produce even sparser solutions with 
> quality comparable to L1. Reference: 
> Lin Xiao, "Dual Averaging Methods for Regularized Stochastic Learning and 
> Online Optimization", Journal of Machine Learning Research 11 (2010) 
> 2543-2596.
> Small fix: add options to the BinaryClassification example to read and write a 
> model file



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-2596) Populate pull requests on JIRA automatically

2014-07-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2596:
---

Comment: was deleted

(was: This is a test:
http://google.com)

> Populate pull requests on JIRA automatically
> 
>
> Key: SPARK-2596
> URL: https://issues.apache.org/jira/browse/SPARK-2596
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>
> For a bunch of reasons we should automatically populate a JIRA with 
> information about new pull requests when they arrive. I've written a small 
> Python script to do this that we can run from Jenkins every 5 or 10 minutes 
> to keep things in sync.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2596) Populate pull requests on JIRA automatically

2014-07-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067731#comment-14067731
 ] 

Patrick Wendell commented on SPARK-2596:


This is a test:
http://google.com

> Populate pull requests on JIRA automatically
> 
>
> Key: SPARK-2596
> URL: https://issues.apache.org/jira/browse/SPARK-2596
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>
> For a bunch of reasons we should automatically populate a JIRA with 
> information about new pull requests when they arrive. I've written a small 
> Python script to do this that we can run from Jenkins every 5 or 10 minutes 
> to keep things in sync.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1022) Add unit tests for kafka streaming

2014-07-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067730#comment-14067730
 ] 

Apache Spark commented on SPARK-1022:
-

User 'tdas' has created a pull request for this issue:
[https://github.com/apache/spark/pull/557|https://github.com/apache/spark/pull/557]

> Add unit tests for kafka streaming
> --
>
> Key: SPARK-1022
> URL: https://issues.apache.org/jira/browse/SPARK-1022
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Wendell
>Assignee: Saisai Shao
>
> It would be nice if we could add unit tests to verify elements of Kafka's 
> stream. Right now we do integration tests only, which makes it hard to upgrade 
> versions of Kafka. The place to start here would be to look at how Kafka 
> tests itself and see if the functionality can be exposed to third-party users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1630) PythonRDDs don't handle nulls gracefully

2014-07-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067729#comment-14067729
 ] 

Apache Spark commented on SPARK-1630:
-

User 'kalpit' has created a pull request for this issue:
[https://github.com/apache/spark/pull/554|https://github.com/apache/spark/pull/554]

> PythonRDDs don't handle nulls gracefully
> 
>
> Key: SPARK-1630
> URL: https://issues.apache.org/jira/browse/SPARK-1630
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0, 0.9.1
>Reporter: Kalpit Shah
> Fix For: 1.1.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> If PythonRDDs receive a null element in an iterator, they currently throw an 
> NPE. It would be better to log a DEBUG message and skip writing null elements 
> (a sketch of this behaviour follows the stack traces below).
> Here are the two stack traces:
> 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread 
> Thread[stdin writer for python,5,main]
> java.lang.NullPointerException
>   at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267)
>   at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88)
> -
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.writeToFile.
> : java.lang.NullPointerException
>   at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246)
>   at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285)
>   at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280)
>   at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at java.lang.Thread.run(Thread.java:744)  
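
A hedged sketch of the behaviour requested above (this is not Spark's actual
writeIteratorToStream; the method name and the string-only element type are
assumptions made for illustration):

{code}
import java.io.DataOutputStream

def writeStringsToStream(iter: Iterator[String], dataOut: DataOutputStream): Unit = {
  iter.foreach { elem =>
    if (elem == null) {
      // Inside Spark this would be a logDebug(...); println keeps the sketch self-contained.
      println("Skipping null element instead of writing it to the Python worker")
    } else {
      val bytes = elem.getBytes("UTF-8")
      dataOut.writeInt(bytes.length) // length-prefixed framing, similar to writeUTF-style helpers
      dataOut.write(bytes)
    }
  }
}
{code}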



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1597) Add a version of reduceByKey that takes the Partitioner as a second argument

2014-07-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067727#comment-14067727
 ] 

Apache Spark commented on SPARK-1597:
-

User 'techaddict' has created a pull request for this issue:
[https://github.com/apache/spark/pull/550|https://github.com/apache/spark/pull/550]

> Add a version of reduceByKey that takes the Partitioner as a second argument
> 
>
> Key: SPARK-1597
> URL: https://issues.apache.org/jira/browse/SPARK-1597
> Project: Spark
>  Issue Type: Bug
>Reporter: Matei Zaharia
>Assignee: Sandeep Singh
>Priority: Blocker
>
> Most of our shuffle methods can take a Partitioner or a number of partitions 
> as a second argument, but for some reason reduceByKey takes the Partitioner 
> as a *first* argument: 
> http://spark.apache.org/docs/0.9.1/api/core/#org.apache.spark.rdd.PairRDDFunctions.
>  We should deprecate that version and add one where the Partitioner is the 
> second argument.
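
A hedged sketch of the requested overload, written as an enrichment class so it
compiles outside Spark ({{reduceByKeyWith}} and {{ReduceByKeySyntax}} are made-up
names; the real fix would add the overload to PairRDDFunctions itself):

{code}
import scala.reflect.ClassTag

import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

object ReduceByKeySyntax {
  implicit class ReduceByKeyOps[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
    // Partitioner as the *second* argument, mirroring the other shuffle methods.
    def reduceByKeyWith(func: (V, V) => V, partitioner: Partitioner): RDD[(K, V)] =
      self.reduceByKey(partitioner, func) // delegate to the existing variant
  }
}
{code}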



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1623) SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files by name

2014-07-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067725#comment-14067725
 ] 

Apache Spark commented on SPARK-1623:
-

User 'nsuthar' has created a pull request for this issue:
[https://github.com/apache/spark/pull/546|https://github.com/apache/spark/pull/546]

> SPARK-1623. Broadcast cleaner should use getCanonicalPath when deleting files 
> by name
> -
>
> Key: SPARK-1623
> URL: https://issues.apache.org/jira/browse/SPARK-1623
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Niraj Suthar
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2596) Populate pull requests on JIRA automatically

2014-07-19 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-2596:
--

 Summary: Populate pull requests on JIRA automatically
 Key: SPARK-2596
 URL: https://issues.apache.org/jira/browse/SPARK-2596
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell


For a bunch of reasons we should automatically populate a JIRA with information 
about new pull requests when they arrive. I've written a small Python script to 
do this that we can run from Jenkins every 5 or 10 minutes to keep things in 
sync.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1795) Add recursive directory file search to fileInputStream

2014-07-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067722#comment-14067722
 ] 

Apache Spark commented on SPARK-1795:
-

User 'patrickotoole' has created a pull request for this issue:
[https://github.com/apache/spark/pull/537|https://github.com/apache/spark/pull/537]

> Add recursive directory file search to fileInputStream
> --
>
> Key: SPARK-1795
> URL: https://issues.apache.org/jira/browse/SPARK-1795
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Rick OToole
>
> Logs are often written into a hierarchical, partitioned directory structure. 
> This change will allow Spark Streaming to monitor all sub-directories of a 
> parent directory to find new files as they are added. 
> See https://github.com/apache/spark/pull/537
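
A hedged sketch of the recursive scan such a change needs (not the code from the
pull request; it simply walks sub-directories with the Hadoop 2 FileSystem API):

{code}
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

def listFilesRecursively(fs: FileSystem, dir: Path): Seq[FileStatus] = {
  fs.listStatus(dir).toSeq.flatMap { status =>
    if (status.isDirectory) listFilesRecursively(fs, status.getPath) // descend into sub-directories
    else Seq(status)
  }
}
{code}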



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1612) Potential resource leaks in Utils.copyStream and Utils.offsetBytes

2014-07-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067720#comment-14067720
 ] 

Patrick Wendell commented on SPARK-1612:


A pull request has been posted for this issue:
Author: zsxwing
URL: 
[https://github.com/apache/spark/pull/535|https://github.com/apache/spark/pull/535]

> Potential resource leaks in Utils.copyStream and Utils.offsetBytes
> --
>
> Key: SPARK-1612
> URL: https://issues.apache.org/jira/browse/SPARK-1612
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>  Labels: easyfix
>
> Should move the "close" statements into a "finally" block.
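
A minimal sketch of the fix shape (the signature loosely mirrors Utils.copyStream,
but this is not the actual patch):

{code}
import java.io.{InputStream, OutputStream}

def copyStream(in: InputStream, out: OutputStream, closeStreams: Boolean = false): Unit = {
  try {
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
  } finally {
    // Runs even if a read or write throws, so the streams are not leaked.
    if (closeStreams) {
      try in.close() finally out.close()
    }
  }
}
{code}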



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Issue Comment Deleted] (SPARK-1580) ALS: Estimate communication and computation costs given a partitioner

2014-07-19 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1580:
---

Comment: was deleted

(was: A pull request has been posted for this issue:
Author: tmyklebu
URL: https://github.com/apache/spark/pull/493)

> ALS: Estimate communication and computation costs given a partitioner
> -
>
> Key: SPARK-1580
> URL: https://issues.apache.org/jira/browse/SPARK-1580
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Tor Myklebust
>Priority: Minor
>
> It would be nice to be able to estimate the amount of work needed to solve an 
> ALS problem.  The chief components of this "work" are computation time---time 
> spent forming and solving the least squares problems---and communication 
> cost---the number of bytes sent across the network.  Communication cost 
> depends heavily on how the users and products are partitioned.
> We currently do not try to cluster users or products so that fewer feature 
> vectors need to be communicated.  This is intended as a first step toward 
> that end---we ought to be able to tell whether one partitioning is better 
> than another.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1581) Allow One Flume Avro RPC Server for Each Worker rather than Just One Worker

2014-07-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067719#comment-14067719
 ] 

Patrick Wendell commented on SPARK-1581:


A pull request has been posted for this issue:
Author: christopheclc
URL: https://github.com/apache/spark/pull/495

> Allow One Flume Avro RPC Server for Each Worker rather than Just One Worker
> ---
>
> Key: SPARK-1581
> URL: https://issues.apache.org/jira/browse/SPARK-1581
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Christophe Clapp
>Priority: Minor
>  Labels: Flume
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1580) ALS: Estimate communication and computation costs given a partitioner

2014-07-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067718#comment-14067718
 ] 

Patrick Wendell commented on SPARK-1580:


A pull request has been posted for this issue:
Author: tmyklebu
URL: https://github.com/apache/spark/pull/493

> ALS: Estimate communication and computation costs given a partitioner
> -
>
> Key: SPARK-1580
> URL: https://issues.apache.org/jira/browse/SPARK-1580
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Tor Myklebust
>Priority: Minor
>
> It would be nice to be able to estimate the amount of work needed to solve an 
> ALS problem.  The chief components of this "work" are computation time---time 
> spent forming and solving the least squares problems---and communication 
> cost---the number of bytes sent across the network.  Communication cost 
> depends heavily on how the users and products are partitioned.
> We currently do not try to cluster users or products so that fewer feature 
> vectors need to be communicated.  This is intended as a first step toward 
> that end---we ought to be able to tell whether one partitioning is better 
> than another.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-07-19 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067696#comment-14067696
 ] 

Chris Fregly commented on SPARK-1981:
-

[~pwendell] is there anything I need to do within the spark_ec2 scripts to 
make sure Kinesis is built and/or enabled when EC2 instances are created? I 
want to make sure I'm covering all the bases.

> Add AWS Kinesis streaming support
> -
>
> Key: SPARK-1981
> URL: https://issues.apache.org/jira/browse/SPARK-1981
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Chris Fregly
>Assignee: Chris Fregly
>
> Add AWS Kinesis support to Spark Streaming.
> Initial discussion occurred here:  https://github.com/apache/spark/pull/223
> I discussed this with Parviz from AWS recently and we agreed that I would 
> take this over.
> Look for a new PR that takes into account all the feedback from the earlier 
> PR including spark-1.0-compliant implementation, AWS-license-aware build 
> support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception

2014-07-19 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067653#comment-14067653
 ] 

Guoqiang Li commented on SPARK-2595:


Sorry I removed it.

> The driver run garbage collection, when the executor throws OutOfMemoryError 
> exception
> --
>
> Key: SPARK-2595
> URL: https://issues.apache.org/jira/browse/SPARK-2595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> The [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] 
> implementation of GC-based cleaning only considers the memory usage of the 
> driver. We should consider more factors to trigger GC, e.g. executor exit 
> code, task exceptions, and task GC time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception

2014-07-19 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2595:
---

Description: 
The [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation 
of GC-based cleaning only considers the memory usage of the driver. We should 
consider more factors to trigger GC, e.g. executor exit code, task exceptions, 
and task GC time.



  was:
[SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation 
GC-based cleaning only consider the memory usage of the drive. We should 
consider more factors to trigger gc. eg: executor exit code, task exception, 
task gc time .

[~pwendell]'s proposal:
if we detect memory pressure on the executors we should try to trigger a GC on 
the driver so that if there happen to be RDD's that have gone out of scope on 
the driver side, their associated cache blocks will be cleaned up on executors 
and free up memory.


> The driver run garbage collection, when the executor throws OutOfMemoryError 
> exception
> --
>
> Key: SPARK-2595
> URL: https://issues.apache.org/jira/browse/SPARK-2595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> The [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] 
> implementation of GC-based cleaning only considers the memory usage of the 
> driver. We should consider more factors to trigger GC, e.g. executor exit 
> code, task exceptions, and task GC time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception

2014-07-19 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067653#comment-14067653
 ] 

Guoqiang Li edited comment on SPARK-2595 at 7/19/14 7:45 PM:
-

Sorry, I removed it.


was (Author: gq):
Sorry I removed it.

> The driver run garbage collection, when the executor throws OutOfMemoryError 
> exception
> --
>
> Key: SPARK-2595
> URL: https://issues.apache.org/jira/browse/SPARK-2595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> The [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] 
> implementation of GC-based cleaning only considers the memory usage of the 
> driver. We should consider more factors to trigger GC, e.g. executor exit 
> code, task exceptions, and task GC time.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception

2014-07-19 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067651#comment-14067651
 ] 

Patrick Wendell commented on SPARK-2595:


I was not proposing that we should do this. I was just attempting to summarize 
what the existing patch does.

> The driver run garbage collection, when the executor throws OutOfMemoryError 
> exception
> --
>
> Key: SPARK-2595
> URL: https://issues.apache.org/jira/browse/SPARK-2595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> The [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] 
> implementation of GC-based cleaning only considers the memory usage of the 
> driver. We should consider more factors to trigger GC, e.g. executor exit 
> code, task exceptions, and task GC time.
> [~pwendell]'s proposal:
> if we detect memory pressure on the executors, we should try to trigger a GC 
> on the driver so that if there happen to be RDDs that have gone out of scope 
> on the driver side, their associated cache blocks will be cleaned up on the 
> executors, freeing up memory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception

2014-07-19 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2595:
---

Description: 
The [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation 
of GC-based cleaning only considers the memory usage of the driver. We should 
consider more factors to trigger GC, e.g. executor exit code, task exceptions, 
and task GC time.

[~pwendell]'s proposal:
if we detect memory pressure on the executors, we should try to trigger a GC on 
the driver so that if there happen to be RDDs that have gone out of scope on 
the driver side, their associated cache blocks will be cleaned up on the 
executors, freeing up memory.

  was:
[SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation 
GC-based cleaning only consider the memory usage of the drive. We should 
consider more factors to trigger gc.includes executor exit code, task 
exception, task gc time .

[~pwendell]'s proposal:
if we detect memory pressure on the executors we should try to trigger a GC on 
the driver so that if there happen to be RDD's that have gone out of scope on 
the driver side, their associated cache blocks will be cleaned up on executors 
and free up memory.


> The driver run garbage collection, when the executor throws OutOfMemoryError 
> exception
> --
>
> Key: SPARK-2595
> URL: https://issues.apache.org/jira/browse/SPARK-2595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> The [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] 
> implementation of GC-based cleaning only considers the memory usage of the 
> driver. We should consider more factors to trigger GC, e.g. executor exit 
> code, task exceptions, and task GC time.
> [~pwendell]'s proposal:
> if we detect memory pressure on the executors, we should try to trigger a GC 
> on the driver so that if there happen to be RDDs that have gone out of scope 
> on the driver side, their associated cache blocks will be cleaned up on the 
> executors, freeing up memory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2595) The driver run garbage collection, when the executor throws OutOfMemoryError exception

2014-07-19 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-2595:
--

 Summary: The driver run garbage collection, when the executor 
throws OutOfMemoryError exception
 Key: SPARK-2595
 URL: https://issues.apache.org/jira/browse/SPARK-2595
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Guoqiang Li


The [SPARK-1103|https://issues.apache.org/jira/browse/SPARK-1103] implementation 
of GC-based cleaning only considers the memory usage of the driver. We should 
consider more factors to trigger GC, including executor exit code, task 
exceptions, and task GC time.

[~pwendell]'s proposal:
if we detect memory pressure on the executors, we should try to trigger a GC on 
the driver so that if there happen to be RDDs that have gone out of scope on 
the driver side, their associated cache blocks will be cleaned up on the 
executors, freeing up memory.
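
A thin, hedged sketch of that idea (the hook and its wiring are assumptions, not
existing Spark APIs; only System.gc() and the driver-side cleanup behaviour it
relies on are real):

{code}
object DriverGcTrigger {
  // Would be called when an executor exits or a task fails; `memoryRelated` would be
  // derived from the executor exit code or an OutOfMemoryError in the failure reason.
  def onExecutorProblem(memoryRelated: Boolean): Unit = {
    if (memoryRelated) {
      // A driver-side GC lets weak references to out-of-scope RDDs be enqueued, so their
      // cached blocks can be removed from the executors, freeing memory there.
      System.gc()
    }
  }
}
{code}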



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2591) Add config property to disable incremental collection used in Thrift server

2014-07-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067563#comment-14067563
 ] 

Michael Armbrust commented on SPARK-2591:
-

We should benchmark this and make sure that there is measurable benefit to 
collecting all of the results at once.  I'd like to avoid additional 
configuration options where possible.

> Add config property to disable incremental collection used in Thrift server
> ---
>
> Key: SPARK-2591
> URL: https://issues.apache.org/jira/browse/SPARK-2591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Priority: Minor
>
> {{SparkSQLOperationManager}} uses {{RDD.toLocalIterator}} to collect the 
> result set one partition at a time. This is useful to avoid OOM when the 
> result is large, but introduces extra job scheduling costs as each partition 
> is collected with a separate job. Users may want to disable this when the 
> result set is expected to be small.
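
For reference, a hedged sketch of the toggle under discussion (the boolean flag is
hypothetical; only RDD.toLocalIterator and RDD.collect are real APIs):

{code}
import org.apache.spark.rdd.RDD

def fetchResults[T](rdd: RDD[T], incrementalCollect: Boolean): Iterator[T] =
  if (incrementalCollect) {
    rdd.toLocalIterator // one job per partition: bounded driver memory, more scheduling overhead
  } else {
    rdd.collect().iterator // a single job: the whole result set is held on the driver at once
  }
{code}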



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2594) Add CACHE TABLE AS SELECT ...

2014-07-19 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-2594:
---

 Summary: Add CACHE TABLE  AS SELECT ...
 Key: SPARK-2594
 URL: https://issues.apache.org/jira/browse/SPARK-2594
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file

2014-07-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2576:


Target Version/s: 1.1.0

> slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark 
> QL query on HDFS CSV file
> --
>
> Key: SPARK-2576
> URL: https://issues.apache.org/jira/browse/SPARK-2576
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.0.1
> Environment: One Mesos 0.19 master without zookeeper and 4 mesos 
> slaves. 
> JDK 1.7.51 and Scala 2.10.4 on all nodes. 
> HDFS from CDH5.0.3
> Spark version: I tried both with the pre-built CDH5 spark package available 
> from http://spark.apache.org/downloads.html and by packaging spark with sbt 
> 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here 
> http://mesosphere.io/learn/run-spark-on-mesos/
> All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have 
>Reporter: Svend Vanderveken
>
> Execution of SQL query against HDFS systematically throws a class not found 
> exception on slave nodes when executing .
> (this was originally reported on the user list: 
> http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html)
> Sample code (ran from spark-shell): 
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Car(timestamp: Long, objectid: String, isGreen: Boolean)
> // I get the same error when pointing to the folder 
> "hdfs://vm28:8020/test/cardata"
> val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")
> val cars = data.map(_.split(",")).map ( ar => Car(ar(0).toLong, ar(1), 
> ar(2).toBoolean))
> cars.registerAsTable("mcars")
> val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = 
> true")
> allgreens.collect.take(10).foreach(println)
> {code}
> Stack trace on the slave nodes: 
> {code}
> I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
> I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 
> 20140714-142853-485682442-5050-25487-2
> 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as 
> executor ID 20140714-142853-485682442-5050-25487-2
> 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: 
> mesos,mnubohadoop
> 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(mesos, 
> mnubohadoop)
> 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
> 14/07/16 13:01:17 INFO Remoting: Starting remoting
> 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://spark@vm23:38230]
> 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: 
> [akka.tcp://spark@vm23:38230]
> 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: 
> akka.tcp://spark@vm28:41632/user/MapOutputTracker
> 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: 
> akka.tcp://spark@vm28:41632/user/BlockManagerMaster
> 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-local-20140716130117-8ea0
> 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 
> MB.
> 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id 
> = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
> 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
> 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
> 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is 
> /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
> 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
> 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973
> 14/07/16 13:01:18 INFO Executor: Running task ID 2
> 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
> 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with 
> curMem=0, maxMem=309225062
> 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to 
> memory (estimated size 122.6 KB, free 294.8 MB)
> 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 
> 0.294602722 s
> 14/07/16 13:01:19 INFO HadoopRDD: Input split: 
> hdfs://vm28:8020/test/cardata/part-0:23960450+23960451
> I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown
> 14/07/16 13:01:20 ERROR Executor: Exception in task ID 2
> java.lang.NoClassDefFoundError: $line11/$read$
> at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
> at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
> at scala.collection.Iterator$$anon$11.next(

[jira] [Resolved] (SPARK-2591) Add config property to disable incremental collection used in Thrift server

2014-07-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2591.
-

Resolution: Duplicate

> Add config property to disable incremental collection used in Thrift server
> ---
>
> Key: SPARK-2591
> URL: https://issues.apache.org/jira/browse/SPARK-2591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Priority: Minor
>
> {{SparkSQLOperationManager}} uses {{RDD.toLocalIterator}} to collect the 
> result set one partition at a time. This is useful to avoid OOM when the 
> result is large, but introduces extra job scheduling costs as each partition 
> is collected with a separate job. Users may want to disable this when the 
> result set is expected to be small.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-07-19 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067540#comment-14067540
 ] 

Helena Edelson commented on SPARK-2593:
---

I should note that I'd be happy to do the changes. I am a committer to Akka 
Cluster.

> Add ability to pass an existing Akka ActorSystem into Spark
> ---
>
> Key: SPARK-2593
> URL: https://issues.apache.org/jira/browse/SPARK-2593
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Spark Core
>Reporter: Helena Edelson
>
> As a developer I want to pass an existing ActorSystem into StreamingContext 
> at load time so that I do not have two actor systems running on a node.
> This would mean having Spark's actor system on its own named dispatchers as 
> well as exposing the currently private creation of its own actor system.
> If it makes sense...
> I would like to create an Akka Extension that wraps around Spark/Spark 
> Streaming and Cassandra. The creation would simply be this for a user:
> val extension = SparkCassandra(system)
> and using it is as easy as:
> import extension._
> spark. // do work or, 
> streaming. // do work
>  
> All config comes from reference.conf and user overrides of that.
> The conf file would pick up settings from the deployed environment first, 
> then fall back to -D system properties, with a final fallback to configured 
> settings.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-07-19 Thread Helena Edelson (JIRA)
Helena Edelson created SPARK-2593:
-

 Summary: Add ability to pass an existing Akka ActorSystem into 
Spark
 Key: SPARK-2593
 URL: https://issues.apache.org/jira/browse/SPARK-2593
 Project: Spark
  Issue Type: Brainstorming
  Components: Spark Core
Reporter: Helena Edelson


As a developer I want to pass an existing ActorSystem into StreamingContext at 
load time so that I do not have two actor systems running on a node.

This would mean having Spark's actor system on its own named dispatchers as 
well as exposing the currently private creation of its own actor system.

If it makes sense...

I would like to create an Akka Extension that wraps around Spark/Spark 
Streaming and Cassandra. The creation would simply be this for a user:

val extension = SparkCassandra(system)

and using it is as easy as:

import extension._
spark. // do work or, 
streaming. // do work

All config comes from reference.conf and user overrides of that.
The conf file would pick up settings from the deployed environment first, then 
fall back to -D system properties, with a final fallback to configured settings.
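
A hedged sketch of what such an Akka Extension could look like ({{SparkCassandra}}
comes from the description above; the config keys and everything else here are
assumptions):

{code}
import akka.actor.{ExtendedActorSystem, Extension, ExtensionId}
import org.apache.spark.{SparkConf, SparkContext}

class SparkCassandraExt(system: ExtendedActorSystem) extends Extension {
  // Settings come from reference.conf plus user overrides, as described above.
  private val config = system.settings.config
  lazy val spark: SparkContext = new SparkContext(
    new SparkConf()
      .setMaster(config.getString("spark-cassandra.master"))    // assumed config key
      .setAppName(config.getString("spark-cassandra.app-name")) // assumed config key
  )
}

object SparkCassandra extends ExtensionId[SparkCassandraExt] {
  override def createExtension(system: ExtendedActorSystem): SparkCassandraExt =
    new SparkCassandraExt(system)
}

// Usage, matching the description: val extension = SparkCassandra(system); extension.spark ...
{code}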





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts

2014-07-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067512#comment-14067512
 ] 

Sean Owen commented on SPARK-2420:
--

https://github.com/srowen/spark/commit/f111393131008b72f641233ee9f5cb6f6cb4ff10

In terms of rectifying compile errors, downgrading to Guava 11 is 
straightforward. There is one non-trivial change. Previously takeOrdered used 
Guava's Ordering.leastOf(Iterator, int) to take the k smallest elements from an 
Iterator, then added those k to a BoundedPriorityQueue. This method is not 
available in Guava 11. However, it does not seem necessary to select the 
smallest k before putting them into a priority queue bounded to size k; the 
result is the same if I understand correctly. Staring at the code, I think 
Guava's optimization makes the whole process O(n + k log k), where n is the 
number of elements in the iterator, whereas the straightforward approach is 
O(n log k). I'd imagine the straightforward approach wins for small k, even. 
Not sure if there is some history on this particular choice.
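
A hedged sketch of the straightforward O(n log k) approach described above (plain
Scala, not Spark's internal BoundedPriorityQueue): keep a size-k max-heap and evict
the largest retained element whenever a smaller one arrives.

{code}
import scala.collection.mutable

def takeSmallest[T](iter: Iterator[T], k: Int)(implicit ord: Ordering[T]): Seq[T] = {
  val heap = mutable.PriorityQueue.empty[T](ord) // max-heap: head is the largest retained element
  iter.foreach { elem =>
    if (heap.size < k) {
      heap.enqueue(elem)
    } else if (ord.lt(elem, heap.head)) {
      heap.dequeue()     // drop the current largest of the k retained elements
      heap.enqueue(elem) // each insert/remove costs O(log k), so the whole scan is O(n log k)
    }
  }
  // The heap now holds the k smallest elements; sort them for a deterministic result.
  heap.toSeq.sorted(ord)
}
{code}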

> Change Spark build to minimize library conflicts
> 
>
> Key: SPARK-2420
> URL: https://issues.apache.org/jira/browse/SPARK-2420
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.0.0
>Reporter: Xuefu Zhang
> Attachments: spark_1.0.0.patch
>
>
> During the prototyping of HIVE-7292, many library conflicts showed up because 
> the Spark build contains versions of libraries that are vastly different from 
> the current major Hadoop version. It would be nice if we could choose versions 
> that are in line with Hadoop or shade them in the assembly. Here is the wish 
> list:
> 1. Upgrade the protobuf version to 2.5.0 from the current 2.4.1
> 2. Shade Spark's Jetty and servlet dependencies in the assembly.
> 3. Guava version difference. Spark is using a higher version. I'm not sure 
> what the best solution for this is.
> The list may grow as HIVE-7292 proceeds.
> For information only, the attached is a patch that we applied on Spark in 
> order to make Spark work with Hive. It gives an idea of the scope of changes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1997) Update breeze to version 0.8.1

2014-07-19 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067504#comment-14067504
 ] 

Guoqiang Li commented on SPARK-1997:


I'm sorry, I came late.
The breeze 0.8.1 jar has {{6916}} files.
Dependency changes in breeze 0.8.1:
||changes ||packages||license||
|additional|org.scalamacros:quasiquotes_2.10:2.0.0-M8|BSD-like|
|additional|com.typesafe.scala-logging:scala-logging-slf4j_2.10:2.1.2|Apache 
2.0|
|remove|com.typesafe:scalalogging-slf4j_2.10:1.0.1|Apache 2.0|
|upgrade|org.scalanlp:breeze-macros_2.10:0.7.4|Apache 2.0|



> Update breeze to version 0.8.1
> --
>
> Key: SPARK-1997
> URL: https://issues.apache.org/jira/browse/SPARK-1997
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> {{breeze 0.7}} does not support {{scala 2.11}} .



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.

2014-07-19 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067503#comment-14067503
 ] 

William Benton commented on SPARK-2226:
---

[~rxin], yes, and I'm mostly done.  I'll post a PR soon!

> HAVING should be able to contain aggregate expressions that don't appear in 
> the aggregation list. 
> --
>
> Key: SPARK-2226
> URL: https://issues.apache.org/jira/browse/SPARK-2226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: William Benton
>
> https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q
> This test file contains the following query:
> {code}
> SELECT key FROM src GROUP BY key HAVING max(value) > "val_255";
> {code}
> Once we fixed this issue, we should whitelist having.q.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2552) Stabilize the computation of logistic function in pyspark

2014-07-19 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067449#comment-14067449
 ] 

Xiangrui Meng commented on SPARK-2552:
--

PR: https://github.com/apache/spark/pull/1493

> Stabilize the computation of logistic function in pyspark
> -
>
> Key: SPARK-2552
> URL: https://issues.apache.org/jira/browse/SPARK-2552
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>  Labels: Starter
>
> exp(1000) throws an error in Python. For the logistic function, we can use 
> either 1 / (1 + exp(-x)) or 1 - 1 / (1 + exp(x)) to compute its value, 
> ensuring exp always takes a negative argument.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1997) Update breeze to version 0.8.1

2014-07-19 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067450#comment-14067450
 ] 

Xiangrui Meng commented on SPARK-1997:
--

PR: https://github.com/apache/spark/pull/940

> Update breeze to version 0.8.1
> --
>
> Key: SPARK-1997
> URL: https://issues.apache.org/jira/browse/SPARK-1997
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> {{breeze 0.7}} does not support {{scala 2.11}} .



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2495) Ability to re-create ML models

2014-07-19 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067445#comment-14067445
 ] 

Xiangrui Meng commented on SPARK-2495:
--

I sent out a PR for linear models: https://github.com/apache/spark/pull/1492 . 
For MatrixFactorizationModel, one thing we are not sure about is the type of 
the ids. But we should definitely make those constructors available in v1.1.

> Ability to re-create ML models
> --
>
> Key: SPARK-2495
> URL: https://issues.apache.org/jira/browse/SPARK-2495
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.1
>Reporter: Alexander Albul
>Assignee: Alexander Albul
>
> Hi everyone.
> Previously (prior to Spark 1.0) we were working with MLlib like this:
> 1) Calculate the model (costly operation)
> 2) Take the model and collect its fields, like weights, intercept, etc.
> 3) Store the model somewhere in our own format
> 4) Do predictions by loading the model attributes, creating a new model and 
> predicting with it.
> Now I see that the model constructors have a *private* modifier and the models 
> cannot be created from outside.
> If you want to hide implementation details and keep the constructor as 
> "developer api", why not at least create a method which takes the weights and 
> intercept (for example) and materializes that model?
> A good example of the model I am talking about is *LinearRegressionModel*.
> I know that the *LinearRegressionWithSGD* class has a *createModel* method, 
> but the problem is that it has a *protected* modifier as well.
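
A hedged sketch of that workflow, assuming the constructor becomes public (which
is essentially what is being requested); the stored values below are placeholders:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

// 1) After training, persist model.weights and model.intercept in your own format.
// 2) Later, materialize a model from the stored attributes and predict with it.
val storedWeights = Array(0.5, -1.2, 3.0) // placeholder values
val storedIntercept = 0.1
val model = new LinearRegressionModel(Vectors.dense(storedWeights), storedIntercept)
val prediction = model.predict(Vectors.dense(1.0, 2.0, 3.0))
{code}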



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2552) Stabilize the computation of logistic function in pyspark

2014-07-19 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067419#comment-14067419
 ] 

Xiangrui Meng commented on SPARK-2552:
--

It is not necessary to check the ranges because exp never overflows on a 
negative argument (at worst it underflows harmlessly toward 0). So the function 
is just

{code}
import math

def logistic(x):
    if x > 0:
        return 1 / (1 + math.exp(-x))
    else:
        return 1 - 1 / (1 + math.exp(x))
{code}


> Stabilize the computation of logistic function in pyspark
> -
>
> Key: SPARK-2552
> URL: https://issues.apache.org/jira/browse/SPARK-2552
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>  Labels: Starter
>
> exp(1000) throws an error in Python. For the logistic function, we can use 
> either 1 / (1 + exp(-x)) or 1 - 1 / (1 + exp(x)) to compute its value, 
> ensuring exp always takes a negative argument.



--
This message was sent by Atlassian JIRA
(v6.2#6252)