[jira] [Updated] (SPARK-7379) pickle.loads expects a string instead of bytes in Python 3.

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7379:
-
Component/s: PySpark

> pickle.loads expects a string instead of bytes in Python 3.
> ---
>
> Key: SPARK-7379
> URL: https://issues.apache.org/jira/browse/SPARK-7379
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>
> In PickleSerializer, we call pickle.loads in Python 3. However, the input obj 
> could be bytes, which works in Python 2 but not 3.
> The error message is:
> {code}
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/serializers.py",
>  line 418, in loads
> return pickle.loads(obj, encoding=encoding)
> TypeError: must be a unicode character, not bytes
> {code}
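
For reference, here is a small standalone Python sketch (not PySpark's code or its
eventual fix) of the pickle.loads contract involved: on Python 3 the payload must be
bytes-like, and the optional encoding keyword must be a text string rather than bytes.
{code}
# Standalone sketch, Python 3.  Illustrates the contract only; not Spark code.
import pickle

payload = pickle.dumps({"a": 1})              # bytes on both Python 2 and 3

# Works: bytes payload, text-string encoding (the keyword exists only on Python 3).
print(pickle.loads(payload, encoding="bytes"))

# Raises a TypeError (exact message varies by Python version) because the
# encoding argument itself is bytes rather than str.
try:
    pickle.loads(payload, encoding=b"bytes")
except TypeError as exc:
    print("TypeError:", exc)
{code}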






[jira] [Updated] (SPARK-7373) Support launching Spark drivers in Docker images with Mesos cluster mode

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7373:
-
Component/s: Mesos

> Support launching Spark drivers in Docker images with Mesos cluster mode
> 
>
> Key: SPARK-7373
> URL: https://issues.apache.org/jira/browse/SPARK-7373
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Support launching Spark drivers in Docker images with Mesos cluster mode






[jira] [Updated] (SPARK-3134) Update block locations asynchronously in TorrentBroadcast

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3134:
-
Target Version/s:   (was: 1.2.0)

> Update block locations asynchronously in TorrentBroadcast
> -
>
> Key: SPARK-3134
> URL: https://issues.apache.org/jira/browse/SPARK-3134
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager
>Reporter: Reynold Xin
>
> Once the TorrentBroadcast gets the data blocks, it needs to tell the master 
> the new location. We should make the location update non-blocking to reduce 
> the round trips needed to launch tasks.






[jira] [Updated] (SPARK-3684) Can't configure local dirs in Yarn mode

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3684:
-
Target Version/s:   (was: 1.2.0)

> Can't configure local dirs in Yarn mode
> ---
>
> Key: SPARK-3684
> URL: https://issues.apache.org/jira/browse/SPARK-3684
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> We can't set SPARK_LOCAL_DIRS or spark.local.dirs because they're not picked 
> up in Yarn mode. However, we can't set YARN_LOCAL_DIRS or LOCAL_DIRS either 
> because these are overridden by Yarn.
> I'm trying to set these through SPARK_YARN_USER_ENV. I'm aware that the 
> default behavior is for Spark to use Yarn's local dirs, but right now there's 
> no way to change it even if the user wants to.






[jira] [Updated] (SPARK-3631) Add docs for checkpoint usage

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3631:
-
Target Version/s:   (was: 1.2.0)

> Add docs for checkpoint usage
> -
>
> Key: SPARK-3631
> URL: https://issues.apache.org/jira/browse/SPARK-3631
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Andrew Ash
>
> We should include general documentation on using checkpoints.  Right now the 
> docs only cover checkpoints in the Spark Streaming use case which is slightly 
> different from Core.
> Some content to consider for inclusion from [~brkyvz]:
> {quote}
> If you set the checkpointing directory, however, the intermediate state of the 
> RDDs will be saved in HDFS, and the lineage will pick up from there.
> You won't need to keep the shuffle data written before the checkpointed state, 
> so it can be safely removed (it will be removed automatically).
> However, checkpoint must be called explicitly as in 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291
>  ,just setting the directory will not be enough.
> {quote}
> {quote}
> Yes, writing to HDFS is more expensive, but I feel it is still a small price 
> to pay when compared to having a Disk Space Full error three hours in
> and having to start from scratch.
> The main goal of checkpointing is to truncate the lineage. Clearing up 
> shuffle writes comes as a bonus to checkpointing; it is not the main goal. The
> subtlety here is that .checkpoint() is just like .cache(). Until you call an 
> action, nothing happens. Therefore, if you're going to do 1000 maps in a
> row and you don't want to checkpoint in the meantime until a shuffle happens, 
> you will still get a StackOverflowError, because the lineage is too long.
> I went through some of the code for checkpointing. As far as I can tell, it 
> materializes the data in HDFS, and resets all its dependencies, so you start
> a fresh lineage. My understanding would be that checkpointing still should be 
> done every N operations to reset the lineage. However, an action must be
> performed before the lineage grows too long.
> {quote}
> A good place to put this information would be at 
> https://spark.apache.org/docs/latest/programming-guide.html
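
As a rough illustration of the behaviour described in the quotes above (setting the
directory alone does nothing; checkpoint(), like cache(), only takes effect once an
action runs), a minimal PySpark sketch might look like this. An existing SparkContext
`sc` is assumed and the HDFS path is a placeholder.
{code}
# Minimal sketch of core (non-streaming) checkpointing, assuming a SparkContext `sc`.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   # setting the dir alone does nothing

rdd = sc.parallelize(range(1000)).map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)
rdd.checkpoint()             # like cache(): only marks the RDD, nothing is written yet
rdd.count()                  # the action materializes the RDD and writes the checkpoint

# After the action the lineage is truncated; the RDD reads from the checkpoint
# files instead of recomputing its parents.
print(rdd.isCheckpointed())  # True once an action has run
{code}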






[jira] [Updated] (SPARK-1832) Executor UI improvement suggestions

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1832:
-
Target Version/s:   (was: 1.2.0)

> Executor UI improvement suggestions
> ---
>
> Key: SPARK-1832
> URL: https://issues.apache.org/jira/browse/SPARK-1832
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> I received some suggestions from a user for the /executors UI page to make it 
> more helpful. This gets more important when you have a really large number of 
> executors.
>  Fill some of the cells with color in order to make it easier to absorb 
> the info, e.g.
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed))
> - if dark blue then write the value in white (same for the RED and GREEN above)
> Maybe mark the MASTER task somehow
>  
> Report the TOTALS in each column (do this at the TOP so no need to scroll 
> to the bottom, or print both at top and bottom).






[jira] [Updated] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3630:
-
Target Version/s:   (was: 1.1.1, 1.2.0)

> Identify cause of Kryo+Snappy PARSING_ERROR
> ---
>
> Key: SPARK-3630
> URL: https://issues.apache.org/jira/browse/SPARK-3630
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>
> A recent GraphX commit caused non-deterministic exceptions in unit tests so 
> it was reverted (see SPARK-3400).
> Separately, [~aash] observed the same exception stacktrace in an 
> application-specific Kryo registrator:
> {noformat}
> com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
> uncompress the chunk: PARSING_ERROR(2)
> com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
> com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
> com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
> com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
>  
> com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
>  
> com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
>  
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> ...
> {noformat}
> This ticket is to identify the cause of the exception in the GraphX commit so 
> the faulty commit can be fixed and merged back into master.






[jira] [Updated] (SPARK-3385) Improve shuffle performance

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3385:
-
Target Version/s:   (was: 1.3.0)

> Improve shuffle performance
> ---
>
> Key: SPARK-3385
> URL: https://issues.apache.org/jira/browse/SPARK-3385
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> Just a ticket to track various efforts related to improving shuffle in Spark.






[jira] [Updated] (SPARK-1762) Add functionality to pin RDDs in cache

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1762:
-
Target Version/s:   (was: 1.2.0)

> Add functionality to pin RDDs in cache
> --
>
> Key: SPARK-1762
> URL: https://issues.apache.org/jira/browse/SPARK-1762
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>
> Right now, all RDDs are created equal, and there is no mechanism to identify 
> a certain RDD to be more important than the rest. This is a problem if the 
> RDD fraction is small, because just caching a few RDDs can evict more 
> important ones.
> A side effect of this feature is that we can now more safely allocate a 
> smaller spark.storage.memoryFraction if we know how large our important RDDs 
> are, without having to worry about them being evicted. This allows us to use 
> more memory for shuffles, for instance, and avoid disk spills.






[jira] [Updated] (SPARK-3513) Provide a utility for running a function once on each executor

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3513:
-
Target Version/s:   (was: 1.2.0)

> Provide a utility for running a function once on each executor
> --
>
> Key: SPARK-3513
> URL: https://issues.apache.org/jira/browse/SPARK-3513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Minor
>
> This is minor, but it would be nice to have a utility where you can pass a 
> function and it will run some arbitrary function once on each executor 
> and return the result to you (e.g. you could perform a jstack from within the 
> JVM). You could probably hack it together with custom locality preferences, 
> accessing the list of live executors, and mapPartitions.
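
For illustration, here is a rough PySpark approximation of that hack (not the proposed
utility). The helper name is made up; with enough partitions the function tends to run
on every executor, but Spark gives no per-executor guarantee here, which is exactly
what this ticket asks for.
{code}
# Run a function in every partition of a dummy RDD and collect the results.
# Assumes an existing SparkContext `sc`.
import socket

def run_on_partitions(sc, func, num_partitions=None):
    n = num_partitions or sc.defaultParallelism
    dummy = sc.parallelize(range(n), n)
    return dummy.mapPartitions(lambda _: [func()]).collect()

# Example: report which hosts the tasks actually ran on.
# hosts = run_on_partitions(sc, socket.gethostname)
{code}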






[jira] [Updated] (SPARK-4752) Classifier based on artificial neural network

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4752:
-
Target Version/s:   (was: 1.3.0)

> Classifier based on artificial neural network
> -
>
> Key: SPARK-4752
> URL: https://issues.apache.org/jira/browse/SPARK-4752
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Alexander Ulanov
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Implement classifier based on artificial neural network (ANN). Requirements:
> 1) Use the existing artificial neural network implementation 
> https://issues.apache.org/jira/browse/SPARK-2352, 
> https://github.com/apache/spark/pull/1290
> 2) Extend MLlib ClassificationModel trait, 
> 3) Like other classifiers in MLlib, accept RDD[LabeledPoint] for training,
> 4) Be able to return the ANN model






[jira] [Updated] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3441:
-
Target Version/s:   (was: 1.2.0)

> Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style 
> shuffle
> ---
>
> Key: SPARK-3441
> URL: https://issues.apache.org/jira/browse/SPARK-3441
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>
> I think it would be good to say something like this in the doc for 
> repartitionAndSortWithinPartitions and add also maybe in the doc for groupBy:
> {code}
> This can be used to enact a "Hadoop Style" shuffle along with a call to 
> mapPartitions, e.g.:
>rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...)
> {code}
> It might also be nice to add a version that doesn't take a partitioner and/or 
> to mention this in the groupBy javadoc. I guess it depends a bit on whether we 
> consider this to be an API we want people to use more widely or whether we 
> just consider it a narrow stable API mostly for Hive-on-Spark. If we want 
> people to consider this API when porting workloads from Hadoop, then it might 
> be worth documenting better.
> What do you think [~rxin] and [~matei]?
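
For context, a hedged PySpark sketch of the same "Hadoop style" pattern: partition by
key, sort within each partition, then stream over each partition much like a Hadoop
reducer would. An existing SparkContext `sc` is assumed and the sample data is made up.
{code}
pairs = sc.parallelize([("b", 2), ("a", 1), ("a", 3), ("c", 4)])

sorted_parts = pairs.repartitionAndSortWithinPartitions(numPartitions=2)

def reduce_like(iterator):
    # Equal keys are adjacent within a partition because of the sort.
    for key, value in iterator:
        yield (key, value)

result = sorted_parts.mapPartitions(reduce_like).collect()
{code}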






[jira] [Updated] (SPARK-3629) Improvements to YARN doc

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3629:
-
Target Version/s:   (was: 1.1.1, 1.2.0)

> Improvements to YARN doc
> 
>
> Key: SPARK-3629
> URL: https://issues.apache.org/jira/browse/SPARK-3629
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, YARN
>Reporter: Matei Zaharia
>  Labels: starter
>
> Right now this doc starts off with a big list of config options, and only 
> then tells you how to submit an app. It would be better to put that part and 
> the packaging part first, and the config options only at the end.
> In addition, the doc mentions yarn-cluster vs yarn-client as separate 
> masters, which is inconsistent with the help output from spark-submit (which 
> says to always use "yarn").






[jira] [Updated] (SPARK-3982) receiverStream in Python API

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3982:
-
Target Version/s:   (was: 1.2.0)

> receiverStream in Python API
> 
>
> Key: SPARK-3982
> URL: https://issues.apache.org/jira/browse/SPARK-3982
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Streaming
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> receiverStream() is used to extend the input sources of Spark Streaming; it would 
> be very useful to have it in the Python API.






[jira] [Updated] (SPARK-3166) Custom serialisers can't be shipped in application jars

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3166:
-
Target Version/s:   (was: 1.2.0)

> Custom serialisers can't be shipped in application jars
> ---
>
> Key: SPARK-3166
> URL: https://issues.apache.org/jira/browse/SPARK-3166
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Graham Dennis
>
> Spark cannot currently use a custom serialiser that is shipped with the 
> application jar. Trying to do this causes a java.lang.ClassNotFoundException 
> when trying to instantiate the custom serialiser in the Executor processes. 
> This occurs because Spark attempts to instantiate the custom serialiser 
> before the application jar has been shipped to the Executor process. A 
> reproduction of the problem is available here: 
> https://github.com/GrahamDennis/spark-custom-serialiser
> I've verified this problem in Spark 1.0.2, and Spark master and 1.1 branches 
> as of August 21, 2014.  This issue is related to SPARK-2878, and my fix for 
> that issue (https://github.com/apache/spark/pull/1890) also solves this.  My 
> pull request was not merged because it adds the user jar to the Executor 
> processes' class path at launch time.  Such a significant change was thought 
> by [~rxin] to require more QA, and should be considered for inclusion in 1.2 
> at the earliest.






[jira] [Updated] (SPARK-3514) Provide a utility function for returning the hosts (and number) of live executors

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3514:
-
Target Version/s:   (was: 1.2.0)

> Provide a utility function for returning the hosts (and number) of live 
> executors
> -
>
> Key: SPARK-3514
> URL: https://issues.apache.org/jira/browse/SPARK-3514
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Minor
>
> It would be nice to tell user applications how many executors they have 
> currently running in their application. Also, we could give them the host 
> names on which the executors are running.






[jira] [Updated] (SPARK-3146) Improve the flexibility of Spark Streaming Kafka API to offer user the ability to process message before storing into BM

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3146:
-
Target Version/s:   (was: 1.2.0)

> Improve the flexibility of Spark Streaming Kafka API to offer user the 
> ability to process message before storing into BM
> 
>
> Key: SPARK-3146
> URL: https://issues.apache.org/jira/browse/SPARK-3146
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Saisai Shao
>
> Currently the Spark Streaming Kafka API stores the key and value of each message 
> into the BM for processing; potentially this loses the flexibility needed for 
> different requirements:
> 1. Currently, topic/partition/offset information for each message is discarded 
> by KafkaInputDStream. In some scenarios people may need this information to 
> better filter the message, as SPARK-2388 describes.
> 2. People may need to add a timestamp to each message when feeding it into Spark 
> Streaming, which can better measure the system latency.
> 3. Checkpointing the partition/offsets or others...
> So here we add a messageHandler to the interface to give people the flexibility 
> to preprocess a message before storing it into the BM. Meanwhile, this 
> improvement keeps compatibility with the current API.






[jira] [Updated] (SPARK-4902) gap-sampling performance optimization

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4902:
-
Target Version/s:   (was: 1.2.0, 1.3.0)

> gap-sampling performance optimization
> -
>
> Key: SPARK-4902
> URL: https://issues.apache.org/jira/browse/SPARK-4902
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> {{CacheManager.getOrCompute}} returns an instance of InterruptibleIterator 
> that contains an array or an iterator (when there is not enough memory). 
> The GapSamplingIterator implementation is as follows:
> {code}
> private val iterDrop: Int => Unit = {
> val arrayClass = Array.empty[T].iterator.getClass
> val arrayBufferClass = ArrayBuffer.empty[T].iterator.getClass
> data.getClass match {
>   case `arrayClass` => ((n: Int) => { data = data.drop(n) })
>   case `arrayBufferClass` => ((n: Int) => { data = data.drop(n) })
>   case _ => ((n: Int) => {
>   var j = 0
>   while (j < n && data.hasNext) {
> data.next()
> j += 1
>   }
> })
> }
>   }
> {code}
> The code does not deal with InterruptibleIterator.
> As a result, the following code can't use the {{Iterator.drop}} method:
> {code}
> rdd.cache()
> rdd.sample(false,0.1)
> {code}






[jira] [Updated] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2365:
-
Target Version/s:   (was: 1.2.0)

> Add IndexedRDD, an efficient updatable key-value store
> --
>
> Key: SPARK-2365
> URL: https://issues.apache.org/jira/browse/SPARK-2365
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This 
> imposes minimal requirements on the storage layer, which only needs to 
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient 
> support for point lookups would enable serving data out of RDDs, but it 
> currently requires iterating over an entire partition to find the desired 
> element. Point updates similarly require copying an entire iterator. Joins 
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value 
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
> uniqueness and pre-indexing the entries for efficient joins and point 
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) 
> maintaining a hash index within each partition, and (3) using purely 
> functional (immutable and efficiently updatable) data structures to enable 
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a 
> limited form of this functionality in VertexRDD. We envision a variety of 
> other uses for IndexedRDD, including streaming updates to RDDs, direct 
> serving from RDDs, and as an execution strategy for Spark SQL.
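
For illustration only (this is not the proposed IndexedRDD API), a small PySpark sketch
contrasting today's scan-based point lookup with the "hash-partition plus per-partition
index" idea described above. `sc` and the sample data are assumed.
{code}
pairs = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])

# Today: lookup() has to scan partition data to find the key.
print(pairs.lookup(2))                       # ['b']

# Proposed direction, roughly: hash-partition by key, then keep a dict-like index
# inside each partition so lookups avoid full scans.
indexed = (pairs.partitionBy(4)
                .mapPartitions(lambda it: [dict(it)], preservesPartitioning=True)
                .cache())

def get(rdd, key):
    # This sketch probes every partition's index; a real IndexedRDD would route
    # the lookup to the single partition that owns the key.
    return [v for v in rdd.map(lambda d: d.get(key)).collect() if v is not None]

print(get(indexed, 2))                       # ['b']
{code}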






[jira] [Updated] (SPARK-3454) Expose JSON representation of data shown in WebUI

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3454:
-
Target Version/s:   (was: 1.2.0)

> Expose JSON representation of data shown in WebUI
> -
>
> Key: SPARK-3454
> URL: https://issues.apache.org/jira/browse/SPARK-3454
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>Assignee: Imran Rashid
> Fix For: 1.4.0
>
> Attachments: sparkmonitoringjsondesign.pdf
>
>
> If the WebUI supported extracting its data in JSON format, it would be helpful 
> for users who want to analyse stage / task / executor information.
> Fortunately, the WebUI has a renderJson method, so we can implement that method in 
> each subclass.






[jira] [Updated] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1823:
-
Target Version/s:   (was: 1.2.0)

> ExternalAppendOnlyMap can still OOM if one key is very large
> 
>
> Key: SPARK-1823
> URL: https://issues.apache.org/jira/browse/SPARK-1823
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Andrew Or
>
> If the values for one key do not collectively fit into memory, then the map 
> will still OOM when you merge the spilled contents back in.
> This is a problem especially for PySpark, since we hash the keys (Python 
> objects) before a shuffle, and there are only so many integers out there in 
> the world, so there could potentially be many collisions.






[jira] [Updated] (SPARK-3374) Spark on Yarn config cleanup

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3374:
-
Target Version/s:   (was: 1.2.0)

> Spark on Yarn config cleanup
> 
>
> Key: SPARK-3374
> URL: https://issues.apache.org/jira/browse/SPARK-3374
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>
> The configs in YARN have gotten scattered and inconsistent between cluster 
> and client modes and in supporting backwards compatibility.  We should try to 
> clean this up, move things to common places, and support configs across both 
> cluster and client modes where we want to make them public.






[jira] [Updated] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3461:
-
Target Version/s:   (was: 1.2.0)

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Sandy Ryza
>Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.
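
For illustration, a hedged PySpark sketch of the idea (not the proposed operator):
sort within partitions, then detect group boundaries while streaming through each
partition; itertools.groupby plays the role of the lookahead iterator described above.
Unlike the external operator proposed here, this sketch still materializes each group's
values in memory. `sc` and the sample data are assumed.
{code}
from itertools import groupby
from operator import itemgetter

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

def group_sorted_partition(iterator):
    # Equal keys are adjacent after the sort, so each run is one group.
    for key, group in groupby(iterator, key=itemgetter(0)):
        yield (key, [v for _, v in group])

grouped = (pairs
           .repartitionAndSortWithinPartitions(numPartitions=2)
           .mapPartitions(group_sorted_partition, preservesPartitioning=True))

print(grouped.collect())   # e.g. [('a', [1, 3]), ('b', [2, 4])]
{code}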






[jira] [Updated] (SPARK-1642) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1642:
-
Target Version/s:   (was: 1.2.0)

> Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083
> ---
>
> Key: SPARK-1642
> URL: https://issues.apache.org/jira/browse/SPARK-1642
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Ted Malaska
>Assignee: Ted Malaska
>Priority: Minor
>
> This will add support for SSL encryption between Flume AvroSink and Spark 
> Streaming.
> It is based on FLUME-2083






[jira] [Updated] (SPARK-4106) Shuffle write and spill to disk metrics are incorrect

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4106:
-
Target Version/s:   (was: 1.2.0)

> Shuffle write and spill to disk metrics are incorrect
> -
>
> Key: SPARK-4106
> URL: https://issues.apache.org/jira/browse/SPARK-4106
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: Aaron Davidson
>Priority: Critical
>
> I have encountered a job which has some disk spilled (memory) but the disk 
> spilled (disk) is 0, as well as the shuffle write. If I switch to hash based 
> shuffle, where there happens to be no disk spilling, then the shuffle write 
> is correct.
> I can get more info on a workload to repro this situation, but perhaps that 
> state of events is sufficient.






[jira] [Updated] (SPARK-4356) Test Scala 2.11 on Jenkins

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4356:
-
Target Version/s:   (was: 1.2.0)

> Test Scala 2.11 on Jenkins
> --
>
> Key: SPARK-4356
> URL: https://issues.apache.org/jira/browse/SPARK-4356
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Critical
>
> We need to make some modifications to the test harness so that we can test 
> Scala 2.11 in Maven regularly.






[jira] [Updated] (SPARK-3115) Improve task broadcast latency for small tasks

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3115:
-
Target Version/s:   (was: 1.2.0)

> Improve task broadcast latency for small tasks
> --
>
> Key: SPARK-3115
> URL: https://issues.apache.org/jira/browse/SPARK-3115
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Shivaram Venkataraman
>Assignee: Reynold Xin
>
> Broadcasting the task information helps reduce the amount of data transferred 
> for large tasks. However we've seen that this adds more latency for small 
> tasks. It'll be great to profile and fix this.






[jira] [Updated] (SPARK-4134) Tone down scary executor lost messages when killing on purpose

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4134:
-
Target Version/s:   (was: 1.2.0)

> Tone down scary executor lost messages when killing on purpose
> --
>
> Key: SPARK-4134
> URL: https://issues.apache.org/jira/browse/SPARK-4134
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> After SPARK-3822 goes in, we are now able to dynamically kill executors after 
> an application has started. However, when we do that we get a ton of scary 
> error messages telling us that we've done wrong somehow. It would be good to 
> detect when this is the case and prevent these messages from surfacing.
> This may be difficult, however, because the connection manager tends to be 
> quite verbose in unconditionally logging disconnection messages. This is a 
> very nice-to-have for 1.2 but certainly not a blocker.






[jira] [Updated] (SPARK-2532) Fix issues with consolidated shuffle

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2532:
-
Target Version/s:   (was: 1.2.0)

> Fix issues with consolidated shuffle
> 
>
> Key: SPARK-2532
> URL: https://issues.apache.org/jira/browse/SPARK-2532
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
> Environment: All
>Reporter: Mridul Muralidharan
>Priority: Critical
>
> Will file PR with changes as soon as merge is done (earlier merge became 
> outdated in 2 weeks unfortunately :) ).
> Consolidated shuffle is broken in multiple ways in Spark:
> a) Task failure(s) can cause the state to become inconsistent.
> b) Multiple reverts or a combination of close/revert/close can cause the state 
> to be inconsistent.
> (As part of exception/error handling.)
> c) Some of the API in the block writer causes implementation issues - for 
> example: a revert is always followed by close, but the implementation tries to 
> keep them separate, leaving surface area for errors.
> d) Fetching data from consolidated shuffle files can go badly wrong if the 
> file is being actively written to: it computes length by subtracting the next 
> offset from the current offset (or the file length if this is the last offset) - 
> the latter fails when a fetch happens in parallel with a write.
> Note, this happens even if there are no task failures of any kind!
> This usually results in stream corruption or decompression errors.






[jira] [Updated] (SPARK-4568) Publish release candidates under $VERSION-RCX instead of $VERSION

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4568:
-
Target Version/s:   (was: 1.2.0)

> Publish release candidates under $VERSION-RCX instead of $VERSION
> -
>
> Key: SPARK-4568
> URL: https://issues.apache.org/jira/browse/SPARK-4568
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Critical
>







[jira] [Updated] (SPARK-2371) Show locally-running tasks (e.g. from take()) in web UI

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2371:
-
Target Version/s:   (was: 1.2.0)

> Show locally-running tasks (e.g. from take()) in web UI
> ---
>
> Key: SPARK-2371
> URL: https://issues.apache.org/jira/browse/SPARK-2371
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Matei Zaharia
>
> It's somewhat confusing that these don't show up, so you wonder whether your 
> job is frozen. We probably need to give them a stage ID and somehow mark them 
> specially in the UI.






[jira] [Updated] (SPARK-3218) K-Means clusterer can fail on degenerate data

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3218:
-
Target Version/s:   (was: 1.3.0)

> K-Means clusterer can fail on degenerate data
> -
>
> Key: SPARK-3218
> URL: https://issues.apache.org/jira/browse/SPARK-3218
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Assignee: Derrick Burns
>  Labels: clustering
>
> The KMeans parallel implementation selects points to be cluster centers with 
> probability weighted by their distance to cluster centers.  However, if there 
> are fewer than k DISTINCT points in the data set, this approach will fail.  
> Further, the recent checkin to work around this problem results in selection 
> of the same point repeatedly as a cluster center. 
> The fix is to allow fewer than k cluster centers to be selected.  This 
> requires several changes to the code, as the number of cluster centers is 
> woven into the implementation.
> I have a version of the code that addresses this problem, AND generalizes the 
> distance metric.  However, I see that there are literally hundreds of 
> outstanding pull requests.  If someone will commit to working with me to 
> sponsor the pull request, I will create it.






[jira] [Updated] (SPARK-3100) Spark RDD partitions are not running in the workers as per locality information given by each partition.

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3100:
-
Target Version/s:   (was: 1.2.0)

> Spark RDD partitions are not running in the workers as per locality 
> information given by each partition.
> 
>
> Key: SPARK-3100
> URL: https://issues.apache.org/jira/browse/SPARK-3100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Running in Spark Standalone Cluster
>Reporter: Ravindra Pesala(Old.Don't assign to it)
>
> I created a simple custom RDD (SampleRDD.scala) and created 4 splits for 4 
> workers.
> When I run this RDD in a Spark standalone cluster with 4 workers (even the master 
> machine has one worker), it runs all partitions on one node only, even though 
> I have given locality preferences in my SampleRDD program. 
> *Sample Code*
> {code}
> class SamplePartition(rddId: Int, val idx: Int,val tableSplit:Seq[String])
>   extends Partition {
>   override def hashCode(): Int = 41 * (41 + rddId) + idx 
>   override val index: Int = idx
> }
> class SampleRDD[K,V](
> sc : SparkContext,keyClass: KeyVal[K,V])
>   extends RDD[(K,V)](sc, Nil)
>   with Logging {
>   override def getPartitions: Array[Partition] = {
> val hosts = Array("master","slave1","slave2","slave3")
> val result = new Array[Partition](4)
> for (i <- 0 until result.length) 
> {
>   result(i) = new SamplePartition(id, i, Array(hosts(i)))
> }
> result
>   }
>   
>   
>   override def compute(theSplit: Partition, context: TaskContext) = {
> val iter = new Iterator[(K,V)] {
>   val split = theSplit.asInstanceOf[SamplePartition]
>   logInfo("Executed task for the split" + split.tableSplit)
> 
>   // Register an on-task-completion callback to close the input stream.
>   context.addOnCompleteCallback(() => close())
>   var havePair = false
>   var finished = false
>   override def hasNext: Boolean = {
> if (!finished && !havePair) 
> {
>   finished = !false
>   havePair = !finished
> }
> !finished
>   }
>   override def next(): (K,V) = {
> if (!hasNext) {
>   throw new java.util.NoSuchElementException("End of stream")
> }
> havePair = false
> val key = new Key()
> val value = new Value()
> keyClass.getKey(key, value)
>   }
>   private def close() {
> try {
> //  reader.close()
> } catch {
>   case e: Exception => logWarning("Exception in 
> RecordReader.close()", e)
> }
>   }
> }
> iter
>   }
>   
>   override def getPreferredLocations(split: Partition): Seq[String] = {
> val theSplit = split.asInstanceOf[SamplePartition]
> val s = theSplit.tableSplit.filter(_ != "localhost")
> logInfo("Host Name : "+s(0))
> s
>   }
> }
> trait KeyVal[K,V] extends Serializable {
>   def getKey(key : Key,value : Value) : (K,V) 
> }
> class KeyValImpl extends KeyVal[Key,Value] {
>   override def getKey(key : Key,value : Value) = (key,value)
> }
> case class Key()
> case class Value()
> object SampleRDD {
> def main(args: Array[String]) : Unit={
>   val d = SparkContext.jarOfClass(this.getClass)
>   val ar = new Array[String](d.size)
>   var i = 0
>   d.foreach{
> p=> ar(i)=p;
> i = i+1
> }   
>  val sc = new SparkContext("spark://master:7077", "SampleSpark", 
> "/opt/spark-1.0.0-rc3/",ar) 
>  val rdd = new SampleRDD(sc,new KeyValImpl());
>  rdd.collect;
> }
> }
> {code}
> Following is the log it shows.
> {code}
> INFO  18-08 16:38:33,382 - Executor updated: app-20140818163833-0005/0 is now 
> RUNNING
> INFO  18-08 16:38:33,382 - Executor updated: app-20140818163833-0005/2 is now 
> RUNNING
> INFO  18-08 16:38:33,383 - Executor updated: app-20140818163833-0005/1 is now 
> RUNNING
> INFO  18-08 16:38:33,385 - Executor updated: app-20140818163833-0005/3 is now 
> RUNNING
> INFO  18-08 16:38:34,976 - Registered executor: Actor 
> akka.tcp://sparkExecutor@master:47563/user/Executor#-398354094 with ID 0
> INFO  18-08 16:38:34,984 - Starting task 0.0:0 as TID 0 on executor 0: master 
> (PROCESS_LOCAL)
> INFO  18-08 16:38:34,989 - Serialized task 0.0:0 as 1261 bytes in 3 ms
> INFO  18-08 16:38:34,992 - Starting task 0.0:1 as TID 1 on executor 0: master 
> (PROCESS_LOCAL)
> INFO  18-08 16:38:34,993 - Serialized task 0.0:1 as 1261 bytes in 0 ms
> INFO  18-08 16:38:34,993 - Starting task 0.0:2 as TID 2 on executor 0: master 
> (PROCESS_LOCAL)*
> INFO  18-08 16:38:34,993 - Serialized task 0.0:2 as 1261 bytes in 0 ms
> INFO  18-08 16:38:34,

[jira] [Updated] (SPARK-3257) Enable :cp to add JARs in spark-shell (Scala 2.11)

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3257:
-
Target Version/s:   (was: 1.2.0)

> Enable :cp to add JARs in spark-shell (Scala 2.11)
> --
>
> Key: SPARK-3257
> URL: https://issues.apache.org/jira/browse/SPARK-3257
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Shell
>Reporter: Matei Zaharia
>Assignee: Heather Miller
>







[jira] [Updated] (SPARK-2992) The transforms formerly known as non-lazy

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2992:
-
Target Version/s:   (was: 1.2.0)

> The transforms formerly known as non-lazy
> -
>
> Key: SPARK-2992
> URL: https://issues.apache.org/jira/browse/SPARK-2992
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Erik Erlandson
>
> An umbrella for a grab-bag of tickets involving lazy implementations of 
> transforms formerly thought to be non-lazy.






[jira] [Updated] (SPARK-3137) Use finer grained locking in TorrentBroadcast.readObject

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3137:
-
Target Version/s:   (was: 1.2.0)

> Use finer grained locking in TorrentBroadcast.readObject
> 
>
> Key: SPARK-3137
> URL: https://issues.apache.org/jira/browse/SPARK-3137
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>
> TorrentBroadcast.readObject uses a global lock so only one task can be 
> fetching the blocks at the same time.
> This is not optimal if we are running multiple stages concurrently because 
> they should be able to independently fetch their own blocks.






[jira] [Updated] (SPARK-2868) Support named accumulators in Python

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2868:
-
Target Version/s:   (was: 1.2.0)

> Support named accumulators in Python
> 
>
> Key: SPARK-2868
> URL: https://issues.apache.org/jira/browse/SPARK-2868
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Patrick Wendell
>
> SPARK-2380 added this for Java/Scala. To allow this in Python we'll need to 
> make some additional changes. One potential path is to have a 1:1 
> correspondence with Scala accumulators (instead of a one-to-many). A 
> challenge is exposing the stringified values of the accumulators to the Scala 
> code.






[jira] [Updated] (SPARK-4681) Turn on host level blacklisting by default

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4681:
-
Target Version/s:   (was: 1.3.0)

> Turn on host level blacklisting by default
> --
>
> Key: SPARK-4681
> URL: https://issues.apache.org/jira/browse/SPARK-4681
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Patrick Wendell
>Assignee: Davies Liu
>
> Per discussion in https://github.com/apache/spark/pull/3541.






[jira] [Updated] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1239:
-
Target Version/s:   (was: 1.2.0)

> Don't fetch all map output statuses at each reducer during shuffles
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 






[jira] [Updated] (SPARK-3075) Expose a way for users to parse event logs

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3075:
-
Target Version/s:   (was: 1.2.0)

> Expose a way for users to parse event logs
> --
>
> Key: SPARK-3075
> URL: https://issues.apache.org/jira/browse/SPARK-3075
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Andrew Or
>
> Both ReplayListenerBus and util.JsonProtocol are private[spark], so if a user 
> wants to parse the event logs themselves for analytics they will have to 
> write their own JSON deserializers (or do some crazy reflection to access 
> these methods). We should expose an easy way for them to do this.
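
For context, a rough sketch of the hand-rolled parsing users fall back on today:
an (uncompressed) event log is one JSON object per line, with the event type under
an "Event" field written by JsonProtocol. The file path below is a placeholder.
{code}
import json
from collections import Counter

counts = Counter()
with open("/tmp/spark-events/app-20150505-0001") as f:   # placeholder path
    for line in f:
        counts[json.loads(line).get("Event", "unknown")] += 1

print(counts.most_common(10))
{code}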






[jira] [Updated] (SPARK-2774) Set preferred locations for reduce tasks

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2774:
-
Target Version/s:   (was: 1.2.0)

> Set preferred locations for reduce tasks
> 
>
> Key: SPARK-2774
> URL: https://issues.apache.org/jira/browse/SPARK-2774
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>
> Currently we do not set preferred locations for reduce tasks in Spark. This 
> patch proposes setting preferred locations based on the map output sizes and 
> locations tracked by the MapOutputTracker. This is useful in two conditions
> 1. When you have a small job in a large cluster it can be useful to co-locate 
> map and reduce tasks to avoid going over the network
> 2. If there is a lot of data skew in the map stage outputs, then it is 
> beneficial to place the reducer close to the largest output.






[jira] [Updated] (SPARK-3031) Create JsonSerializable and move JSON serialization from JsonProtocol into each class

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3031:
-
Target Version/s:   (was: 1.2.0)

> Create JsonSerializable and move JSON serialization from JsonProtocol into 
> each class
> -
>
> Key: SPARK-3031
> URL: https://issues.apache.org/jira/browse/SPARK-3031
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>
> It is really, really weird that we have a single file JsonProtocol that 
> handles the JSON serialization/deserialization for a bunch of classes. This 
> is very error prone because it is easy to add a new field to a class, and 
> forget to update the JSON part of it. Or worse, add a new event class and 
> forget to add one, as evidenced by 
> https://issues.apache.org/jira/browse/SPARK-3028






[jira] [Updated] (SPARK-4609) Job can not finish if there is one bad slave in clusters

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4609:
-
Target Version/s:   (was: 1.3.0)

> Job can not finish if there is one bad slave in clusters
> 
>
> Key: SPARK-4609
> URL: https://issues.apache.org/jira/browse/SPARK-4609
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>
> If there is one bad machine in the cluster, its executors will keep dying (for 
> example, because the disk is out of space), some tasks may be scheduled to this 
> machine multiple times, and then the job will fail after several failures of one 
> task.
> {code}
> 14/11/26 00:34:57 INFO TaskSetManager: Starting task 39.0 in stage 3.0 (TID 
> 1255, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
> 14/11/26 00:34:57 WARN TaskSetManager: Lost task 39.0 in stage 3.0 (TID 1255, 
> spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 60 
> lost)
> 14/11/26 00:35:02 INFO TaskSetManager: Starting task 39.1 in stage 3.0 (TID 
> 1256, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
> 14/11/26 00:35:03 WARN TaskSetManager: Lost task 39.1 in stage 3.0 (TID 1256, 
> spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 61 
> lost)
> 14/11/26 00:35:08 INFO TaskSetManager: Starting task 39.2 in stage 3.0 (TID 
> 1257, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
> 14/11/26 00:35:08 WARN TaskSetManager: Lost task 39.2 in stage 3.0 (TID 1257, 
> spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 62 
> lost)
> 14/11/26 00:35:13 INFO TaskSetManager: Starting task 39.3 in stage 3.0 (TID 
> 1258, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
> 14/11/26 00:35:14 WARN TaskSetManager: Lost task 39.3 in stage 3.0 (TID 1258, 
> spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 
> lost)
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 39 in 
> stage 3.0 failed 4 times, most recent failure: Lost task 39.3 in stage 3.0 
> (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure 
> (executor 63 lost)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1207)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1196)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1195)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1195)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1413)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1368)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> A task should not be scheduled to the same machine more than once. Also, 
> if a machine fails with an executor lost, it should be put in a blacklist for 
> some time before being tried again.
> cc [~kayousterhout] [~matei]
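A minimal Scala sketch of the time-based blacklist suggested above (illustrative only; BlacklistTracker and blacklistTimeoutMs are hypothetical names, not the scheduler's API):

{code}
// Remember hosts whose executors were lost, and skip them for a while before
// allowing tasks to be scheduled there again.
class BlacklistTracker(blacklistTimeoutMs: Long) {
  private val badHostUntil = scala.collection.mutable.Map[String, Long]()

  def onExecutorLost(host: String): Unit =
    badHostUntil(host) = System.currentTimeMillis() + blacklistTimeoutMs

  def isBlacklisted(host: String): Boolean =
    badHostUntil.get(host).exists(_ > System.currentTimeMillis())
}
{code}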



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1312) Batch should read based on the batch interval provided in the StreamingContext

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1312:
-
Target Version/s:   (was: 1.2.0)

> Batch should read based on the batch interval provided in the StreamingContext
> --
>
> Key: SPARK-1312
> URL: https://issues.apache.org/jira/browse/SPARK-1312
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 0.9.0
>Reporter: Sanjay Awatramani
>Assignee: Tathagata Das
>Priority: Critical
>  Labels: sliding, streaming, window
>
> This problem primarily affects sliding window operations in spark streaming.
> Consider the following scenario:
> - a DStream is created from any source. (I've checked with file and socket)
> - No actions are applied to this DStream
> - Sliding Window operation is applied to this DStream and an action is 
> applied to the sliding window.
> In this case, Spark will not even read the input stream in the batches that do 
> not fall on a sliding-interval boundary. Put another way, it 
> won't read the input when it doesn't have to apply the window function. This 
> is happening because all transformations in Spark are lazy.
> How to fix this or workaround it (see line#3):
> JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new 
> Duration(1 * 60 * 1000));
> JavaDStream<String> inputStream = stcObj.textFileStream("/Input");
> inputStream.print(); // This is the workaround
> JavaDStream<String> objWindow = inputStream.window(new 
> Duration(windowLen*60*1000), new Duration(slideInt*60*1000));
> objWindow.dstream().saveAsTextFiles("/Output", "");
> The "Window operations" example on the streaming guide implies that Spark 
> will read the stream in every batch, which is not happening because of the 
> lazy transformations.
> Wherever sliding window would be used, in most of the cases, no actions will 
> be taken on the pre-window batch, hence my gut feeling was that Streaming 
> would read every batch if any actions are being taken in the windowed stream.
> As per Tathagata,
> "Ideally every batch should read based on the batch interval provided in the 
> StreamingContext."
> Refer the original thread on 
> http://apache-spark-user-list.1001560.n3.nabble.com/Sliding-Window-operations-do-not-work-as-documented-tp2999.html
>  for more details, including Tathagata's conclusion.
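A minimal Scala sketch of the same workaround (assuming a one-minute batch interval; windowLen and slideInt are example values in minutes): force an output operation on the base DStream so every batch is read, then window it.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("WindowWorkaround")
val ssc = new StreamingContext(conf, Minutes(1))
val windowLen = 3  // example value, in minutes
val slideInt = 2   // example value, in minutes

val input = ssc.textFileStream("/Input")
input.print()  // the workaround: an output operation on the base stream forces every batch to be read

val windowed = input.window(Minutes(windowLen), Minutes(slideInt))
windowed.saveAsTextFiles("/Output", "")

ssc.start()
ssc.awaitTermination()
{code}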



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2653:
-
Target Version/s:   (was: 1.2.0)

> Heap size should be the sum of driver.memory and executor.memory in local mode
> --
>
> Key: SPARK-2653
> URL: https://issues.apache.org/jira/browse/SPARK-2653
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Davies Liu
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In local mode, the driver and executor run in the same JVM, so the heap size 
> of that JVM should be the sum of spark.driver.memory and spark.executor.memory.
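A quick way to observe the current behavior (a sketch, not the proposed fix; memory must be set before the JVM starts, e.g. via spark-submit flags):

{code}
// Run with: spark-submit --master local[4] --driver-memory 2g --executor-memory 2g ...
// Today the local-mode heap reflects only the driver memory; the proposal is to
// size it as spark.driver.memory + spark.executor.memory.
val heapMb = Runtime.getRuntime.maxMemory() / (1024L * 1024L)
println(s"local JVM max heap: ~$heapMb MB")
{code}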



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3685) Spark's local dir should accept only local paths

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3685:
-
Target Version/s:   (was: 1.2.0)

> Spark's local dir should accept only local paths
> 
>
> Key: SPARK-3685
> URL: https://issues.apache.org/jira/browse/SPARK-3685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> When you try to set local dirs to "hdfs:/tmp/foo" it doesn't work. What it 
> will try to do is create a folder called "hdfs:" and put "tmp" inside it. 
> This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
> of Hadoop's file system to parse this path. We also need to resolve the path 
> appropriately.
> This may not have an urgent use case, but it fails silently and does what is 
> least expected.
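A minimal sketch of the parsing difference described above (illustrative only, not the proposed fix):

{code}
import java.io.File
import org.apache.hadoop.fs.Path

val dir = "hdfs:/tmp/foo"
new File(dir).mkdirs()             // java.io.File: silently creates a local "hdfs:" directory tree
val parsed = new Path(dir)         // Hadoop's Path understands the scheme
println(parsed.toUri.getScheme)    // prints "hdfs", so a non-local path could be rejected or resolved
{code}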



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4488) Add control over map-side aggregation

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4488:
-
Target Version/s:   (was: 1.1.1, 1.2.0)

> Add control over map-side aggregation
> -
>
> Key: SPARK-4488
> URL: https://issues.apache.org/jira/browse/SPARK-4488
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: uncleGen
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3095) [PySpark] Speed up RDD.count()

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3095:
-
Target Version/s:   (was: 1.2.0)

> [PySpark] Speed up RDD.count()
> --
>
> Key: SPARK-3095
> URL: https://issues.apache.org/jira/browse/SPARK-3095
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Minor
>
> RDD.count() can fall back to RDD._jrdd.count(), when the RDD is not 
> PipelineRDD.
> If the JavaRDD is serialized in batch mode, it's possible to skip the 
> deserialization of chunks (except the last one), because they all have the 
> same number of elements in them. There are some special cases in which the chunks 
> are re-ordered, so this will not work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3913) Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and killApplication() API

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3913:
-
Target Version/s:   (was: 1.2.0)

> Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn 
> Application Listener and killApplication() API
> --
>
> Key: SPARK-3913
> URL: https://issues.apache.org/jira/browse/SPARK-3913
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Chester
>
> When working with Spark in YARN deployment mode, we have a few issues:
> 1) We don't know the YARN maximum capacity (memory and cores) before we 
> specify the number of executors and the memory for Spark drivers and executors. 
> If we set a big number, the job can potentially exceed the limit and get 
> killed. 
>    It would be better to let the application know the YARN resource 
> capacity ahead of time, so the Spark config can be adjusted dynamically. 
>   
> 2) Once the job has started, we would like to have some feedback from the YARN 
> application. Currently, the Spark client basically blocks the call and returns 
> when the job is finished, failed, or killed. 
> If the job runs for a few hours, we have no idea how far it has gone: the 
> progress, resource usage, tracking URL, etc. 
> 3) Once the job is started, you basically can't stop it. The YARN client API's 
> stop doesn't work in most cases in our experience, but the YARN API that does 
> work is killApplication(appId). 
>    So we need to expose this killApplication() API to the Spark YARN client as 
> well. 
>
> I will create one Pull Request and try to address these problems.  
>  
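For point 3, a hedged sketch of what exposing kill support could build on: Hadoop's own YarnClient already provides killApplication(appId). The application id string below is hypothetical.

{code}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.util.ConverterUtils

val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new YarnConfiguration())
yarnClient.start()

// Kill a previously submitted application by its id.
val appId = ConverterUtils.toApplicationId("application_1430000000000_0001")
yarnClient.killApplication(appId)
yarnClient.stop()
{code}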



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2838) performance tests for feature transformations

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2838:
-
Target Version/s:   (was: 1.2.0)

> performance tests for feature transformations
> -
>
> Key: SPARK-2838
> URL: https://issues.apache.org/jira/browse/SPARK-2838
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
>
> 1. TF-IDF
> 2. StandardScaler
> 3. Normalizer
> 4. Word2Vec



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3348) Support user-defined SparkListeners properly

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3348:
-
Target Version/s:   (was: 1.2.0)

> Support user-defined SparkListeners properly
> 
>
> Key: SPARK-3348
> URL: https://issues.apache.org/jira/browse/SPARK-3348
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> Because of the current initialization ordering, user-defined SparkListeners 
> do not receive certain events that are posted before application code is run. 
> We need to expose a constructor that allows the given SparkListeners to 
> receive all events.
> There has been interest in this for a while, but I have searched through the 
> JIRA history and have not found a related issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4784) Model.fittingParamMap should store all Params

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4784:
-
Target Version/s:   (was: 1.3.0)

> Model.fittingParamMap should store all Params
> -
>
> Key: SPARK-4784
> URL: https://issues.apache.org/jira/browse/SPARK-4784
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> spark.ml's Model class should store all parameters in the fittingParamMap, 
> not just the ones which were explicitly set.
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3492) Clean up Yarn integration code

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3492:
-
Target Version/s:   (was: 1.2.0)

> Clean up Yarn integration code
> --
>
> Key: SPARK-3492
> URL: https://issues.apache.org/jira/browse/SPARK-3492
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> This is the parent umbrella for cleaning up the Yarn integration code in 
> general. This is a broad effort, and each individual cleanup should be opened as 
> a sub-issue against this one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2315) drop, dropRight and dropWhile which take RDD input and return RDD

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2315:
-
Target Version/s:   (was: 1.2.0)

> drop, dropRight and dropWhile which take RDD input and return RDD
> -
>
> Key: SPARK-2315
> URL: https://issues.apache.org/jira/browse/SPARK-2315
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>  Labels: features
>
> Last time I loaded in a text file, I found myself wanting to just skip the 
> first element as it was a header. I wrote candidate methods drop, 
> dropRight and dropWhile to satisfy this kind of need:
> val txt = sc.textFile("text_with_header.txt")
> val data = txt.drop(1)
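A hedged sketch of a workaround that already works today (not the proposed drop() API): skip the header with mapPartitionsWithIndex.

{code}
val txt = sc.textFile("text_with_header.txt")
val data = txt.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter   // drop the first line of the first partition only
}
{code}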



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5113:
-
Target Version/s:   (was: 1.3.0)

> Audit and document use of hostnames and IP addresses in Spark
> -
>
> Key: SPARK-5113
> URL: https://issues.apache.org/jira/browse/SPARK-5113
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Critical
>
> Spark has multiple network components that start servers and advertise their 
> network addresses to other processes.
> We should go through each of these components and make sure they have 
> consistent and/or documented behavior wrt (a) what interface(s) they bind to 
> and (b) what hostname they use to advertise themselves to other processes. We 
> should document this clearly and explain to people what to do in different 
> cases (e.g. EC2, dockerized containers, etc).
> When Spark initializes, it will search for a network interface until it finds 
> one that is not a loopback address. Then it will do a reverse DNS lookup for 
> a hostname associated with that interface. Then the network components will 
> use that hostname to advertise the component to other processes. That 
> hostname is also the one used for the akka system identifier (akka supports 
> only supplying a single name which it uses both as the bind interface and as 
> the actor identifier). In some cases, that hostname is used as the bind 
> hostname also (e.g. I think this happens in the connection manager and 
> possibly akka) - which will likely internally result in a re-resolution of 
> this to an IP address. In other cases (the web UI and netty shuffle) we seem 
> to bind to all interfaces.
> The best outcome would be to have three configs that can be set on each 
> machine:
> {code}
> SPARK_LOCAL_IP # Ip address we bind to for all services
> SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within 
> the cluster
> SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the 
> cluster (e.g. the UI)
> {code}
> It's not clear how easily we can support that scheme while providing 
> backwards compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - 
> it's just an alias for what is now SPARK_PUBLIC_DNS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3917) Compress data before network transfer

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3917:
-
Target Version/s:   (was: 1.2.0)

> Compress data before network transfer
> -
>
> Key: SPARK-3917
> URL: https://issues.apache.org/jira/browse/SPARK-3917
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: All
>Reporter: junlong
>
> When training a gradient-boosted decision tree on large sparse data, heavy 
> network traffic pulls down the CPU utilization ratio. Compressing the data sent 
> over the network reduced it by about 90%. 
> So compressing data before transfer may provide a higher speedup in 
> Spark, and the user could configure whether or not to compress.
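For reference, a hedged sketch of the compression knobs Spark already exposes for some data paths (shuffle output and broadcast variables); this issue asks for the same kind of user-configurable compression for the data transferred during training.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.compress", "true")      // compress map outputs before transfer
  .set("spark.broadcast.compress", "true")    // compress broadcast variables
  .set("spark.io.compression.codec", "lz4")   // codec used by the settings above
{code}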



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3376:
-
Target Version/s:   (was: 1.3.0)

> Memory-based shuffle strategy to reduce overhead of disk I/O
> 
>
> Key: SPARK-3376
> URL: https://issues.apache.org/jira/browse/SPARK-3376
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: uncleGen
>  Labels: performance
>
> I think a memory-based shuffle can reduce some of the overhead of disk I/O. I just 
> want to know whether there is any plan to do something about it, or any suggestion 
> about it. Based on the work in SPARK-2044, it is feasible to have several 
> implementations of shuffle.
> 
> Currently, there are two implementations of the shuffle manager, i.e. SORT and 
> HASH. Both of them will use disk in some stages. For example, on the map 
> side, all the intermediate data will be written into temporary files; on the 
> reduce side, Spark will sometimes use an external sort. In any case, disk I/O 
> will bring some performance loss. Maybe we can provide a pure-memory shuffle 
> manager, in which intermediate data only goes through 
> memory. In some scenarios, it can improve performance. Experimentally, I 
> implemented an in-memory shuffle manager on top of SPARK-2044. 
> 1. Following is my testing result (some heavy shuffle operations):
> | data size (Byte) | partitions | resources |
> | 5131859218 | 2000 | 50 executors / 4 cores / 4GB |
> | settings | operation1 | operation2 |
> | shuffle spill & lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
> |memory   |   38s   |  16s |
> |sort |   45s   |  28s |
> |hash |   46s   |  28s |
> |no shuffle spill & lz4 | | |
> | memory |   16s | 16s |
> | | | |
> |shuffle spill & lzf | | |
> |memory|  28s   | 27s |
> |sort  |  29s   | 29s |
> |hash  |  41s   | 30s |
> |no shuffle spill & lzf | | |
> | memory |  15s | 16s |
> In my implementation, I simply reused the "BlockManager" on the map side and 
> set "spark.shuffle.spill" to false on the reduce side. All the intermediate 
> data is cached in the memory store. Just as Reynold Xin has pointed out, our 
> disk-based shuffle manager has achieved good performance. With parameter 
> tuning, the disk-based shuffle manager will obtain performance similar to the 
> memory-based shuffle manager. However, I will continue my work and improve 
> it. And as an alternative tuning option, "InMemory shuffle" is a good choice. 
> Future work includes, but is not limited to:
> - memory usage management in "InMemory Shuffle" mode
> - data management when intermediate data can not fit in memory
> Test code:
> {code: borderStyle=solid}
> val conf = new SparkConf().setAppName("InMemoryShuffleTest")
> val sc = new SparkContext(conf)
> val dataPath = args(0)
> val partitions = args(1).toInt
> val rdd1 = sc.textFile(dataPath).cache()
> rdd1.count()
> val startTime = System.currentTimeMillis()
> val rdd2 = rdd1.repartition(partitions)
>   .flatMap(_.split(",")).map(s => (s, s))
>   .groupBy(e => e._1)
> rdd2.count()
> val endTime = System.currentTimeMillis()
> println("time: " + (endTime - startTime) / 1000 )
> {code}
> 2. Following is a Spark Sort Benchmark (in Spark 1.1.1). There is no tuning 
> for the disk shuffle. 
> 2.1. Test the influence of memory size per core
> precondition: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions 
> (input file blocks). 
> | memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
> | 9GB  | 79.652849s | 60.102337s | failed      | -32.7%  | -       |
> | 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17%  | +47.8%  |
> | 15GB | 33.537199s | 40.140621s | 48.088158s  | +16.47% | +30.26% |
> | 18GB | 30.930927s | 43.392401s | 49.830276s  | +28.7%  | +37.93% |
> 2.2. Test the influence of partition number
> 18GB / 15 cores per executor
> | partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |

[jira] [Updated] (SPARK-3916) recognize appended data in textFileStream()

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3916:
-
Target Version/s:   (was: 1.2.0)

> recognize appended data in textFileStream()
> ---
>
> Key: SPARK-3916
> URL: https://issues.apache.org/jira/browse/SPARK-3916
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we only find new data in new files; data appended to old 
> files (processed in the last batch) will not be processed.
> In order to support this, we need partialRDD(), which is an RDD for part of 
> a file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3132:
-
Target Version/s:   (was: 1.2.0)

> Avoid serialization for Array[Byte] in TorrentBroadcast
> ---
>
> Key: SPARK-3132
> URL: https://issues.apache.org/jira/browse/SPARK-3132
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Davies Liu
>
> If the input data is a byte array, we should allow TorrentBroadcast to skip 
> serializing and compressing the input.
> To do this, we should add a new parameter (shortCircuitByteArray) to 
> TorrentBroadcast, and then avoid serialization if the input is a byte array 
> and shortCircuitByteArray is true.
> We should then also do compression in task serialization itself instead of 
> doing it in TorrentBroadcast.
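A minimal sketch of the short-circuit idea (illustrative only; the serialize function below stands in for whatever TorrentBroadcast does today):

{code}
// If the input is already a byte array and the flag is set, pass it through untouched.
def toBytes(value: Any, shortCircuitByteArray: Boolean)
           (serialize: Any => Array[Byte]): Array[Byte] = value match {
  case bytes: Array[Byte] if shortCircuitByteArray => bytes        // skip serialization entirely
  case other                                       => serialize(other)
}
{code}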



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2999) Compress all the serialized data

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2999:
-
Target Version/s:   (was: 1.2.0)

> Compress all the serialized data
> 
>
> Key: SPARK-2999
> URL: https://issues.apache.org/jira/browse/SPARK-2999
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> LZ4 is so fast that we can get a performance benefit for all network/disk I/O.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3051) Support looking-up named accumulators in a registry

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3051:
-
Target Version/s:   (was: 1.2.0)

> Support looking-up named accumulators in a registry
> ---
>
> Key: SPARK-3051
> URL: https://issues.apache.org/jira/browse/SPARK-3051
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Neil Ferguson
>
> This is a proposed enhancement to Spark based on the following mailing list 
> discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/quot-Dynamic-variables-quot-in-Spark-td7450.html.
> This proposal builds on SPARK-2380 (Support displaying accumulator values in 
> the web UI) to allow named accumulables to be looked up in a "registry", as 
> opposed to having to be passed to every method that needs to access them.
> The use case was described well by [~shivaram], as follows:
> Let's say you have two functions you use 
> in a map call and want to measure how much time each of them takes. For 
> example, if you have a code block like the one below and you want to 
> measure how much time f1 takes as a fraction of the task. 
> {noformat}
> a.map { l => 
>val f = f1(l) 
>... some work here ... 
> } 
> {noformat}
> It would be really cool if we could do something like 
> {noformat}
> a.map { l => 
>val start = System.nanoTime 
>val f = f1(l) 
>TaskMetrics.get("f1-time").add(System.nanoTime - start) 
> } 
> {noformat}
> SPARK-2380 provides a partial solution to this problem -- however the 
> accumulables would still need to be passed to every function that needs them, 
> which I think would be cumbersome in any application of reasonable complexity.
> The proposal, as suggested by [~pwendell], is to have a "registry" of 
> accumulables, that can be looked-up by name. 
> Regarding the implementation details, I'd propose that we broadcast a 
> serialized version of all named accumulables in the DAGScheduler (similar to 
> what SPARK-2521 does for Tasks). These can then be deserialized in the 
> Executor. 
> Accumulables are already stored in thread-local variables in the Accumulators 
> object, so exposing these in the registry should be simply a matter of 
> wrapping this object, and keying the accumulables by name (they are currently 
> keyed by ID).
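A hedged sketch of what a name-keyed registry could look like (TaskMetricsRegistry and its methods are hypothetical, not Spark's API):

{code}
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.Accumulator

object TaskMetricsRegistry {
  private val byName = new ConcurrentHashMap[String, Accumulator[Long]]()
  def register(name: String, acc: Accumulator[Long]): Unit = byName.put(name, acc)
  def get(name: String): Accumulator[Long] = byName.get(name)
}

// Usage inside a task, as in the example above:
//   TaskMetricsRegistry.get("f1-time").add(System.nanoTime - start)
{code}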



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3059) Spark internal module interface design

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3059:
-
Target Version/s:   (was: 1.3.0)

> Spark internal module interface design
> --
>
> Key: SPARK-3059
> URL: https://issues.apache.org/jira/browse/SPARK-3059
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> An umbrella ticket to track various internal module interface designs & 
> implementations for Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7393) How to improve Spark SQL performance?

2015-05-05 Thread Liang Lee (JIRA)
Liang Lee created SPARK-7393:


 Summary: How to improve Spark SQL performance?
 Key: SPARK-7393
 URL: https://issues.apache.org/jira/browse/SPARK-7393
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang Lee






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7383) Python API for ml.feature

2015-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7383:
-
Priority: Major  (was: Blocker)

> Python API for ml.feature
> -
>
> Key: SPARK-7383
> URL: https://issues.apache.org/jira/browse/SPARK-7383
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7388) Python Api for Param[Array[T]]

2015-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7388:
-
Assignee: Burak Yavuz

> Python Api for Param[Array[T]]
> --
>
> Key: SPARK-7388
> URL: https://issues.apache.org/jira/browse/SPARK-7388
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> Python can't set Array[T]-typed params, because py4j casts a Python list to an 
> ArrayList. Instead of Param[Array[T]], we will have an ArrayParam[T] which can 
> take a Seq[T].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7381) Python API for Transformers

2015-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7381:
-
Priority: Major  (was: Blocker)

> Python API for Transformers
> ---
>
> Key: SPARK-7381
> URL: https://issues.apache.org/jira/browse/SPARK-7381
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7382) Python API for ml.classification

2015-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7382:
-
Assignee: Burak Yavuz

> Python API for ml.classification
> 
>
> Key: SPARK-7382
> URL: https://issues.apache.org/jira/browse/SPARK-7382
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7382) Python API for ml.classification

2015-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7382:
-
Priority: Major  (was: Blocker)

> Python API for ml.classification
> 
>
> Key: SPARK-7382
> URL: https://issues.apache.org/jira/browse/SPARK-7382
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7381) Python API for Transformers

2015-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7381:
-
Assignee: Burak Yavuz

> Python API for Transformers
> ---
>
> Key: SPARK-7381
> URL: https://issues.apache.org/jira/browse/SPARK-7381
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7383) Python API for ml.feature

2015-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7383:
-
Assignee: Burak Yavuz

> Python API for ml.feature
> -
>
> Key: SPARK-7383
> URL: https://issues.apache.org/jira/browse/SPARK-7383
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1920) Spark JAR compiled with Java 7 leads to PySpark not working in YARN

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1920:
---
Priority: Blocker  (was: Major)

> Spark JAR compiled with Java 7 leads to PySpark not working in YARN
> ---
>
> Key: SPARK-1920
> URL: https://issues.apache.org/jira/browse/SPARK-1920
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 1.0.0
>Reporter: Tathagata Das
>Priority: Blocker
>
> The current (Spark 1.0) implementation of PySpark on YARN requires Python to be 
> able to read the Spark assembly JAR, but a Spark assembly JAR compiled with Java 7 
> can sometimes not be readable by Python. This can be due to the fact that 
> JARs created by Java 7 with more than 2^16 files are encoded in Zip64, which Python 
> can't read. 
> [SPARK-1911|https://issues.apache.org/jira/browse/SPARK-1911] warns users 
> against using Java 7 when creating a Spark distribution. 
> One way to fix this is to put PySpark in a separate, smaller JAR than the rest of 
> Spark so that it is readable by Python.
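A small sketch for checking whether an assembly JAR is likely affected (the path is hypothetical): count its entries; beyond 65535 entries the JAR is written with Zip64 extensions, which Python's importer cannot read.

{code}
import java.util.zip.ZipFile

val jar = new ZipFile("/path/to/spark-assembly.jar")   // hypothetical path
println(s"${jar.size()} entries; Zip64 is likely if this exceeds 65535")
jar.close()
{code}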



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1920) Spark JAR compiled with Java 7 leads to PySpark not working in YARN

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1920:
---
Target Version/s: 1.4.0

> Spark JAR compiled with Java 7 leads to PySpark not working in YARN
> ---
>
> Key: SPARK-1920
> URL: https://issues.apache.org/jira/browse/SPARK-1920
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 1.0.0
>Reporter: Tathagata Das
>Priority: Blocker
>
> The current (Spark 1.0) implementation of PySpark on YARN requires Python to be 
> able to read the Spark assembly JAR, but a Spark assembly JAR compiled with Java 7 
> can sometimes not be readable by Python. This can be due to the fact that 
> JARs created by Java 7 with more than 2^16 files are encoded in Zip64, which Python 
> can't read. 
> [SPARK-1911|https://issues.apache.org/jira/browse/SPARK-1911] warns users 
> against using Java 7 when creating a Spark distribution. 
> One way to fix this is to put PySpark in a separate, smaller JAR than the rest of 
> Spark so that it is readable by Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7388) Python Api for Param[Array[T]]

2015-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7388:
-
Priority: Major  (was: Blocker)

> Python Api for Param[Array[T]]
> --
>
> Key: SPARK-7388
> URL: https://issues.apache.org/jira/browse/SPARK-7388
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Burak Yavuz
>
> Python can't set Array[T]-typed params, because py4j casts a Python list to an 
> ArrayList. Instead of Param[Array[T]], we will have an ArrayParam[T] which can 
> take a Seq[T].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-05-05 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530029#comment-14530029
 ] 

Guoqiang Li commented on SPARK-5556:


[FastLDA|https://github.com/witgo/zen/blob/1c0f6c63a0b67569aeefba3f767acf1ac93c7a7c/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L553]:
Gibbs sampling. The computational complexity is O(n_dk), where n_dk is the number of 
unique topics in document d. I recommend it for short text.
[LightLDA|https://github.com/witgo/zen/blob/1c0f6c63a0b67569aeefba3f767acf1ac93c7a7c/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L763]:
Metropolis-Hastings sampling. The computational complexity is O(1) (it depends on 
the partition strategy and takes up more memory).


> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
> Attachments: LDA_test.xlsx, spark-summit.pptx
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7285) Audit missing Hive functions

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7285:
---
Description: 
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

{code}
between: added in 1.4
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT
{code}

*math*

{code}
round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) -> binary
conv
pmod
factorial
-toDeg  -> toDegrees-
-toRad -> toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)
{code}

*collection functions*

{code}
sort_array(array)
size(map, array)
map_values(map): array
map_keys(map):array
array_contains(array, value): boolean
{code}

*date functions*

{code}
from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String
{code}


*conditional functions*

{code}
if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, T v2, …): T
least(T v1, T v2, …): T
{code}


*string functions*

{code}
ascii(string str): int
base64(binary): string
concat(string|binary A, string|binary B…): string | binary
concat_ws(string SEP, string A, string B…): string
concat_ws(string SEP, array): string
decode(binary bin, string charset): string
encode(string src, string charset): binary
find_in_set(string str, string strList): int
format_number(number x, int d): string
length(string): int
instr(string str, string substr): int
locate(string substr, string str[, int pos]): int
lower(string), lcase(string)
lpad(string str, int len, string pad): string
ltrim(string): string

parse_url(string urlString, string partToExtract [, string keyToExtract]): 
string
printf(String format, Obj... args): string
regexp_extract(string subject, string pattern, int index): string
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): 
string
repeat(string str, int n): string
reverse(string A): string
rpad(string str, int len, string pad): string
space(int n): string
split(string str, string pat): array
str_to_map(text[, delimiter1, delimiter2]): map
trim(string A): string
unbase64(string str): binary
upper(string A) ucase(string A): string
levenshtein(string A, string B): int
soundex(string A): string
{code}


*Misc*

{code}
hash(a1[, a2…]): int
{code}


*text*

{code}
context_ngrams(array<array<string>>, array<string>, int K, int pf): 
array<struct<string,double>>
ngrams(array<array<string>>, int N, int K, int pf): array<struct<string,double>>
sentences(string str, string lang, string locale): array<array<string>>
{code}


*UDAF*

{code}
var_samp
stddev_pop
stddev_samp
covar_pop
covar_samp
corr
percentile: array
percentile_approx: array
histogram_numeric: array
collect_set  <— we have hashset
collect_list 
ntile
{code}







  was:
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

{code}
-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT
{code}

*math*

{code}
round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) -> binary
conv
pmod
factorial
-toDeg  -> toDegrees-
-toRad -> toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)
{code}

*collection functions*

{code}
sort_array(array)
size(map, array)
map_values(map): array
map_keys(map):array
array_contains(array, value): boolean
{code}

*date functions*

{code}
from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_

[jira] [Updated] (SPARK-7285) Audit missing Hive functions

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7285:
---
Description: 
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

{code}
-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT
{code}

*math*

{code}
round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) -> binary
conv
pmod
factorial
-toDeg  -> toDegrees-
-toRad -> toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)
{code}

*collection functions*

{code}
sort_array(array)
size(map, array)
map_values(map): array
map_keys(map):array
array_contains(array, value): boolean
{code}

*date functions*

{code}
from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String
{code}


*conditional functions*

{code}
if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, T v2, …): T
least(T v1, T v2, …): T
{code}


*string functions*

{code}
ascii(string str): int
base64(binary): string
concat(string|binary A, string|binary B…): string | binary
concat_ws(string SEP, string A, string B…): string
concat_ws(string SEP, array): string
decode(binary bin, string charset): string
encode(string src, string charset): binary
find_in_set(string str, string strList): int
format_number(number x, int d): string
length(string): int
instr(string str, string substr): int
locate(string substr, string str[, int pos]): int
lower(string), lcase(string)
lpad(string str, int len, string pad): string
ltrim(string): string

parse_url(string urlString, string partToExtract [, string keyToExtract]): 
string
printf(String format, Obj... args): string
regexp_extract(string subject, string pattern, int index): string
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): 
string
repeat(string str, int n): string
reverse(string A): string
rpad(string str, int len, string pad): string
space(int n): string
split(string str, string pat): array
str_to_map(text[, delimiter1, delimiter2]): map
trim(string A): string
unbase64(string str): binary
upper(string A) ucase(string A): string
levenshtein(string A, string B): int
soundex(string A): string
{code}


*Misc*

{code}
hash(a1[, a2…]): int
{code}


*text*

{code}
context_ngrams(array<array<string>>, array<string>, int K, int pf): 
array<struct<string,double>>
ngrams(array<array<string>>, int N, int K, int pf): array<struct<string,double>>
sentences(string str, string lang, string locale): array<array<string>>
{code}


*UDAF*

{code}
var_samp
stddev_pop
stddev_samp
covar_pop
covar_samp
corr
percentile: array
percentile_approx: array
histogram_numeric: array
collect_set  <— we have hashset
collect_list 
ntile
{code}







  was:
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT

*math*

round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) -> binary
conv
pmod
factorial
-toDeg  -> toDegrees-
-toRad -> toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)

*collection functions*

sort_array(array)
size(map, array)
map_values(map): array
map_keys(map):array
array_contains(array, value): boolean

*date functions*

from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String


*condi

[jira] [Updated] (SPARK-7285) Audit missing Hive functions

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7285:
---
Description: 
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT

*math*

round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) -> binary
conv
pmod
factorial
-toDeg  -> toDegrees-
-toRad -> toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)

*collection functions*

sort_array(array)
size(map, array)
map_values(map): array
map_keys(map):array
array_contains(array, value): boolean

*date functions*

from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String


*conditional functions*

if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, T v2, …): T
least(T v1, T v2, …): T


*string functions*

ascii(string str): int
base64(binary): string
concat(string|binary A, string|binary B…): string | binary
concat_ws(string SEP, string A, string B…): string
concat_ws(string SEP, array): string
decode(binary bin, string charset): string
encode(string src, string charset): binary
find_in_set(string str, string strList): int
format_number(number x, int d): string
length(string): int
instr(string str, string substr): int
locate(string substr, string str[, int pos]): int
lower(string), lcase(string)
lpad(string str, int len, string pad): string
ltrim(string): string

parse_url(string urlString, string partToExtract [, string keyToExtract]): 
string
printf(String format, Obj... args): string
regexp_extract(string subject, string pattern, int index): string
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): 
string
repeat(string str, int n): string
reverse(string A): string
rpad(string str, int len, string pad): string
space(int n): string
split(string str, string pat): array
str_to_map(text[, delimiter1, delimiter2]): map
trim(string A): string
unbase64(string str): binary
upper(string A) ucase(string A): string
levenshtein(string A, string B): int
soundex(string A): string


*Misc*

hash(a1[, a2…]): int


*text*

context_ngrams(array<array<string>>, array<string>, int K, int pf): 
array<struct<string,double>>
ngrams(array<array<string>>, int N, int K, int pf): array<struct<string,double>>
sentences(string str, string lang, string locale): array<array<string>>


*UDAF*

var_samp
stddev_pop
stddev_samp
covar_pop
covar_samp
corr
percentile: array
percentile_approx: array
histogram_numeric: array
collect_set  <— we have hashset
collect_list 
ntile








  was:
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT

*math*

round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) -> binary
conv
pmod
factorial
toDeg  -> toDegrees
toRad -> toRadians
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)

*collection functions*

sort_array(array)
size(map, array)
map_values(map): array
map_keys(map):array
array_contains(array, value): boolean

*date functions*

from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String


*conditional functions*

if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, 

[jira] [Updated] (SPARK-7285) Audit missing Hive functions

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7285:
---
Description: 
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT

*math*

round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) -> binary
conv
pmod
factorial
toDeg  -> toDegrees
toRad -> toRadians
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)

*collection functions*

sort_array(array)
size(map, array)
map_values(map): array
map_keys(map):array
array_contains(array, value): boolean

*date functions*

from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date): int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String


*conditional functions*

if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, T v2, …): T
least(T v1, T v2, …): T


*string functions*

ascii(string str): int
base64(binary): string
concat(string|binary A, string|binary B…): string | binary
concat_ws(string SEP, string A, string B…): string
concat_ws(string SEP, array): string
decode(binary bin, string charset): string
encode(string src, string charset): binary
find_in_set(string str, string strList): int
format_number(number x, int d): string
length(string): int
instr(string str, string substr): int
locate(string substr, string str[, int pos]): int
lower(string), lcase(string)
lpad(string str, int len, string pad): string
ltrim(string): string

parse_url(string urlString, string partToExtract [, string keyToExtract]): 
string
printf(String format, Obj... args): string
regexp_extract(string subject, string pattern, int index): string
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): 
string
repeat(string str, int n): string
reverse(string A): string
rpad(string str, int len, string pad): string
space(int n): string
split(string str, string pat): array
str_to_map(text[, delimiter1, delimiter2]): map
trim(string A): string
unbase64(string str): binary
upper(string A), ucase(string A): string
levenshtein(string A, string B): int
soundex(string A): string


*Misc*

hash(a1[, a2…]): int


*text*

context_ngrams(array<array<string>>, array<string>, int K, int pf): 
array<struct<string,double>>
ngrams(array<array<string>>, int N, int K, int pf): array<struct<string,double>>
sentences(string str, string lang, string locale): array<array<string>>


*UDAF*

var_samp
stddev_pop
stddev_samp
covar_pop
covar_samp
corr
percentile: array
percentile_approx: array
histogram_numeric: array
collect_set  <— we have hashset
collect_list 
ntile
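
As a reference point for the string functions listed above, here is a minimal 
sketch of calling a few of them through HiveContext SQL today, before native 
DataFrame equivalents exist (the `hiveCtx` value is an assumed, 
already-constructed HiveContext, not something defined in this list):
{code}
// a minimal sketch, assuming an existing HiveContext bound to `hiveCtx`;
// these Hive UDFs are reachable through SQL even though they have no
// DataFrame counterpart yet
val result = hiveCtx.sql(
  "SELECT parse_url('http://spark.apache.org/docs/latest/?q=sql', 'HOST') AS host, " +
  "find_in_set('b', 'a,b,c') AS pos, " +
  "repeat('ab', 3) AS repeated")
result.show()
{code}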








  was:
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

between
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT

*math*

round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) -> binary
conv
pmod
factorial
toDeg  -> toDegrees
toRad -> toRadians
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)

*collection functions*

sort_array(array)
size(map, array)
map_values(map): array
map_keys(map):array
array_contains(array, value): boolean

*date functions*

from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date): int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String


*conditional functions*

if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, T v2, 

[jira] [Resolved] (SPARK-7369) Spark Python 1.3.1 Mllib dataframe random forest problem

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7369.
--
Resolution: Invalid

Have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

This sounds, on its face, like a py4j issue. These kinds of things can be 
reopened if there is more specific evidence it's Spark-related.

> Spark Python 1.3.1 Mllib dataframe random forest problem
> 
>
> Key: SPARK-7369
> URL: https://issues.apache.org/jira/browse/SPARK-7369
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.1
>Reporter: Lisbeth Ron
>  Labels: hadoop
>
> I'm working with DataFrames to train a random forest with MLlib,
> and I get this error:
>   File 
> "/opt/mapr/spark/spark-1.3.1-bin-mapr4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o58.sql.
> Can somebody help me?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7374) Error message when launching: "find: 'version' : No such file or directory"

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7374.
--
Resolution: Duplicate

Please search JIRA first and review 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> Error message when launching: "find: 'version' : No such file or directory"
> ---
>
> Key: SPARK-7374
> URL: https://issues.apache.org/jira/browse/SPARK-7374
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, PySpark, Spark Shell
>Affects Versions: 1.3.1
>Reporter: Stijn Geuens
>
> When launching spark-shell (or pyspark), I get the following message:
> find: 'version' : No such file or directory
> else was unexpected at this time.
> How is it possible that this error keeps occurring (with different versions 
> of Spark)? How can I resolve this issue?
> Thanks in advance,
> Stijn



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7150) SQLContext.range()

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7150:
---
Summary: SQLContext.range()  (was: Facilitate random column generation for 
DataFrames)

> SQLContext.range()
> --
>
> Key: SPARK-7150
> URL: https://issues.apache.org/jira/browse/SPARK-7150
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> It would be handy to have easy ways to construct random columns for 
> DataFrames.  Proposed API:
> {code}
> class SQLContext {
>   // Return a DataFrame with a single column named "id" that has consecutive 
> value from 0 to n.
>   def range(n: Long): DataFrame
>   def range(n: Long, numPartitions: Int): DataFrame
> }
> {code}
> Usage:
> {code}
> // uniform distribution
> ctx.range(1000).select(rand())
> // normal distribution
> ctx.range(1000).select(randn())
> {code}
> We should add a RangeIterator that supports long start/stop position, and 
> then use it to create an RDD as the basis for this DataFrame.
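
A minimal usage sketch of the proposal quoted above, combining the proposed 
range() method with the rand()/randn() expressions it is meant to feed (method 
names follow the proposal and may differ from what finally ships):
{code}
import org.apache.spark.sql.functions.{rand, randn}

// hypothetical usage of the proposed API: a single "id" column holding
// consecutive values, used as the base for derived random columns
val base = sqlContext.range(1000)
base.select(rand(11L).as("uniform"), randn(27L).as("normal")).show()
{code}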



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7151) Correlation methods for DataFrame

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7151.
--
  Resolution: Duplicate
   Fix Version/s: 1.4.0
Assignee: Burak Yavuz
Target Version/s: 1.4.0

> Correlation methods for DataFrame
> -
>
> Key: SPARK-7151
> URL: https://issues.apache.org/jira/browse/SPARK-7151
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>Assignee: Burak Yavuz
>Priority: Minor
>  Labels: dataframe
> Fix For: 1.4.0
>
>
> We should support computing correlations between columns in DataFrames with a 
> simple API.
> This could be a DataFrame feature:
> {code}
> myDataFrame.corr("col1", "col2")
> // or
> myDataFrame.corr("col1", "col2", "pearson") // specify correlation type
> {code}
> Or it could be an MLlib feature:
> {code}
> Statistics.corr(myDataFrame("col1"), myDataFrame("col2"))
> // or
> Statistics.corr(myDataFrame, "col1", "col2")
> {code}
> (The first Statistics.corr option is more flexible, but it could cause 
> trouble if a user tries to pass in 2 unzippable DataFrame columns.)
> Note: R follows the latter setup.  I'm OK with either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7310) SparkSubmit does not escape & for java options and ^& won't work

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7310.
--
Resolution: Not A Problem

OK, it sounds like the particular escaping issue is no longer a problem as far as 
we can tell.

> SparkSubmit does not escape & for java options and ^& won't work
> 
>
> Key: SPARK-7310
> URL: https://issues.apache.org/jira/browse/SPARK-7310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Yitong Zhou
>Priority: Minor
>
> I can create the error when doing something like:
> {code}
> LIBJARS= /jars.../
> bin/spark-submit \
>  --driver-java-options "-Djob.url=http://www.foo.bar?query=a&b"; \
>  --class com.example.Class \
>  --master yarn-cluster \
>  --num-executors 3 \
>  --executor-cores 1 \
>  --queue default \
>  --driver-memory 1g \
>  --executor-memory 1g \
>  --jars $LIBJARS\
>  ../a.jar \
>  -inputPath /user/yizhou/CED-scoring/input \
>  -outputPath /user/yizhou
> {code}
> Notice that if I remove the "&" in "--driver-java-options" value, then the 
> submit will succeed. A typical error message looks like this:
> {code}
> org.apache.hadoop.util.Shell$ExitCodeException: Usage: java [-options] class 
> [args...]
>(to execute a class)
>or  java [-options] -jar jarfile [args...]
>(to execute a jar file)
> where options include:
> -d32use a 32-bit data model if available
> -d64use a 64-bit data model if available
> -server to select the "server" VM
>   The default VM is server,
>   because you are running on a server-class machine.
> -cp 
> -classpath 
>   A : separated list of directories, JAR archives,
>   and ZIP archives to search for class files.
> -D=
>   set a system property
> -verbose:[class|gc|jni]
>   enable verbose output
> -version  print product version and exit
> -version:
>   require the specified version to run
> -showversion  print product version and continue
> -jre-restrict-search | -no-jre-restrict-search
>   include/exclude user private JREs in the version search
> -? -help  print this help message
> -Xprint help on non-standard options
> -ea[:...|:]
> -enableassertions[:...|:]
>   enable assertions with specified granularity
> -da[:...|:]
> -disableassertions[:...|:]
>   disable assertions with specified granularity
> -esa | -enablesystemassertions
>   enable system assertions
> -dsa | -disablesystemassertions
>   disable system assertions
> -agentlib:[=]
>   load native agent library , e.g. -agentlib:hprof
>   see also, -agentlib:jdwp=help and -agentlib:hprof=help
> -agentpath:[=]
>   load native agent library by full pathname
> -javaagent:[=]
>   load Java programming language agent, see 
> java.lang.instrument
> -splash:
>   show splash screen with specified image
> See http://www.oracle.com/technetwork/java/javase/documentation/index.html 
> for more details.
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
>   at org.apache.hadoop.util.Shell.run(Shell.java:418)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:279)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.PepperdataContainerExecutor.launchContainer(PepperdataContainerExecutor.java:130)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6824) Fill the docs for DataFrame API in SparkR

2015-05-05 Thread Qian Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529997#comment-14529997
 ] 

Qian Huang commented on SPARK-6824:
---

start working on this issue

> Fill the docs for DataFrame API in SparkR
> -
>
> Key: SPARK-6824
> URL: https://issues.apache.org/jira/browse/SPARK-6824
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Priority: Blocker
>
> Some of the DataFrame functions in SparkR do not have complete roxygen docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7372) Multiclass SVM - One vs All wrapper

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7372.
--
Resolution: Won't Fix

This should be a question on user@, I think. It would be better to build this 
once than to specialize it several times. 

If there were some different, special way to handle multiclass in SVM that took 
advantage of how SVMs work, then it might make sense to support an 
SVM-specific implementation. (For example, you certainly don't need one-vs-all 
to do multiclass in decision trees.) But I don't believe there is for SVM.

> Multiclass SVM - One vs All wrapper
> ---
>
> Key: SPARK-7372
> URL: https://issues.apache.org/jira/browse/SPARK-7372
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Reporter: Renat Bekbolatov
>Priority: Trivial
>
> I was wondering if we want to have some support for multiclass SVM in 
> MLlib, for example, through a simple wrapper over binary SVM classifiers with 
> OVA.
> There is already WIP for ML pipeline generalization: SPARK-7015, 
> Multiclass to Binary Reduction.
> However, if users prefer to just have a basic OVA version that runs against 
> SVMWithSGD, they might be able to use it.
> Here is a code sketch: 
> https://github.com/Bekbolatov/spark/commit/463d73323d5f08669d5ae85dc9791b036637c966
> Maybe this could live in a 3rd party utility library (outside Spark MLlib).
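
For readers landing on this thread, a rough sketch of what such a one-vs-all 
wrapper over SVMWithSGD could look like (the trainOneVsAll/predictOneVsAll 
names are illustrative only, not MLlib API, and error handling is omitted):
{code}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// train one binary SVM per class: class k becomes the positive label
def trainOneVsAll(data: RDD[LabeledPoint], numClasses: Int, numIterations: Int): Array[SVMModel] =
  Array.tabulate(numClasses) { k =>
    val binary = data.map(p => LabeledPoint(if (p.label == k.toDouble) 1.0 else 0.0, p.features))
    // clear the threshold so predict() returns the raw margin instead of 0/1
    SVMWithSGD.train(binary, numIterations).clearThreshold()
  }

// pick the class whose binary model reports the largest raw margin
def predictOneVsAll(models: Array[SVMModel], features: Vector): Double =
  models.zipWithIndex.maxBy { case (m, _) => m.predict(features) }._2.toDouble
{code}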



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7386) Spark application level metrics application.$AppName.$number.cores doesn't reset on Standalone Master deployment

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7386:
-
Component/s: Spark Core
   Priority: Minor  (was: Major)

Please set component.
I'm not familiar with this bit, but I recall a similar conversation about a 
cores metric where some metrics were intended to reflect the amount requested 
while the job was running. Is that the intent of this one?

> Spark application level metrics application.$AppName.$number.cores doesn't 
> reset on Standalone Master deployment
> 
>
> Key: SPARK-7386
> URL: https://issues.apache.org/jira/browse/SPARK-7386
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Bharat Venkat
>Priority: Minor
>
> Spark publishes a metric, application.$AppName.$number.cores, that monitors 
> the number of cores assigned to an application.  However, as of 1.3 on a 
> standalone deployment, this metric doesn't go back down to 0 after the 
> application ends.
> It looks like the standalone master holds onto the old state and continues to 
> publish a stale metric.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4208) stack over flow error while using sqlContext.sql

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4208.
--
Resolution: Duplicate

> stack over flow error while using sqlContext.sql
> 
>
> Key: SPARK-4208
> URL: https://issues.apache.org/jira/browse/SPARK-4208
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.1.0
> Environment: windows 7 , prebuilt spark-1.1.0-bin-hadoop2.3
>Reporter: milq
>  Labels: java, spark, sparkcontext, sql
>
> The error happens when using sqlContext.sql:
> 14/11/03 18:54:43 INFO BlockManager: Removing block broadcast_1
> 14/11/03 18:54:43 INFO MemoryStore: Block broadcast_1 of size 2976 dropped 
> from memory (free 28010260
> 14/11/03 18:54:43 INFO ContextCleaner: Cleaned broadcast 1
> root
>  |--  firstName : string (nullable = true)
>  |-- lastNameX: string (nullable = true)
> Exception in thread "main" java.lang.StackOverflowError
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7150) Facilitate random column generation for DataFrames

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7150:
---
Description: 
It would be handy to have easy ways to construct random columns for DataFrames. 
 Proposed API:
{code}
class SQLContext {
  // Return a DataFrame with a single column named "id" that has consecutive 
value from 0 to n.
  def range(n: Long): DataFrame

  def range(n: Long, numPartitions: Int): DataFrame
}
{code}

Usage:
{code}
// uniform distribution
ctx.range(1000).select(rand())

// normal distribution
ctx.range(1000).select(randn())
{code}


We should add a RangeIterator that supports long start/stop position, and then 
use it to create an RDD as the basis for this DataFrame.


  was:
It would be handy to have easy ways to construct random columns for DataFrames. 
 Proposed API:
{code}
class SQLContext {
  // Return a DataFrame with a single column named "id" that has consecutive 
value from 0 to n.
  def range(n: Long): DataFrame

  def range(n: Long, numPartitions: Int): DataFrame
}
{code}

Usage:
{code}
// uniform distribution
ctx.range(1000).select(rand())

// normal distribution
ctx.range(1000).select(randn())
{code}



> Facilitate random column generation for DataFrames
> --
>
> Key: SPARK-7150
> URL: https://issues.apache.org/jira/browse/SPARK-7150
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> It would be handy to have easy ways to construct random columns for 
> DataFrames.  Proposed API:
> {code}
> class SQLContext {
>   // Return a DataFrame with a single column named "id" that has consecutive 
> value from 0 to n.
>   def range(n: Long): DataFrame
>   def range(n: Long, numPartitions: Int): DataFrame
> }
> {code}
> Usage:
> {code}
> // uniform distribution
> ctx.range(1000).select(rand())
> // normal distribution
> ctx.range(1000).select(randn())
> {code}
> We should add a RangeIterator that supports long start/stop position, and 
> then use it to create an RDD as the basis for this DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6267) Python API for IsotonicRegression

2015-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6267.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5890
[https://github.com/apache/spark/pull/5890]

> Python API for IsotonicRegression
> -
>
> Key: SPARK-6267
> URL: https://issues.apache.org/jira/browse/SPARK-6267
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
> Fix For: 1.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7150) Facilitate random column generation for DataFrames

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7150:
---
Description: 
It would be handy to have easy ways to construct random columns for DataFrames. 
 Proposed API:
{code}
class SQLContext {
  // Return a DataFrame with a single column named "id" that has consecutive 
value from 0 to n.
  def range(n: Long): DataFrame

  def range(n: Long, numPartitions: Int): DataFrame
}
{code}

Usage:
{code}
// uniform distribution
ctx.range(1000).select(rand())

// normal distribution
ctx.range(1000).select(randn())
{code}


  was:
It would be handy to have easy ways to construct random columns for DataFrames. 
 Proposed API:
{code}
class SQLContext {
  // Return a DataFrame with a single column named "id" that has consecutive 
value from 0 to n.
  def range(n: Long): DataFrame
}
{code}

Usage:
{code}
// uniform distribution
ctx.range(1000).select(rand())

// normal distribution
ctx.range(1000).select(randn())
{code}



> Facilitate random column generation for DataFrames
> --
>
> Key: SPARK-7150
> URL: https://issues.apache.org/jira/browse/SPARK-7150
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> It would be handy to have easy ways to construct random columns for 
> DataFrames.  Proposed API:
> {code}
> class SQLContext {
>   // Return a DataFrame with a single column named "id" that has consecutive 
> value from 0 to n.
>   def range(n: Long): DataFrame
>   def range(n: Long, numPartitions: Int): DataFrame
> }
> {code}
> Usage:
> {code}
> // uniform distribution
> ctx.range(1000).select(rand())
> // normal distribution
> ctx.range(1000).select(randn())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7358) Move mathfunctions into functions

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7358.

   Resolution: Fixed
Fix Version/s: 1.4.0

> Move mathfunctions into functions
> -
>
> Key: SPARK-7358
> URL: https://issues.apache.org/jira/browse/SPARK-7358
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Burak Yavuz
>Priority: Blocker
> Fix For: 1.4.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1437) Jenkins should build with Java 6

2015-05-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529982#comment-14529982
 ] 

Sean Owen commented on SPARK-1437:
--

Heh, so now that we're all going to Java 7, I don't think we need to actually 
implement this, except possibly in builds for 1.4 and earlier? Master builds 
and PRs will now (continue to) use Java 7.

> Jenkins should build with Java 6
> 
>
> Key: SPARK-1437
> URL: https://issues.apache.org/jira/browse/SPARK-1437
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>Assignee: shane knapp
>Priority: Minor
>  Labels: javac, jenkins
> Attachments: Screen Shot 2014-04-07 at 22.53.56.png
>
>
> Apologies if this was already on someone's to-do list, but I wanted to track 
> this, as it bit two commits in the last few weeks.
> Spark is intended to work with Java 6, and so compiles with source/target 
> 1.6. Java 7 can correctly enforce Java 6 language rules and emit Java 6 
> bytecode. However, unless otherwise configured with -bootclasspath, javac 
> will use its own (Java 7) library classes. This means code that uses classes 
> in Java 7 will be allowed to compile, but the result will fail when run on 
> Java 6.
> This is why you get warnings like ...
> Using /usr/java/jdk1.7.0_51 as default JAVA_HOME.
> ...
> [warn] warning: [options] bootstrap class path not set in conjunction with 
> -source 1.6
> The solution is just to tell Jenkins to use Java 6. This may be stating the 
> obvious, but it should just be a setting under "Configure" for 
> SparkPullRequestBuilder. In our Jenkinses, JDK 6/7/8 are set up; if it's not 
> an option already I'm guessing it's not too hard to get Java 6 configured on 
> the Amplab machines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5041) hive-exec jar should be generated with JDK 6

2015-05-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5041.
--
Resolution: Won't Fix

Yeah, and this was a Hive artifact anyway.

> hive-exec jar should be generated with JDK 6
> 
>
> Key: SPARK-5041
> URL: https://issues.apache.org/jira/browse/SPARK-5041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ted Yu
>  Labels: jdk1.7, maven
>
> Shixiong Zhu first reported the issue where hive-exec-0.12.0-protobuf-2.5.jar 
> cannot be used by a Spark program running JDK 6.
> See http://search-hadoop.com/m/JW1q5YLCNN
> hive-exec-0.12.0-protobuf-2.5.jar was generated with JDK 7. It should be 
> generated with JDK 6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7392) Kryo buffer size can not be larger than 2M

2015-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7392:
---

Assignee: (was: Apache Spark)

> Kryo buffer size can not be larger than 2M
> --
>
> Key: SPARK-7392
> URL: https://issues.apache.org/jira/browse/SPARK-7392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Zhang, Liye
>Priority: Critical
>
> When *spark.kryoserializer.buffer* is set larger than 2048k, an 
> *IllegalArgumentException* will be thrown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7392) Kryo buffer size can not be larger than 2M

2015-05-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529946#comment-14529946
 ] 

Apache Spark commented on SPARK-7392:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/5934

> Kryo buffer size can not be larger than 2M
> --
>
> Key: SPARK-7392
> URL: https://issues.apache.org/jira/browse/SPARK-7392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Zhang, Liye
>Priority: Critical
>
> When *spark.kryoserializer.buffer* is set larger than 2048k, an 
> *IllegalArgumentException* will be thrown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7392) Kryo buffer size can not be larger than 2M

2015-05-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7392:
---

Assignee: Apache Spark

> Kryo buffer size can not be larger than 2M
> --
>
> Key: SPARK-7392
> URL: https://issues.apache.org/jira/browse/SPARK-7392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Zhang, Liye
>Assignee: Apache Spark
>Priority: Critical
>
> When *spark.kryoserializer.buffer* is set larger than 2048k, an 
> *IllegalArgumentException* will be thrown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7392) Kryo buffer size can not be larger than 2M

2015-05-05 Thread Zhang, Liye (JIRA)
Zhang, Liye created SPARK-7392:
--

 Summary: Kryo buffer size can not be larger than 2M
 Key: SPARK-7392
 URL: https://issues.apache.org/jira/browse/SPARK-7392
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Zhang, Liye
Priority: Critical


When *spark.kryoserializer.buffer* is set larger than 2048k, an 
*IllegalArgumentException* will be thrown.
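
A minimal reproduction sketch of the configuration being described (the "4096k" 
value is just an arbitrary size above the reported 2048k limit, and the app 
name is made up):
{code}
import org.apache.spark.SparkConf

// per this report, buffer sizes above 2048k trigger an
// IllegalArgumentException when the KryoSerializer is instantiated
val conf = new SparkConf()
  .setAppName("kryo-buffer-repro")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "4096k")
{code}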



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6812) filter() on DataFrame does not work as expected

2015-05-05 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529935#comment-14529935
 ] 

Shivaram Venkataraman commented on SPARK-6812:
--

Ah, I see - we ran into a similar issue with `head` before, and the workaround 
was to include utils before SparkR -- see R/pkg/inst/profile/shell.R. 
We could do a similar fix for stats.

> filter() on DataFrame does not work as expected
> ---
>
> Key: SPARK-6812
> URL: https://issues.apache.org/jira/browse/SPARK-6812
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Davies Liu
>Assignee: Sun Rui
>Priority: Blocker
>
> {code}
> > filter(df, df$age > 21)
> Error in filter(df, df$age > 21) :
>   no method for coercing this S4 class to a vector
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6812) filter() on DataFrame does not work as expected

2015-05-05 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529930#comment-14529930
 ] 

Sun Rui commented on SPARK-6812:


Interestingly, we have a unit test case for filter() and the test passes. In R, 
if multiple packages define the same name, the definition in the package loaded 
last masks those in the packages loaded before. 
If you use bin/sparkR to start a SparkR shell, the environment list is as 
follows:
 [1] ".GlobalEnv""package:stats" "package:graphics"
 [4] "package:grDevices" "package:datasets"  "package:SparkR"
 [7] "package:utils" "package:methods"   "Autoloads"
[10] "package:base"

You can see that "package:stats" is before "package:SparkR", so its filter() 
function overwrites the one in SparkR.

While in the test procedure, the environment list is different:
.GlobalEnv package:plyr package:SparkR package:testthat package:methods 
package:stats package:graphics package:grDevices package:utils package:datasets 
Autoloads package:base

You can see that package:SparkR is before package:stats. That's why filter() in 
SparkR passes the test.

I don't know why the package loading order is different now.

> filter() on DataFrame does not work as expected
> ---
>
> Key: SPARK-6812
> URL: https://issues.apache.org/jira/browse/SPARK-6812
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Davies Liu
>Assignee: Sun Rui
>Priority: Blocker
>
> {code}
> > filter(df, df$age > 21)
> Error in filter(df, df$age > 21) :
>   no method for coercing this S4 class to a vector
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7248) Random number generators for DataFrames

2015-05-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7248:
---
Fix Version/s: (was: 1.4.0)

> Random number generators for DataFrames
> ---
>
> Key: SPARK-7248
> URL: https://issues.apache.org/jira/browse/SPARK-7248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Burak Yavuz
> Fix For: 1.4.0
>
>
> This is an umbrella JIRA for random number generators for DataFrames. The 
> initial set of RNGs would be `rand` and `randn`, which take a seed.
> {code}
> df.select("*", rand(11L).as("rand"))
> {code}
> Where those methods should live is TBD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


