[jira] [Created] (SPARK-10918) Task failed because executor killed by driver

2015-10-03 Thread Hong Shen (JIRA)
Hong Shen created SPARK-10918:
-

 Summary: Task failed because executor killed by driver
 Key: SPARK-10918
 URL: https://issues.apache.org/jira/browse/SPARK-10918
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Hong Shen


When dynamic allocation is enabled and an executor has been idle longer than the 
timeout, it will be killed by the driver. If a task is offered to that executor at 
the same time, the task will fail because the executor is lost.
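
For reference, a minimal sketch of the configuration under which this race can 
occur (the property names are the standard dynamic-allocation settings; the values 
here are illustrative only):

{code}
import org.apache.spark.SparkConf

// Illustrative values only: dynamic allocation with an idle timeout, which is the
// window in which an executor can be killed while a task is being offered to it.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.shuffle.service.enabled", "true") // required when dynamic allocation is on
{code}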






[jira] [Updated] (SPARK-10912) Improve Spark metrics executor.filesystem

2015-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10912:
--
   Priority: Minor  (was: Major)
Component/s: Deploy

> Improve Spark metrics executor.filesystem
> -
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.5.0
>Reporter: Yongjia Wang
>Priority: Minor
>
> In org.apache.spark.executor.ExecutorSource there are two filesystem metrics: 
> "hdfs" and "file". I started using s3 as the persistent storage with a Spark 
> standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. 
> The 'file' metric appears to cover only the driver reading local files. It would 
> be nice to also report shuffle read/write metrics, since that would help with 
> optimization.
> I think these two things (s3 and shuffle) are very useful and would cover the 
> missing information about Spark IO, especially for an s3 setup.
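
For illustration, a rough sketch of how per-scheme filesystem metrics could be 
exposed through a Codahale MetricRegistry, the same metrics library ExecutorSource 
reports to (this is not Spark's actual ExecutorSource code; the metric names and 
the helper itself are assumptions):

{code}
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

// Illustrative only: expose read/write byte counts for an arbitrary Hadoop
// filesystem scheme (e.g. "s3n") via Hadoop's FileSystem statistics.
def registerFileSystemStats(registry: MetricRegistry, scheme: String): Unit = {
  def stats = FileSystem.getAllStatistics.asScala.find(_.getScheme == scheme)
  registry.register(MetricRegistry.name("filesystem", scheme, "read_bytes"),
    new Gauge[Long] { override def getValue: Long = stats.map(_.getBytesRead).getOrElse(0L) })
  registry.register(MetricRegistry.name("filesystem", scheme, "write_bytes"),
    new Gauge[Long] { override def getValue: Long = stats.map(_.getBytesWritten).getOrElse(0L) })
}
{code}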






[jira] [Assigned] (SPARK-10780) Set initialModel in KMeans in Pipelines API

2015-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10780:


Assignee: Apache Spark

> Set initialModel in KMeans in Pipelines API
> ---
>
> Key: SPARK-10780
> URL: https://issues.apache.org/jira/browse/SPARK-10780
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> This is for the Scala version.  After this is merged, create a JIRA for 
> Python version.






[jira] [Commented] (SPARK-10846) Stray META-INF in directory spark-shell is launched from causes problems

2015-10-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942250#comment-14942250
 ] 

Sean Owen commented on SPARK-10846:
---

I looked a little more and yes, this is easy to reproduce. I verified that Java's 
ServiceLoader doesn't by nature examine the working directory, so something 
is causing the search to include the current working directory. But my 
driver/executor classpath in a simple local run doesn't include the cwd.

Of course the workaround is just to not do that, i.e. not to have META-INF lying 
around. Still, I do not know how the cwd ends up on the classpath in this 
situation.
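
For reference, a minimal sketch of the ServiceLoader lookup involved (a generic 
illustration, not the exact code path Spark or Hadoop uses): ServiceLoader reads 
every META-INF/services/org.apache.hadoop.fs.FileSystem entry on the classpath, 
and a stray entry naming a missing class surfaces as a ServiceConfigurationError.

{code}
import java.util.{ServiceConfigurationError, ServiceLoader}
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

// List the FileSystem providers visible to the current classloader; a stray
// META-INF/services entry pointing at a missing class shows up here as an error.
try {
  ServiceLoader.load(classOf[FileSystem]).iterator().asScala
    .foreach(fs => println(fs.getClass.getName))
} catch {
  case e: ServiceConfigurationError => println(s"Bad provider entry: ${e.getMessage}")
}
{code}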

> Stray META-INF in directory spark-shell is launched from causes problems
> 
>
> Key: SPARK-10846
> URL: https://issues.apache.org/jira/browse/SPARK-10846
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.0
>Reporter: Ryan Williams
>Priority: Minor
>
> I observed some perplexing errors while running 
> {{$SPARK_HOME/bin/spark-shell}} yesterday (with {{$SPARK_HOME}} pointing at a 
> clean 1.5.0 install):
> {code}
> java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: 
> Provider org.apache.hadoop.fs.s3.S3FileSystem not found
> {code}
> while initializing {{HiveContext}}; full example output is 
> [here|https://gist.github.com/ryan-williams/34210ad640687113e5c3#file-1-5-0-failure].
> The issue was that a stray {{META-INF}} directory from some other project I'd 
> built months ago was sitting in the directory that I'd run {{spark-shell}} 
> from (*not* in my {{$SPARK_HOME}}, just in the directory I happened to be in 
> when I ran {{$SPARK_HOME/bin/spark-shell}}). 
> That {{META-INF}} had a {{services/org.apache.hadoop.fs.FileSystem}} file 
> specifying some provider classes ({{S3FileSystem}} in the example above) that 
> were unsurprisingly not resolvable by Spark.
> I'm not sure if this is purely my fault for attempting to run Spark from a 
> directory with another project's config files lying around, but I find it 
> somewhat surprising that, given a {{$SPARK_HOME}} pointing to a clean Spark 
> install, {{$SPARK_HOME/bin/spark-shell}} picks up detritus from the 
> {{cwd}} it is called from, so I wanted to at least document it here.






[jira] [Resolved] (SPARK-10476) Enable common RDD operations on standard Scala collections

2015-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10476.
---
Resolution: Won't Fix

> Enable common RDD operations on standard Scala collections
> --
>
> Key: SPARK-10476
> URL: https://issues.apache.org/jira/browse/SPARK-10476
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Simeon Simeonov
>Priority: Minor
>  Labels: core, mapPartitions, rdd
>
> A common pattern in Spark development is to look for opportunities to 
> leverage data locality using mechanisms such as {{mapPartitions}}. Often this 
> happens when an existing set of RDD transformations is refactored to improve 
> performance. At that point, significant code refactoring may be required 
> because the input is {{Iterator\[T]}} as opposed to an RDD. The most common 
> examples we've encountered so far involve the {{*ByKey}} methods, {{sample}} 
> and {{takeSample}}. We have also observed cases where, due to changes in the 
> structure of the data, use of {{mapPartitions}} is no longer possible and the code 
> has to be converted to use the RDD API.
> If data manipulation through the RDD API could be applied to the standard 
> Scala data structures then refactoring Spark data pipelines would become 
> faster and less bug-prone. Also, and this is no small benefit, the 
> thoughtfulness and experience of the Spark community could spread to the 
> broader Scala community.
> There are multiple approaches to solving this problem, including but not 
> limited to creating a set of {{Local*RDD}} classes and/or adding implicit 
> conversions.
> Here is a simple example meant to be short as opposed to complete or 
> performance-optimized:
> {code}
> implicit class LocalRDD[T](it: Iterator[T]) extends Iterable[T] {
>   def this(collection: Iterable[T]) = this(collection.toIterator)
>   def iterator = it
> }
> implicit class LocalPairRDD[K, V](it: Iterator[(K, V)]) extends Iterable[(K, V)] {
>   def this(collection: Iterable[(K, V)]) = this(collection.toIterator)
>   def iterator = it
>   def groupByKey() = new LocalPairRDD[K, Iterable[V]](
>     groupBy(_._1).map { case (k, valuePairs) => (k, valuePairs.map(_._2)) }
>   )
> }
> sc.
>   parallelize(Array((1, 10), (2, 10), (1, 20))).
>   repartition(1).
>   mapPartitions(data => data.groupByKey().toIterator).
>   take(2)
> // Array[(Int, Iterable[Int])] = Array((2,List(10)), (1,List(10, 20)))
> {code}






[jira] [Commented] (SPARK-10889) Upgrade Kinesis Client Library

2015-10-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942252#comment-14942252
 ] 

Sean Owen commented on SPARK-10889:
---

I'm reluctant to back-port a minor version bump into a maintenance release without 
knowing much about the change. Still, I expect we'd find there's no problem.

> Upgrade Kinesis Client Library
> --
>
> Key: SPARK-10889
> URL: https://issues.apache.org/jira/browse/SPARK-10889
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Avrohom Katz
>Priority: Minor
>
> Kinesis Client Library added a custom CloudWatch metric, MillisBehindLatest, in 
> 1.3.0. This is very important for capacity planning and alerting.
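
For illustration, the upgrade itself is essentially a dependency bump; in sbt 
syntax it might look like the following (the coordinates are the public KCL 
artifact; the exact target version is whatever this issue settles on):

{code}
// Hypothetical sketch of the bump (sbt syntax); KCL 1.3.0 is the release that
// introduced the MillisBehindLatest CloudWatch metric mentioned above.
libraryDependencies += "com.amazonaws" % "amazon-kinesis-client" % "1.3.0"
{code}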






[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API

2015-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942217#comment-14942217
 ] 

Apache Spark commented on SPARK-10780:
--

User 'jayantshekhar' has created a pull request for this issue:
https://github.com/apache/spark/pull/8972

> Set initialModel in KMeans in Pipelines API
> ---
>
> Key: SPARK-10780
> URL: https://issues.apache.org/jira/browse/SPARK-10780
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is for the Scala version.  After this is merged, create a JIRA for 
> Python version.






[jira] [Assigned] (SPARK-10780) Set initialModel in KMeans in Pipelines API

2015-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10780:


Assignee: (was: Apache Spark)

> Set initialModel in KMeans in Pipelines API
> ---
>
> Key: SPARK-10780
> URL: https://issues.apache.org/jira/browse/SPARK-10780
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is for the Scala version.  After this is merged, create a JIRA for 
> Python version.






[jira] [Updated] (SPARK-10189) python rdd socket connection problem

2015-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10189:
--
Target Version/s:   (was: 1.4.1)

> python rdd socket connection problem
> 
>
> Key: SPARK-10189
> URL: https://issues.apache.org/jira/browse/SPARK-10189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: ABHISHEK CHOUDHARY
>  Labels: pyspark, socket
>
> I am trying to use wholeTextFiles with pyspark, and now I am getting the 
> same error:
> {code}
> textFiles = sc.wholeTextFiles('/file/content')
> textFiles.take(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p, True)
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
>     return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
>     raise Exception("could not open socket")
> Exception: could not open socket
> >>> 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
> java.net.SocketTimeoutException: Accept timed out
>     at java.net.PlainSocketImpl.socketAccept(Native Method)
>     at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
>     at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>     at java.net.ServerSocket.accept(ServerSocket.java:513)
>     at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
> {code}
> The current code in rdd.py is:
> {code:title=rdd.py|borderStyle=solid}
> def _load_from_socket(port, serializer):
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most of IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>         af, socktype, proto, canonname, sa = res
>         try:
>             sock = socket.socket(af, socktype, proto)
>             sock.settimeout(3)
>             sock.connect(sa)
>         except socket.error:
>             sock = None
>             continue
>         break
>     if not sock:
>         raise Exception("could not open socket")
>     try:
>         rf = sock.makefile("rb", 65536)
>         for item in serializer.load_stream(rf):
>             yield item
>     finally:
>         sock.close()
> {code}
> On further investigation, I realized that in context.py, runJob does not 
> actually trigger the server, so there is nothing to connect to:
> {code:title=context.py|borderStyle=solid}
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> {code}






[jira] [Updated] (SPARK-10805) JSON Data Frame does not return correct string lengths

2015-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10805:
--
Target Version/s:   (was: 1.4.1)
Priority: Minor  (was: Critical)

> JSON Data Frame does not return correct string lengths
> --
>
> Key: SPARK-10805
> URL: https://issues.apache.org/jira/browse/SPARK-10805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Jeff Li
>Priority: Minor
>
> Here is the sample code to run the test:
> {code}
> @Test
> public void runSchemaTest() throws Exception {
>     DataFrame jsonDataFrame =
>         sqlContext.jsonFile("src/test/resources/jsontransform/json.sampledata.json");
>     jsonDataFrame.printSchema();
>     StructType jsonSchema = jsonDataFrame.schema();
>     StructField[] dataFields = jsonSchema.fields();
>     for (int fieldIndex = 0; fieldIndex < dataFields.length; fieldIndex++) {
>         StructField aField = dataFields[fieldIndex];
>         DataType aType = aField.dataType();
>         System.out.println("name: " + aField.name() + " type: " + aType.typeName()
>             + " size: " + aType.defaultSize());
>     }
> }
> {code}
> It prints:
> {code}
> name: _id type: string size: 4096
> name: firstName type: string size: 4096
> name: lastName type: string size: 4096
> {code}
> In my case the values are short (_id: 1 character, firstName: 4 characters, 
> lastName: 7 characters).
> The Spark JSON DataFrame should have a way to tell the maximum length of 
> each JSON string element in the JSON document.
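
As a side note, the observed maximum length per column can already be computed at 
runtime, since defaultSize() is only a fixed per-type estimate. A minimal Scala 
sketch, assuming a Spark version where org.apache.spark.sql.functions.length is 
available and reusing the column names from the report above:

{code}
import org.apache.spark.sql.functions.{length, max}

// Compute the real maximum string length per column; defaultSize() for string
// columns is a fixed estimate (4096 in the output above), not the observed length.
val maxLens = jsonDataFrame.agg(
  max(length(jsonDataFrame("_id"))).as("max_id_len"),
  max(length(jsonDataFrame("firstName"))).as("max_firstName_len"),
  max(length(jsonDataFrame("lastName"))).as("max_lastName_len"))
maxLens.show()
{code}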






[jira] [Resolved] (SPARK-10189) python rdd socket connection problem

2015-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10189.
---
Resolution: Not A Problem

> python rdd socket connection problem
> 
>
> Key: SPARK-10189
> URL: https://issues.apache.org/jira/browse/SPARK-10189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: ABHISHEK CHOUDHARY
>  Labels: pyspark, socket
>
> I am trying to use wholeTextFiles with pyspark, and now I am getting the 
> same error:
> {code}
> textFiles = sc.wholeTextFiles('/file/content')
> textFiles.take(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p, True)
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
>     return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>   File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
>     raise Exception("could not open socket")
> Exception: could not open socket
> >>> 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
> java.net.SocketTimeoutException: Accept timed out
>     at java.net.PlainSocketImpl.socketAccept(Native Method)
>     at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
>     at java.net.ServerSocket.implAccept(ServerSocket.java:545)
>     at java.net.ServerSocket.accept(ServerSocket.java:513)
>     at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
> {code}
> The current code in rdd.py is:
> {code:title=rdd.py|borderStyle=solid}
> def _load_from_socket(port, serializer):
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most of IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>         af, socktype, proto, canonname, sa = res
>         try:
>             sock = socket.socket(af, socktype, proto)
>             sock.settimeout(3)
>             sock.connect(sa)
>         except socket.error:
>             sock = None
>             continue
>         break
>     if not sock:
>         raise Exception("could not open socket")
>     try:
>         rf = sock.makefile("rb", 65536)
>         for item in serializer.load_stream(rf):
>             yield item
>     finally:
>         sock.close()
> {code}
> On further investigation, I realized that in context.py, runJob does not 
> actually trigger the server, so there is nothing to connect to:
> {code:title=context.py|borderStyle=solid}
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> {code}






[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-10-03 Thread Russell Pierce (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942345#comment-14942345
 ] 

Russell Pierce commented on SPARK-9325:
---

From an outside perspective, I'll +1 this.

In an earlier version of SparkR this wasn't supported, and its absence seemed 
ridiculous to me. It didn't help that the errors were all the same, leaving the 
source of the error unclear (thank goodness that is resolved: 
https://issues.apache.org/jira/browse/SPARK-8742). In my view, as someone who 
programs heavily in R, this is a key feature for SparkR to be of use to me. In 
my use case (again, not leveraging much of the power of Spark, but leveraging my 
existing skills), I need Spark to provide a large back-end cached data 
warehouse that I can then subset to pull workable-size pieces into R and do 
arbitrary processing. If you give a user like me collect(Column), then there 
is no need for you to give me count(Column), sum(Column), avg(Column), etc. - 
I'll do whatever processing I still need in R. If I /really/ need to do 
it for the entire frame, I can just combine the results of my subsets (in R); 
but in that case I'm going to have to thrash out to disk for everything anyway 
and may just opt to do it via Hadoop/Hive.

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.






[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-10-03 Thread Russell Pierce (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942556#comment-14942556
 ] 

Russell Pierce commented on SPARK-9325:
---

I'll try again in 1.5 in a few days and see. I took Felix's comment from 
14/Sep/15 03:50 to mean that the line in question doesn't behave as expected, and 
that the problem wasn't just with me or my data source - but maybe I'm wrong.

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.






[jira] [Issue Comment Deleted] (SPARK-10669) Link to each language's API in codetabs in ML docs: spark.mllib

2015-10-03 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-10669:

Comment: was deleted

(was: Hi Joseph,

I'm just double-checking with you that I'm doing this right. I just made a test 
change below by adding an external link; could you please confirm I'm doing the 
right thing?
https://github.com/keypointt/spark/commit/f8289891d5b32fffdc6a4ce077d8d206e015119f

Also, I'm not quite sure what you mean by "codetabs". I see some terms 
linking to Wikipedia, and some to Spark internal files. Could you please give 
me a quick example of this?

As I understand it, for example, in the "ChiSqSelector" section of 
https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md,
a Wikipedia link should be added to "Feature selection".

As I understand it, all of the .md files below should be modified to link to the 
APIs in a consistent way:
* mllib-classification-regression.md
* mllib-clustering.md   
* mllib-collaborative-filtering.md  
* mllib-data-types.md   
* mllib-decision-tree.md
* mllib-dimensionality-reduction.md 
* mllib-ensembles.md
* mllib-evaluation-metrics.md   
* mllib-feature-extraction.md   
* mllib-frequent-pattern-mining.md  
* mllib-guide.md
* mllib-isotonic-regression.md  
* mllib-linear-methods.md
* mllib-migration-guides.md
* mllib-naive-bayes.md
* mllib-optimization.md
* mllib-pmml-model-export.md
* mllib-statistics.md

Thank you very much! :))

> Link to each language's API in codetabs in ML docs: spark.mllib
> ---
>
> Key: SPARK-10669
> URL: https://issues.apache.org/jira/browse/SPARK-10669
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> In the Markdown docs for the spark.mllib Programming Guide, we have code 
> examples with codetabs for each language.  We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this.  For an example of what we want to do, see the "ChiSqSelector" section 
> in 
> [https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md]
> This JIRA is just for spark.mllib, not spark.ml






[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-10-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942535#comment-14942535
 ] 

Reynold Xin commented on SPARK-9325:


But isn't this just

collect(select(df, df$col)) ?
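
(For comparison, a small sketch of the Scala DataFrame analogue of that 
expression, with a df holding an Age column assumed as in the description:)

{code}
// Scala analogue of collect(select(df, df$Age)): select the column, then collect it locally.
val ages: Array[Int] = df.select("Age").collect().map(_.getInt(0))
{code}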

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.






[jira] [Commented] (SPARK-10342) Cooperative memory management

2015-10-03 Thread FangzhouXing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942475#comment-14942475
 ] 

FangzhouXing commented on SPARK-10342:
--

We are interested in working on this issue. Do you have any suggestions on 
where to start?

Sorry for all the questions; this is my first time contributing to Spark.

> Cooperative memory management
> -
>
> Key: SPARK-10342
> URL: https://issues.apache.org/jira/browse/SPARK-10342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Priority: Critical
>
> We have had memory starvation problems for a long time; they became worse in 1.5 
> since we use larger pages.
> In order to increase memory usage (reduce unnecessary spilling) and also reduce 
> the risk of OOM, we should manage memory in a cooperative way: every memory 
> consumer should also be responsive to requests from others to release memory 
> (by spilling).
> Requests for memory can differ: a hard requirement (will crash if 
> not allocated) or a soft requirement (worse performance if not allocated). The 
> costs of spilling also differ. We could introduce some kind of 
> priority to make them work together better.
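
To make the idea concrete, here is a minimal, hypothetical sketch of such a 
cooperative protocol (this is not Spark's actual memory manager; every name here 
is invented for illustration):

{code}
// Each consumer can be asked to release (spill) memory when others need more.
trait CooperativeConsumer {
  def used: Long
  /** Release up to `size` bytes, e.g. by spilling to disk; returns bytes actually freed. */
  def spill(size: Long): Long
}

class CooperativeMemoryManager(maxMemory: Long) {
  private var available = maxMemory
  private val consumers = scala.collection.mutable.ArrayBuffer.empty[CooperativeConsumer]

  def register(c: CooperativeConsumer): Unit = synchronized { consumers += c }

  /** Try to grant `size` bytes; ask other consumers to spill if not enough is free. */
  def acquire(size: Long, requester: CooperativeConsumer): Long = synchronized {
    var needed = size - available
    // A simple priority: ask the consumers holding the most memory to spill first.
    for (c <- consumers.sortBy(c => -c.used) if needed > 0 && c != requester) {
      val freed = c.spill(needed)
      available += freed
      needed -= freed
    }
    val granted = math.min(size, available)
    available -= granted
    granted
  }
}
{code}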






[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-10-03 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942483#comment-14942483
 ] 

Josh Rosen commented on SPARK-6270:
---

To my knowledge, this has _not_ been fixed in 1.5.0 / 1.5.1.

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
>Reporter: Tathagata Das
>Priority: Critical
>
> If event logging is enabled, the Spark Standalone Master tries to 
> recreate the web UI of a completed Spark application from its event logs. 
> However, if the event log is huge (e.g. for a Spark Streaming application), 
> the Master hangs in its attempt to read the log and recreate the web UI. This 
> hang makes the whole standalone cluster unusable. 
> The workaround is to disable event logging.
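
For reference, the workaround amounts to turning off the standard event-log 
setting; a minimal sketch (the property name is the standard one, the value 
illustrative):

{code}
import org.apache.spark.SparkConf

// Workaround sketch: with event logging disabled, the standalone Master has no
// event log to replay when it rebuilds a finished application's web UI.
val conf = new SparkConf().set("spark.eventLog.enabled", "false")
{code}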






[jira] [Commented] (SPARK-10669) Link to each language's API in codetabs in ML docs: spark.mllib

2015-10-03 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942499#comment-14942499
 ] 

Xin Ren commented on SPARK-10669:
-

I just found that all the links to the "/docs/api/" folder are invalid now. It seems 
the "/docs/api/" folder has been removed on the master branch.

https://github.com/apache/spark/blob/master/docs/api 

> Link to each language's API in codetabs in ML docs: spark.mllib
> ---
>
> Key: SPARK-10669
> URL: https://issues.apache.org/jira/browse/SPARK-10669
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> In the Markdown docs for the spark.mllib Programming Guide, we have code 
> examples with codetabs for each language.  We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this.  For an example of what we want to do, see the "ChiSqSelector" section 
> in 
> [https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md]
> This JIRA is just for spark.mllib, not spark.ml






[jira] [Comment Edited] (SPARK-10669) Link to each language's API in codetabs in ML docs: spark.mllib

2015-10-03 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942499#comment-14942499
 ] 

Xin Ren edited comment on SPARK-10669 at 10/3/15 11:33 PM:
---

I just found that all the links to the "/docs/api/" folder are invalid now. It seems 
the "/docs/api/" folder has been removed on the master branch.

https://github.com/apache/spark/blob/master/docs/api 

And links in this MD file:

https://github.com/apache/spark/blob/master/docs/api.md


was (Author: iamshrek):
I just found that all the links to the "/docs/api/" folder are invalid now. It seems 
the "/docs/api/" folder has been removed on the master branch.

https://github.com/apache/spark/blob/master/docs/api 

> Link to each language's API in codetabs in ML docs: spark.mllib
> ---
>
> Key: SPARK-10669
> URL: https://issues.apache.org/jira/browse/SPARK-10669
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> In the Markdown docs for the spark.mllib Programming Guide, we have code 
> examples with codetabs for each language.  We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this.  For an example of what we want to do, see the "ChiSqSelector" section 
> in 
> [https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md]
> This JIRA is just for spark.mllib, not spark.ml






[jira] [Resolved] (SPARK-10904) select(df, c("col1", "col2")) fails

2015-10-03 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10904.
---
   Resolution: Fixed
Fix Version/s: 1.5.2
   1.6.0

Issue resolved by pull request 8961
[https://github.com/apache/spark/pull/8961]

>   select(df, c("col1", "col2")) fails
> -
>
> Key: SPARK-10904
> URL: https://issues.apache.org/jira/browse/SPARK-10904
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Weiqiang Zhuang
> Fix For: 1.6.0, 1.5.2
>
>
> The help page for 'select' gives an example of 
>   select(df, c("col1", "col2"))
> However, this fails with an assertion error:
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92)
>   at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99)
>   at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63)
>   at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52)
>   at 
> org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182)
>   at 
> org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181)
> After that, none of the functions work; they fail with the following error:
> > head(df)
>  Error in if (returnStatus != 0) { : argument is of length zero 






[jira] [Updated] (SPARK-10904) select(df, c("col1", "col2")) fails

2015-10-03 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-10904:
--
Assignee: Felix Cheung

>   select(df, c("col1", "col2")) fails
> -
>
> Key: SPARK-10904
> URL: https://issues.apache.org/jira/browse/SPARK-10904
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Weiqiang Zhuang
>Assignee: Felix Cheung
> Fix For: 1.5.2, 1.6.0
>
>
> The help page for 'select' gives an example of 
>   select(df, c("col1", "col2"))
> However, this fails with an assertion error:
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:92)
>   at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:99)
>   at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:63)
>   at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:52)
>   at 
> org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:182)
>   at 
> org.apache.spark.api.r.RBackendHandler$$anonfun$readArgs$1.apply(RBackendHandler.scala:181)
> After that, none of the functions work; they fail with the following error:
> > head(df)
>  Error in if (returnStatus != 0) { : argument is of length zero 






[jira] [Comment Edited] (SPARK-10669) Link to each language's API in codetabs in ML docs: spark.mllib

2015-10-03 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942499#comment-14942499
 ] 

Xin Ren edited comment on SPARK-10669 at 10/3/15 11:50 PM:
---

I just found that all the links to the "/docs/api/" folder are invalid now. It seems 
the "/docs/api/" folder has been removed on the master branch.

https://github.com/apache/spark/blob/master/docs/api 

And links in this MD file:

https://github.com/apache/spark/blob/master/docs/api.md

Should I create a separate ticket for the invalid links in the file below?
https://github.com/apache/spark/blob/master/docs/api.md


was (Author: iamshrek):
I just found that all the links to the "/docs/api/" folder are invalid now. It seems 
the "/docs/api/" folder has been removed on the master branch.

https://github.com/apache/spark/blob/master/docs/api 

And links in this MD file:

https://github.com/apache/spark/blob/master/docs/api.md

> Link to each language's API in codetabs in ML docs: spark.mllib
> ---
>
> Key: SPARK-10669
> URL: https://issues.apache.org/jira/browse/SPARK-10669
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> In the Markdown docs for the spark.mllib Programming Guide, we have code 
> examples with codetabs for each language.  We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this.  For an example of what we want to do, see the "ChiSqSelector" section 
> in 
> [https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md]
> This JIRA is just for spark.mllib, not spark.ml






[jira] [Comment Edited] (SPARK-10669) Link to each language's API in codetabs in ML docs: spark.mllib

2015-10-03 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942499#comment-14942499
 ] 

Xin Ren edited comment on SPARK-10669 at 10/4/15 12:31 AM:
---

Oh, sorry, I didn't run "jekyll build".

Now everything works fine, and I'm working on adding the API links. :P


was (Author: iamshrek):
I just found that all the links to the "/docs/api/" folder are invalid now. It seems 
the "/docs/api/" folder has been removed on the master branch.

https://github.com/apache/spark/blob/master/docs/api 

And links in this MD file:

https://github.com/apache/spark/blob/master/docs/api.md

Should I create a separate ticket for the invalid links in the file below?
https://github.com/apache/spark/blob/master/docs/api.md

> Link to each language's API in codetabs in ML docs: spark.mllib
> ---
>
> Key: SPARK-10669
> URL: https://issues.apache.org/jira/browse/SPARK-10669
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> In the Markdown docs for the spark.mllib Programming Guide, we have code 
> examples with codetabs for each language.  We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this.  For an example of what we want to do, see the "ChiSqSelector" section 
> in 
> [https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md]
> This JIRA is just for spark.mllib, not spark.ml






[jira] [Resolved] (SPARK-7275) Make LogicalRelation public

2015-10-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7275.

   Resolution: Fixed
 Assignee: Glenn Weidner
Fix Version/s: 1.6.0

I've merged it - but please do keep in mind that this is internal and can be 
moved around and changed with every new release of Spark.


> Make LogicalRelation public
> ---
>
> Key: SPARK-7275
> URL: https://issues.apache.org/jira/browse/SPARK-7275
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Santiago M. Mola
>Assignee: Glenn Weidner
>Priority: Minor
> Fix For: 1.6.0
>
>
> It seems LogicalRelation is the only part of the LogicalPlan that is not 
> public. This makes it harder to work with full logical plans from third-party 
> packages.
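
As a rough illustration of why external packages want this public (a sketch only; 
the package location of LogicalRelation and the use of DataFrame internals here 
are assumptions about the 1.6 code base):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.LogicalRelation

// Walk a DataFrame's analyzed logical plan and pull out the underlying
// BaseRelations - the kind of plan inspection third-party packages need.
def baseRelations(df: DataFrame) =
  df.queryExecution.analyzed.collect { case l: LogicalRelation => l.relation }
{code}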


