[jira] [Commented] (SPARK-10644) Applications wait even if free executors are available

2015-09-20 Thread Madhusudanan Kandasamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900277#comment-14900277
 ] 

Madhusudanan Kandasamy commented on SPARK-10644:


The standalone cluster manager launches executors during an application's 
registration. So when you say "Number of executors: 63", do you mean the total 
number of cores available across all the workers?

Can you give the values for spark.executor.cores and spark.cores.max?
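
For reference, a minimal configuration sketch of how these two settings interact (the 
values are assumptions for illustration, not the reporter's actual settings):

{code}
// Hypothetical configuration sketch; the values are assumptions, not the reporter's.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("job-1")                // one of the 4 jobs from the repro steps
  .set("spark.executor.cores", "1")   // cores given to each executor (assumed)
  .set("spark.cores.max", "10")       // per-application cap, as in "max cores set to 10"
{code}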

> Applications wait even if free executors are available
> --
>
> Key: SPARK-10644
> URL: https://issues.apache.org/jira/browse/SPARK-10644
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0
> Environment: RHEL 6.5 64 bit
>Reporter: Balagopal Nair
>
> Number of workers: 21
> Number of executors: 63
> Steps to reproduce:
> 1. Run 4 jobs each with max cores set to 10
> 2. The first 3 jobs run with 10 each. (30 executors consumed so far)
> 3. The 4th job waits even though there are 33 idle executors.
> The reason is that a job will not get executors unless 
> the total number of EXECUTORS in use < the number of WORKERS
> If there are executors available, resources should be allocated to the 
> pending job.






[jira] [Updated] (SPARK-10724) SQL's floor() returns DOUBLE

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10724:
--
Assignee: (was: Xiangrui Meng)

> SQL's floor() returns DOUBLE
> 
>
> Key: SPARK-10724
> URL: https://issues.apache.org/jira/browse/SPARK-10724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: sql
>
> This is a change in behavior from 1.4.1 where {{floor}} returns a BIGINT. 
> {code}
> scala> sql("select floor(1)").printSchema
> root
>  |-- _c0: double (nullable = true)
> {code}
> In the [Hive Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF] 
> {{floor}} is defined to return BIGINT.
> This is a significant issue because it changes the DataFrame schema.
> I wonder what caused this and whether other SQL functions are affected.






[jira] [Assigned] (SPARK-10724) SQL's floor() returns DOUBLE

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-10724:
-

Assignee: Xiangrui Meng

> SQL's floor() returns DOUBLE
> 
>
> Key: SPARK-10724
> URL: https://issues.apache.org/jira/browse/SPARK-10724
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Xiangrui Meng
>Priority: Critical
>  Labels: sql
>
> This is a change in behavior from 1.4.1 where {{floor}} returns a BIGINT. 
> {code}
> scala> sql("select floor(1)").printSchema
> root
>  |-- _c0: double (nullable = true)
> {code}
> In the [Hive Language 
> Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF] 
> {{floor}} is defined to return BIGINT.
> This is a significant issue because it changes the DataFrame schema.
> I wonder what caused this and whether other SQL functions are affected.






[jira] [Resolved] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10631.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8834
[https://github.com/apache/spark/pull/8834]

> Add missing API doc in pyspark.mllib.linalg.Vector
> --
>
> Key: SPARK-10631
> URL: https://issues.apache.org/jira/browse/SPARK-10631
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Vinod KC
>Priority: Minor
> Fix For: 1.6.0
>
>
> There are some missing API docs in pyspark.mllib.linalg.Vector (including 
> DenseVector and SparseVector). We should add them based on their Scala 
> counterparts.
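
For context, a quick look at the Scala counterparts whose documentation the Python 
classes should mirror (standard MLlib API; the values are arbitrary):

{code}
// Standard MLlib API, shown for context; the values are arbitrary.
import org.apache.spark.mllib.linalg.Vectors

val dense = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
{code}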







[jira] [Created] (SPARK-10724) SQL's floor() returns DOUBLE

2015-09-20 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-10724:
---

 Summary: SQL's floor() returns DOUBLE
 Key: SPARK-10724
 URL: https://issues.apache.org/jira/browse/SPARK-10724
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Simeon Simeonov
Priority: Critical


This is a change in behavior from 1.4.1 where {{floor}} returns a BIGINT. 

{code}
scala> sql("select floor(1)").printSchema
root
 |-- _c0: double (nullable = true)
{code}

In the [Hive Language 
Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF] 
{{floor}} is defined to return BIGINT.

This is a significant issue because it changes the DataFrame schema.

I wonder what caused this and whether other SQL functions are affected.
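
A possible workaround sketch until the return type is settled (an assumption, not the 
eventual fix): cast the result back explicitly so the DataFrame schema keeps an 
integral column.

{code}
// Workaround sketch (assumption, not the eventual fix): cast floor()'s result to BIGINT.
scala> sql("select cast(floor(1) as bigint) as c0").printSchema
{code}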






[jira] [Commented] (SPARK-8000) SQLContext.read.load() should be able to auto-detect input data

2015-09-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900204#comment-14900204
 ] 

Reynold Xin commented on SPARK-8000:


Yes - sounds good.

> SQLContext.read.load() should be able to auto-detect input data
> ---
>
> Key: SPARK-8000
> URL: https://issues.apache.org/jira/browse/SPARK-8000
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> If it is a parquet file, use parquet. If it is a JSON file, use JSON. If it 
> is an ORC file, use ORC. If it is a CSV file, use CSV.
> Maybe Spark SQL can also write an output metadata file to specify the schema 
> & data source that's used.






[jira] [Updated] (SPARK-10694) Prevent Data Loss in Spark Streaming when used with OFF_HEAP ExternalBlockStore (Tachyon)

2015-09-20 Thread Dibyendu Bhattacharya (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dibyendu Bhattacharya updated SPARK-10694:
--
Component/s: Block Manager

> Prevent Data Loss in Spark Streaming when used with OFF_HEAP 
> ExternalBlockStore (Tachyon)
> -
>
> Key: SPARK-10694
> URL: https://issues.apache.org/jira/browse/SPARK-10694
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Streaming
>Affects Versions: 1.5.0
>Reporter: Dibyendu Bhattacharya
>
> If a Streaming application stores its blocks OFF_HEAP, it may not need any 
> WAL-like feature to recover from driver failure. As long as the writing of blocks 
> to Tachyon from the Streaming receiver is durable, the blocks should be recoverable 
> from Tachyon directly on driver failure. 
> This can solve the issue of the expensive WAL write and of duplicating the blocks 
> both in MEMORY and in the WAL, and it can also guarantee an end-to-end no-data-loss 
> channel using the OFF_HEAP store.
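
A minimal sketch of the setup being described, assuming a socket source purely for 
illustration (the real application may use any receiver):

{code}
// Illustrative sketch; the socket source and batch interval are assumptions.
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// Store received blocks OFF_HEAP (ExternalBlockStore / Tachyon) instead of in a WAL.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.OFF_HEAP)
{code}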






[jira] [Assigned] (SPARK-10723) Add RDD.reduceOption method

2015-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10723:


Assignee: (was: Apache Spark)

> Add RDD.reduceOption method
> ---
>
> Key: SPARK-10723
> URL: https://issues.apache.org/jira/browse/SPARK-10723
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Tatsuya Atsumi
>Priority: Minor
>
> h2. Problem
> RDD.reduce throws an exception if the RDD is empty.
> That is appropriate behavior when the RDD is expected to be non-empty, but when it 
> is not known until runtime whether the RDD is empty, the call to reduce needs to be 
> wrapped in try-catch to be made safely. 
> Example Code
> {code}
> // This RDD may be empty or not
> val rdd: RDD[Int] = originalRdd.filter(_ > 10)
> val reduced: Option[Int] = try {
>   Some(rdd.reduce(_ + _))
> } catch {
>   // if rdd is empty return None.
>   case e:UnsupportedOperationException => None
> }
> {code}
> h2. Improvement idea
> Scala’s List has a reduceOption method, which returns None if the List is empty.
> If RDD had a reduceOption API like Scala’s List, it would become easy to handle the 
> above case.
> Example Code
> {code}
> val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _)
> {code}






[jira] [Commented] (SPARK-10723) Add RDD.reduceOption method

2015-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900184#comment-14900184
 ] 

Apache Spark commented on SPARK-10723:
--

User 'Attsun1031' has created a pull request for this issue:
https://github.com/apache/spark/pull/8845

> Add RDD.reduceOption method
> ---
>
> Key: SPARK-10723
> URL: https://issues.apache.org/jira/browse/SPARK-10723
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Tatsuya Atsumi
>Priority: Minor
>
> h2. Problem
> RDD.reduce throws an exception if the RDD is empty.
> That is appropriate behavior when the RDD is expected to be non-empty, but when it 
> is not known until runtime whether the RDD is empty, the call to reduce needs to be 
> wrapped in try-catch to be made safely. 
> Example Code
> {code}
> // This RDD may be empty or not
> val rdd: RDD[Int] = originalRdd.filter(_ > 10)
> val reduced: Option[Int] = try {
>   Some(rdd.reduce(_ + _))
> } catch {
>   // if rdd is empty return None.
>   case e:UnsupportedOperationException => None
> }
> {code}
> h2. Improvement idea
> Scala’s List has a reduceOption method, which returns None if the List is empty.
> If RDD had a reduceOption API like Scala’s List, it would become easy to handle the 
> above case.
> Example Code
> {code}
> val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _)
> {code}






[jira] [Assigned] (SPARK-10723) Add RDD.reduceOption method

2015-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10723:


Assignee: Apache Spark

> Add RDD.reduceOption method
> ---
>
> Key: SPARK-10723
> URL: https://issues.apache.org/jira/browse/SPARK-10723
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Tatsuya Atsumi
>Assignee: Apache Spark
>Priority: Minor
>
> h2. Problem
> RDD.reduce throws an exception if the RDD is empty.
> That is appropriate behavior when the RDD is expected to be non-empty, but when it 
> is not known until runtime whether the RDD is empty, the call to reduce needs to be 
> wrapped in try-catch to be made safely. 
> Example Code
> {code}
> // This RDD may be empty or not
> val rdd: RDD[Int] = originalRdd.filter(_ > 10)
> val reduced: Option[Int] = try {
>   Some(rdd.reduce(_ + _))
> } catch {
>   // if rdd is empty return None.
>   case e:UnsupportedOperationException => None
> }
> {code}
> h2. Improvement idea
> Scala’s List has a reduceOption method, which returns None if the List is empty.
> If RDD had a reduceOption API like Scala’s List, it would become easy to handle the 
> above case.
> Example Code
> {code}
> val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _)
> {code}






[jira] [Commented] (SPARK-8000) SQLContext.read.load() should be able to auto-detect input data

2015-09-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900172#comment-14900172
 ] 

Yanbo Liang commented on SPARK-8000:


[~rxin] I will work on it.
I agree with making Spark SQL write an output metadata file.
But if the data is produced by another framework that does not write such 
metadata, auto-detection will not work well. Is this in line with expectations?
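
For context, a sketch of the current behavior under discussion (nothing here is new 
API; the paths are placeholders): without auto-detection or a metadata file, the 
caller has to name the format explicitly.

{code}
// Current behavior, shown for context; the paths are placeholders.
val parquetDf = sqlContext.read.format("parquet").load("/data/events_parquet")
val jsonDf    = sqlContext.read.format("json").load("/data/events_json")
{code}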

> SQLContext.read.load() should be able to auto-detect input data
> ---
>
> Key: SPARK-8000
> URL: https://issues.apache.org/jira/browse/SPARK-8000
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> If it is a parquet file, use parquet. If it is a JSON file, use JSON. If it 
> is an ORC file, use ORC. If it is a CSV file, use CSV.
> Maybe Spark SQL can also write an output metadata file to specify the schema 
> & data source that's used.






[jira] [Commented] (SPARK-10602) Univariate statistics as UDAFs: single-pass continuous stats

2015-09-20 Thread Sabyasachi Nayak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900166#comment-14900166
 ] 

Sabyasachi Nayak commented on SPARK-10602:
--

Hey Seth, can you please share the working versions of the single-pass algorithms 
for skewness and kurtosis that you have?

Thanks,
Sabs

> Univariate statistics as UDAFs: single-pass continuous stats
> 
>
> Key: SPARK-10602
> URL: https://issues.apache.org/jira/browse/SPARK-10602
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>
> See parent JIRA for more details.  This subtask covers statistics for 
> continuous values requiring a single pass over the data, such as min and max.
> This JIRA is an umbrella.  For individual stats, please create and link a new 
> JIRA.






[jira] [Created] (SPARK-10723) Add RDD.reduceOption method

2015-09-20 Thread Tatsuya Atsumi (JIRA)
Tatsuya Atsumi created SPARK-10723:
--

 Summary: Add RDD.reduceOption method
 Key: SPARK-10723
 URL: https://issues.apache.org/jira/browse/SPARK-10723
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Reporter: Tatsuya Atsumi
Priority: Minor


h2. Problem
RDD.reduce throws an exception if the RDD is empty.
That is appropriate behavior when the RDD is expected to be non-empty, but when it is 
not known until runtime whether the RDD is empty, the call to reduce needs to be 
wrapped in try-catch to be made safely. 

Example Code
{code}
// This RDD may be empty or not
val rdd: RDD[Int] = originalRdd.filter(_ > 10)

val reduced: Option[Int] = try {
  Some(rdd.reduce(_ + _))
} catch {
  // if rdd is empty return None.
  case e:UnsupportedOperationException => None
}
{code}

h2. Improvement idea
Scala’s List has a reduceOption method, which returns None if the List is empty.
If RDD had a reduceOption API like Scala’s List, it would become easy to handle the 
above case.

Example Code
{code}
val reduced: Option[Int] = originalRdd.filter(_ > 10).reduceOption(_ + _)
{code}
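
A minimal sketch of what such an API could look like, written as an enrichment class 
(this is an illustration only, not the implementation in any pull request):

{code}
// Illustrative sketch only, not an actual implementation proposal.
import org.apache.spark.rdd.RDD

object ReduceOptionSyntax {
  implicit class RichRDD[T](rdd: RDD[T]) {
    // Returns None for an empty RDD instead of throwing UnsupportedOperationException.
    def reduceOption(f: (T, T) => T): Option[T] =
      if (rdd.isEmpty()) None else Some(rdd.reduce(f))
  }
}
{code}

Note that calling isEmpty() launches an extra job, so this only sketches the behavior, 
not an optimized implementation.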






[jira] [Created] (SPARK-10722) Uncaught exception: RDDBlockId not found in driver-heartbeater

2015-09-20 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-10722:
---

 Summary: Uncaught exception: RDDBlockId not found in 
driver-heartbeater
 Key: SPARK-10722
 URL: https://issues.apache.org/jira/browse/SPARK-10722
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0, 1.4.1, 1.3.1
Reporter: Simeon Simeonov


Some operations involving cached RDDs generate an uncaught exception in 
driver-heartbeater. If the {{.cache()}} call is removed, processing happens 
without the exception. However, not all RDDs trigger the problem, i.e., some 
{{.cache()}} operations are fine. 

I can see the problem with 1.4.1 and 1.5.0 but I have not been able to create a 
reproducible test case. The same exception is [reported on 
SO|http://stackoverflow.com/questions/31280355/spark-test-on-local-machine] for 
v1.3.1 but the behavior is related to large broadcast variables.

The full stack trace is:

{code}
15/09/20 22:10:08 ERROR Utils: Uncaught exception in thread driver-heartbeater
java.io.IOException: java.lang.ClassNotFoundException: 
org.apache.spark.storage.RDDBlockId
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
  at org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at org.apache.spark.util.Utils$.deserialize(Utils.scala:91)
  at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:440)
  at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:430)
  at scala.Option.foreach(Option.scala:236)
  at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:430)
  at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:428)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:428)
  at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:472)
  at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
  at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
  at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:472)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
  at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
  at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:270)
  at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:625)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.defaultRead

[jira] [Comment Edited] (SPARK-3255) Faster algorithms for logistic regression

2015-09-20 Thread Henry Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900114#comment-14900114
 ] 

Henry Lin edited comment on SPARK-3255 at 9/20/15 11:41 PM:


This paper has some insight:
http://jmlr.csail.mit.edu/proceedings/papers/v28/gopal13.pdf
The paper experiments with different optimization methods 
for distributed training, and determines that a Log-concavity bound, discovered 
by David Blei and John Lafferty (2006), scales the best on large datasets.


was (Author: hlin117):
This paper [here|http://jmlr.csail.mit.edu/proceedings/papers/v28/gopal13.pdf] 
has some insight. The paper experiments with different optimization methods for 
distributed training, and determines that a Log-concavity bound, discovered by 
David Blei and John Lafferty (2006), scales the best on large datasets.

> Faster algorithms for logistic regression
> -
>
> Key: SPARK-3255
> URL: https://issues.apache.org/jira/browse/SPARK-3255
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> Logistic regression is perhaps the most widely used classification algorithm 
> in industry. We are looking for faster and scalable algorithms for MLlib. We 
> currently have LogisticRegressionWithLBFGS, and the LIBLINEAR group 
> implemented Spark LIBLINEAR: 
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/spark/running_spark_liblinear.html
> Welcome to join the discussion and add more candidates.
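
For comparison purposes, a minimal use of the existing LogisticRegressionWithLBFGS 
baseline mentioned above (the data path is a placeholder):

{code}
// Existing MLlib baseline, shown for comparison; the data path is a placeholder.
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

val training = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
{code}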






[jira] [Commented] (SPARK-3255) Faster algorithms for logistic regression

2015-09-20 Thread Henry Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900114#comment-14900114
 ] 

Henry Lin commented on SPARK-3255:
--

This paper [here|http://jmlr.csail.mit.edu/proceedings/papers/v28/gopal13.pdf] 
has some insight. The paper experiments with different optimization methods for 
distributed training, and determines that a Log-concavity bound, discovered by 
David Blei and John Lafferty (2006), scales the best on large datasets.

> Faster algorithms for logistic regression
> -
>
> Key: SPARK-3255
> URL: https://issues.apache.org/jira/browse/SPARK-3255
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> Logistic regression is perhaps the most widely used classification algorithm 
> in industry. We are looking for faster and scalable algorithms for MLlib. We 
> currently have LogisticRegressionWithLBFGS, and the LIBLINEAR group 
> implemented Spark LIBLINEAR: 
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/distributed-liblinear/spark/running_spark_liblinear.html
> Welcome to join the discussion and add more candidates.






[jira] [Updated] (SPARK-9681) Support R feature interactions in RFormula

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9681:
-
Shepherd: Xiangrui Meng

> Support R feature interactions in RFormula
> --
>
> Key: SPARK-9681
> URL: https://issues.apache.org/jira/browse/SPARK-9681
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Eric Liang
>Assignee: Eric Liang
>
> Support the interaction (":") operator RFormula feature transformer, so that 
> it is available for use in SparkR's glm.
> Umbrella design doc for RFormula integration: 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?pli=1
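
A small sketch of the target syntax (the column names are hypothetical); the ":" term 
is what RFormula would need to recognize:

{code}
// Target formula syntax; the column names are hypothetical.
import org.apache.spark.ml.feature.RFormula

val formula = new RFormula()
  .setFormula("label ~ age + country + age:country")
  .setFeaturesCol("features")
  .setLabelCol("label")
{code}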






[jira] [Updated] (SPARK-9681) Support R feature interactions in RFormula

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9681:
-
Assignee: Eric Liang

> Support R feature interactions in RFormula
> --
>
> Key: SPARK-9681
> URL: https://issues.apache.org/jira/browse/SPARK-9681
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Eric Liang
>Assignee: Eric Liang
>
> Support the interaction (":") operator RFormula feature transformer, so that 
> it is available for use in SparkR's glm.
> Umbrella design doc for RFormula integration: 
> https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit?pli=1






[jira] [Updated] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10631:
--
Shepherd: Xiangrui Meng

> Add missing API doc in pyspark.mllib.linalg.Vector
> --
>
> Key: SPARK-10631
> URL: https://issues.apache.org/jira/browse/SPARK-10631
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Vinod KC
>Priority: Minor
>
> There are some missing API docs in pyspark.mllib.linalg.Vector (including 
> DenseVector and SparseVector). We should add them based on their Scala 
> counterparts.






[jira] [Resolved] (SPARK-10715) Duplicate initialzation flag in WeightedLeastSquare

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10715.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8837
[https://github.com/apache/spark/pull/8837]

> Duplicate initialzation flag in WeightedLeastSquare
> ---
>
> Key: SPARK-10715
> URL: https://issues.apache.org/jira/browse/SPARK-10715
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Trivial
> Fix For: 1.6.0
>
>
> There is a duplicate setting of the initialization flag in 
> {{WeightedLeastSquares#add}}.






[jira] [Updated] (SPARK-10715) Duplicate initialzation flag in WeightedLeastSquare

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10715:
--
Assignee: Kai Sasaki

> Duplicate initialzation flag in WeightedLeastSquare
> ---
>
> Key: SPARK-10715
> URL: https://issues.apache.org/jira/browse/SPARK-10715
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Assignee: Kai Sasaki
>Priority: Trivial
> Fix For: 1.6.0
>
>
> There is a duplicate setting of the initialization flag in 
> {{WeightedLeastSquares#add}}.






[jira] [Updated] (SPARK-10715) Duplicate initialzation flag in WeightedLeastSquare

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10715:
--
Target Version/s: 1.6.0

> Duplicate initialzation flag in WeightedLeastSquare
> ---
>
> Key: SPARK-10715
> URL: https://issues.apache.org/jira/browse/SPARK-10715
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Assignee: Kai Sasaki
>Priority: Trivial
> Fix For: 1.6.0
>
>
> There is a duplicate setting of the initialization flag in 
> {{WeightedLeastSquares#add}}.






[jira] [Updated] (SPARK-10686) Add quantileCol to AFTSurvivalRegression

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10686:
--
Shepherd: Xiangrui Meng

> Add quantileCol to AFTSurvivalRegression
> 
>
> Key: SPARK-10686
> URL: https://issues.apache.org/jira/browse/SPARK-10686
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> By default `quantileCol` should be empty. If both `quantileProbabilities` and 
> `quantileCol` are set, we should append quantiles as a new column (of type 
> Vector).






[jira] [Resolved] (SPARK-5905) Note requirements for certain RowMatrix methods in docs

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5905.
--
Resolution: Fixed

> Note requirements for certain RowMatrix methods in docs
> ---
>
> Key: SPARK-5905
> URL: https://issues.apache.org/jira/browse/SPARK-5905
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 1.6.0
>
>
> From mbofb's comment in PR https://github.com/apache/spark/pull/4680:
> {code}
> The description of RowMatrix.computeSVD and 
> mllib-dimensionality-reduction.html should be more precise/explicit regarding 
> the m x n matrix. In the current description I would conclude that n refers 
> to the rows. According to 
> http://math.stackexchange.com/questions/191711/how-many-rows-and-columns-are-in-an-m-x-n-matrix
>  this way of describing a matrix is only used in particular domains. I as a 
> reader interested in applying SVD would rather prefer the more common m x n 
> way of rows x columns (e.g. 
> http://en.wikipedia.org/wiki/Matrix_%28mathematics%29 ) which is also used in 
> http://en.wikipedia.org/wiki/Latent_semantic_analysis (and also within the 
> ARPACK manual:
> “
> N Integer. (INPUT) - Dimension of the eigenproblem. 
> NEV Integer. (INPUT) - Number of eigenvalues of OP to be computed. 0 < NEV < 
> N. 
> NCV Integer. (INPUT) - Number of columns of the matrix V (less than or equal 
> to N).
> “
> ).
> description of RowMatrix.computeSVD and mllib-dimensionality-reduction.html:
> "We assume n is smaller than m." Is this just a recommendation or a hard 
> requirement. This condition seems not to be checked and causing an 
> IllegalArgumentException – the processing finishes even though the vectors 
> have a higher dimension than the number of vectors.
> description of RowMatrix. computePrincipalComponents or RowMatrix in general:
> I got a Exception.
> java.lang.IllegalArgumentException: Argument with more than 65535 cols: 
> 7949273
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.checkNumColumns(RowMatrix.scala:131)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:318)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:373)
> This 65535 cols restriction would be nice to be written in the doc (if this 
> still applies in 1.3).
> {code}
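
For context, a minimal RowMatrix example under the rows-by-columns reading requested 
above (standard MLlib API; the data is a stand-in):

{code}
// Standard MLlib API, shown for context; the data is a stand-in.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)))              // m = 2 rows, n = 3 columns
val mat = new RowMatrix(rows)
val svd = mat.computeSVD(k = 2, computeU = true)
{code}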






[jira] [Updated] (SPARK-5905) Note requirements for certain RowMatrix methods in docs

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5905:
-
Target Version/s: 1.6.0
   Fix Version/s: 1.6.0

> Note requirements for certain RowMatrix methods in docs
> ---
>
> Key: SPARK-5905
> URL: https://issues.apache.org/jira/browse/SPARK-5905
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 1.6.0
>
>
> From mbofb's comment in PR https://github.com/apache/spark/pull/4680:
> {code}
> The description of RowMatrix.computeSVD and 
> mllib-dimensionality-reduction.html should be more precise/explicit regarding 
> the m x n matrix. In the current description I would conclude that n refers 
> to the rows. According to 
> http://math.stackexchange.com/questions/191711/how-many-rows-and-columns-are-in-an-m-x-n-matrix
>  this way of describing a matrix is only used in particular domains. I as a 
> reader interested in applying SVD would rather prefer the more common m x n 
> way of rows x columns (e.g. 
> http://en.wikipedia.org/wiki/Matrix_%28mathematics%29 ) which is also used in 
> http://en.wikipedia.org/wiki/Latent_semantic_analysis (and also within the 
> ARPACK manual:
> “
> N Integer. (INPUT) - Dimension of the eigenproblem. 
> NEV Integer. (INPUT) - Number of eigenvalues of OP to be computed. 0 < NEV < 
> N. 
> NCV Integer. (INPUT) - Number of columns of the matrix V (less than or equal 
> to N).
> “
> ).
> description of RowMatrix.computeSVD and mllib-dimensionality-reduction.html:
> "We assume n is smaller than m." Is this just a recommendation or a hard 
> requirement. This condition seems not to be checked and causing an 
> IllegalArgumentException – the processing finishes even though the vectors 
> have a higher dimension than the number of vectors.
> description of RowMatrix. computePrincipalComponents or RowMatrix in general:
> I got a Exception.
> java.lang.IllegalArgumentException: Argument with more than 65535 cols: 
> 7949273
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.checkNumColumns(RowMatrix.scala:131)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:318)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:373)
> This 65535 cols restriction would be nice to be written in the doc (if this 
> still applies in 1.3).
> {code}






[jira] [Updated] (SPARK-5905) Note requirements for certain RowMatrix methods in docs

2015-09-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5905:
-
Assignee: Sean Owen

> Note requirements for certain RowMatrix methods in docs
> ---
>
> Key: SPARK-5905
> URL: https://issues.apache.org/jira/browse/SPARK-5905
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Sean Owen
>Priority: Trivial
>
> From mbofb's comment in PR https://github.com/apache/spark/pull/4680:
> {code}
> The description of RowMatrix.computeSVD and 
> mllib-dimensionality-reduction.html should be more precise/explicit regarding 
> the m x n matrix. In the current description I would conclude that n refers 
> to the rows. According to 
> http://math.stackexchange.com/questions/191711/how-many-rows-and-columns-are-in-an-m-x-n-matrix
>  this way of describing a matrix is only used in particular domains. I as a 
> reader interested in applying SVD would rather prefer the more common m x n 
> way of rows x columns (e.g. 
> http://en.wikipedia.org/wiki/Matrix_%28mathematics%29 ) which is also used in 
> http://en.wikipedia.org/wiki/Latent_semantic_analysis (and also within the 
> ARPACK manual:
> “
> N Integer. (INPUT) - Dimension of the eigenproblem. 
> NEV Integer. (INPUT) - Number of eigenvalues of OP to be computed. 0 < NEV < 
> N. 
> NCV Integer. (INPUT) - Number of columns of the matrix V (less than or equal 
> to N).
> “
> ).
> description of RowMatrix.computeSVD and mllib-dimensionality-reduction.html:
> "We assume n is smaller than m." Is this just a recommendation or a hard 
> requirement. This condition seems not to be checked and causing an 
> IllegalArgumentException – the processing finishes even though the vectors 
> have a higher dimension than the number of vectors.
> description of RowMatrix. computePrincipalComponents or RowMatrix in general:
> I got a Exception.
> java.lang.IllegalArgumentException: Argument with more than 65535 cols: 
> 7949273
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.checkNumColumns(RowMatrix.scala:131)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:318)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:373)
> This 65535 cols restriction would be nice to be written in the doc (if this 
> still applies in 1.3).
> {code}






[jira] [Assigned] (SPARK-10721) Log warning when file deletion fails

2015-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10721:


Assignee: Apache Spark

> Log warning when file deletion fails
> 
>
> Key: SPARK-10721
> URL: https://issues.apache.org/jira/browse/SPARK-10721
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Minor
>
> There are several places in the code base where the return value from 
> File.delete() is ignored.
> This issue adds a check for the boolean return value and logs a warning when 
> deletion fails.






[jira] [Assigned] (SPARK-10721) Log warning when file deletion fails

2015-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10721:


Assignee: (was: Apache Spark)

> Log warning when file deletion fails
> 
>
> Key: SPARK-10721
> URL: https://issues.apache.org/jira/browse/SPARK-10721
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Minor
>
> There are several places in the code base where the return value from 
> File.delete() is ignored.
> This issue adds a check for the boolean return value and logs a warning when 
> deletion fails.






[jira] [Commented] (SPARK-10721) Log warning when file deletion fails

2015-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900098#comment-14900098
 ] 

Apache Spark commented on SPARK-10721:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/8843

> Log warning when file deletion fails
> 
>
> Key: SPARK-10721
> URL: https://issues.apache.org/jira/browse/SPARK-10721
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Minor
>
> There are several places in the code base where the return value from 
> File.delete() is ignored.
> This issue adds a check for the boolean return value and logs a warning when 
> deletion fails.






[jira] [Created] (SPARK-10721) Log warning when file deletion fails

2015-09-20 Thread Ted Yu (JIRA)
Ted Yu created SPARK-10721:
--

 Summary: Log warning when file deletion fails
 Key: SPARK-10721
 URL: https://issues.apache.org/jira/browse/SPARK-10721
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu
Priority: Minor


There are several places in the code base where the return value from File.delete() 
is ignored.

This issue adds a check for the boolean return value and logs a warning when 
deletion fails.
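
A sketch of the kind of check being described, assuming an slf4j logger (illustrative 
only; this is not the code in the linked pull request):

{code}
// Illustrative helper only, not the code in the pull request.
import java.io.File
import org.slf4j.LoggerFactory

object DeleteWithWarning {
  private val log = LoggerFactory.getLogger(getClass)

  def delete(file: File): Boolean = {
    val deleted = file.delete()
    if (!deleted) {
      // Surface the failure instead of silently dropping the boolean result.
      log.warn(s"Failed to delete file: ${file.getAbsolutePath}")
    }
    deleted
  }
}
{code}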






[jira] [Commented] (SPARK-9852) Let reduce tasks fetch multiple map output partitions

2015-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900091#comment-14900091
 ] 

Apache Spark commented on SPARK-9852:
-

User 'mateiz' has created a pull request for this issue:
https://github.com/apache/spark/pull/8844

> Let reduce tasks fetch multiple map output partitions
> -
>
> Key: SPARK-9852
> URL: https://issues.apache.org/jira/browse/SPARK-9852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>







[jira] [Updated] (SPARK-9852) Let reduce tasks fetch multiple map output partitions

2015-09-20 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-9852:
-
Summary: Let reduce tasks fetch multiple map output partitions  (was: Let 
HashShuffleFetcher fetch multiple map output partitions)

> Let reduce tasks fetch multiple map output partitions
> -
>
> Key: SPARK-9852
> URL: https://issues.apache.org/jira/browse/SPARK-9852
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
>







[jira] [Updated] (SPARK-10718) Check License should not verify conf files for license

2015-09-20 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-10718:

Priority: Minor  (was: Major)

> Check License should not verify conf files for license
> --
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>Priority: Minor
>
> Check License should not verify conf files for license
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}






[jira] [Updated] (SPARK-10297) When save data to a data source table, we should bound the size of a saved file

2015-09-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10297:
-
Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

> When save data to a data source table, we should bound the size of a saved 
> file
> ---
>
> Key: SPARK-10297
> URL: https://issues.apache.org/jira/browse/SPARK-10297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> When we save a table to a data source table, it is possible that a single writer is 
> responsible for writing out a large number of rows, which can make the 
> generated file very large and cause the job to fail if the underlying storage 
> system has a maximum file size limit (e.g. S3's limit is 5GB). We should bound 
> the size of a file generated by a writer and create new writers for the same 
> partition if necessary. 
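
An illustrative sketch of the rollover idea described above (the names and structure 
are assumptions, not Spark internals): close the current file and open a new one for 
the same partition once a byte budget is exceeded.

{code}
// Illustrative sketch only; names and structure are assumptions, not Spark internals.
import java.io.{File, FileOutputStream, OutputStream}

class RollingWriter(dir: File, maxBytes: Long) {
  private var index = -1
  private var written = 0L
  private var out: OutputStream = openNext()

  private def openNext(): OutputStream = {
    index += 1
    written = 0L
    new FileOutputStream(new File(dir, f"part-$index%05d"))
  }

  def write(record: Array[Byte]): Unit = {
    if (written > 0 && written + record.length > maxBytes) {
      out.close()
      out = openNext()        // start a new file for the same partition
    }
    out.write(record)
    written += record.length
  }

  def close(): Unit = out.close()
}
{code}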






[jira] [Updated] (SPARK-10297) When save data to a data source table, we should bound the size of a saved file

2015-09-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10297:
-
Target Version/s: 1.6.0  (was: 1.6.0, 1.5.1)

> When save data to a data source table, we should bound the size of a saved 
> file
> ---
>
> Key: SPARK-10297
> URL: https://issues.apache.org/jira/browse/SPARK-10297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> When we save a table to a data source table, it is possible that a single writer is 
> responsible for writing out a large number of rows, which can make the 
> generated file very large and cause the job to fail if the underlying storage 
> system has a maximum file size limit (e.g. S3's limit is 5GB). We should bound 
> the size of a file generated by a writer and create new writers for the same 
> partition if necessary. 






[jira] [Commented] (SPARK-10517) Console "Output" field is empty when using DataFrameWriter.json

2015-09-20 Thread Hans van den Bogert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900038#comment-14900038
 ] 

Hans van den Bogert commented on SPARK-10517:
-

I never see the output size. I've tried writing the output to local disk and to 
HDFS. Do you see values when using non-JSON formats?

> Console "Output" field is empty when using DataFrameWriter.json
> ---
>
> Key: SPARK-10517
> URL: https://issues.apache.org/jira/browse/SPARK-10517
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> On the HTTP application UI the "Output" field is empty when using 
> DataFrameWriter.json.
> It should be the size of the written bytes.
> Screenshot attached.






[jira] [Updated] (SPARK-10588) Saving a DataFrame containing only nulls to JSON doesn't work

2015-09-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10588:

Target Version/s: 1.6.0  (was: 1.6.0, 1.5.1)

> Saving a DataFrame containing only nulls to JSON doesn't work
> -
>
> Key: SPARK-10588
> URL: https://issues.apache.org/jira/browse/SPARK-10588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Priority: Minor
>
> Snippets to reproduce this issue:
> {noformat}
> val path = "file:///tmp/spark/null"
> // A single row containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // Two rows each containing a single null double, saving to JSON, wrong
> sqlContext.
>   range(2).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ||
> ++
> // A single row containing two null doubles, saving to JSON, wrong
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0", "CAST(NULL AS DOUBLE) AS 
> c1").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> ||
> ++
> ||
> ++
> // A single row containing a single null double, saving to Parquet, OK
> sqlContext.
>   range(1).selectExpr("CAST(NULL AS DOUBLE) AS c0").
>   write.mode("overwrite").parquet(path)
> sqlContext.read.parquet(path).show()
> ++
> |   d|
> ++
> |null|
> ++
> // Two rows, one containing a single null double, one containing non-null 
> double, saving to JSON, OK
> sqlContext.
>   range(2).selectExpr("IF(id % 2 = 0, CAST(NULL AS DOUBLE), id) AS c0").
>   write.mode("overwrite").json(path)
> sqlContext.read.json(path).show()
> ++
> |   d|
> ++
> |null|
> | 1.0|
> ++
> {noformat}
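
One possible workaround sketch (an assumption, not a fix): since the JSON writer drops 
the all-null column, supply the schema explicitly when reading the data back.

{code}
// Workaround sketch (assumption, not a fix): read back with an explicit schema.
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val path = "file:///tmp/spark/null"
val schema = StructType(Seq(StructField("c0", DoubleType, nullable = true)))
sqlContext.read.schema(schema).json(path).show()
{code}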






[jira] [Updated] (SPARK-10337) Views are broken

2015-09-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10337:

Target Version/s: 1.6.0  (was: 1.5.1)

> Views are broken
> 
>
> Key: SPARK-10337
> URL: https://issues.apache.org/jira/browse/SPARK-10337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> I haven't dug into this yet... but it seems like this should work:
> This works:
> {code}
> SELECT * FROM 100milints
> {code}
> This seems to work:
> {code}
> CREATE VIEW testView AS SELECT * FROM 100milints
> {code}
> This fails:
> {code}
> SELECT * FROM testView
> org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given 
> input columns id; line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> 

[jira] [Commented] (SPARK-6484) Ganglia metrics xml reporter doesn't escape correctly

2015-09-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900031#comment-14900031
 ] 

Reynold Xin commented on SPARK-6484:


Is this still a problem?

> Ganglia metrics xml reporter doesn't escape correctly
> -
>
> Key: SPARK-6484
> URL: https://issues.apache.org/jira/browse/SPARK-6484
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Michael Armbrust
>Assignee: Josh Rosen
>Priority: Critical
>
> The following should be escaped:
> {code}
> "   "
> '   '
> <   <
> >   >
> &   &
> {code}
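
For reference, a minimal Scala sketch of the kind of escaping described above 
(illustrative only; this is not the actual Ganglia/metrics reporter code):

{code}
// Minimal sketch: replace the five XML special characters with their entities
// before embedding a metric name in XML. Illustrative only; not the actual
// Ganglia reporter implementation.
object XmlEscapeSketch {
  def escape(s: String): String = s.flatMap {
    case '&'  => "&amp;"
    case '<'  => "&lt;"
    case '>'  => "&gt;"
    case '"'  => "&quot;"
    case '\'' => "&apos;"
    case c    => c.toString
  }

  def main(args: Array[String]): Unit = {
    println(escape("""rate "in" < 'out' & total"""))
    // prints: rate &quot;in&quot; &lt; &apos;out&apos; &amp; total
  }
}
{code}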



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10685) Misaligned data with RDD.zip and DataFrame.withColumn after repartition

2015-09-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10685:

Target Version/s: 1.6.0, 1.5.1  (was: 1.5.1)

> Misaligned data with RDD.zip and DataFrame.withColumn after repartition
> ---
>
> Key: SPARK-10685
> URL: https://issues.apache.org/jira/browse/SPARK-10685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>Priority: Blocker
>
> Here's a weird behavior where {{RDD.zip}} or {{DataFrame.withColumn}} after a 
> {{repartition}} produces "misaligned" data, meaning different column values 
> in the same row aren't matched, as if a zip shuffled the collections before 
> zipping them. It's difficult to reproduce because it's nondeterministic, 
> doesn't occur in local mode, and requires ≥2 workers (≥3 in one case). I was 
> able to repro it using pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), 
> and 1.5.0 (bin-without-hadoop).
> Here's the most similar issue I was able to find. It appears not to have been 
> reproduced before being closed optimistically, and it smells like it could have 
> been the same underlying cause, never fixed:
> - https://issues.apache.org/jira/browse/SPARK-9131
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Brief repro
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> [r for r in df.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> [Row(a=39, b=639), Row(a=139, b=739), Row(a=239, b=839)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> []
> [Row(a=641, b=41), Row(a=741, b=141), Row(a=841, b=241)]
> [Row(a=641, b=1343), Row(a=741, b=1443), Row(a=841, b=1543)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> {code}
> Fail: RDD.zip after DataFrame.repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}
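
As a hedged Scala illustration of an ordering-insensitive alternative (a sketch 
of a workaround, not a fix for the bug above): pairing by an explicit key and a 
join avoids relying on zip-style partition/position alignment after a 
repartition.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: attach a derived value by joining on an explicit key rather
// than by zip's positional alignment. Illustrative workaround only.
object KeyedPairingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("keyed-pairing-sketch"))
    val a = sc.parallelize(0 until 10000).repartition(100)
    val left  = a.map(x => (x, x))   // (key, original value)
    val right = a.map(x => (x, x))   // (key, "derived" value)
    val misaligned = left.join(right).values.filter { case (x, y) => x != y }
    println(misaligned.count())      // 0: pairing is by key, not by position
    sc.stop()
  }
}
{code}
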
> h3. Detailed repro
> Start `pyspark` and run these imports:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StructType, StructField
> {code}
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Ok: withColumn(udf) after DataFrame.repartition(100) after 1 starting 
> partition
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: withColumn(udf) after DataFrame.repartition(100) after 100 starting 
> partitions
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=100))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: withColumn(udf) after DataFrame.repartition(1) after

[jira] [Updated] (SPARK-10640) Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied

2015-09-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10640:

Description: 
I'm seeing an exception from the spark history server trying to read a history 
file:

{code}
scala.MatchError: TaskCommitDenied (of class java.lang.String)
at 
org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775)
at 
org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531)
at 
org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}


  was:
I'm seeing an exception from the spark history server trying to read a history 
file:

scala.MatchError: TaskCommitDenied (of class java.lang.String)
at 
org.apache.spark.util.JsonProtocol$.taskEndReasonFromJson(JsonProtocol.scala:775)
at 
org.apache.spark.util.JsonProtocol$.taskEndFromJson(JsonProtocol.scala:531)
at 
org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:488)
at 
org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289)
at 
org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



> Spark history server fails to parse taskEndReasonFromJson TaskCommitDenied
> --
>
> Key: SPARK-10640
> URL: https://issues.apache.org/jira/browse/SPARK-10640
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Thomas Graves
>Assignee: Andrew Or
>Priority: Blocker
>
> I'm seeing an exception from the spark history server trying to read a 
> history 

[jira] [Updated] (SPARK-10672) We should not fail to create a table If we cannot persist metadata of a data source table to metastore in a Hive compatible way

2015-09-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10672:

Target Version/s: 1.6.0, 1.5.1  (was: 1.5.1)

> We should not fail to create a table If we cannot persist metadata of a data 
> source table to metastore in a Hive compatible way
> ---
>
> Key: SPARK-10672
> URL: https://issues.apache.org/jira/browse/SPARK-10672
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>
> It is possible that Hive has internal restrictions on what kinds of table 
> metadata it accepts (e.g. Hive 0.13 does not support decimal stored in 
> Parquet). If that is the case, we should not fail when we try to store the 
> metadata in a Hive-compatible way; we should just save it in the Spark SQL 
> specific format.
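
A minimal Scala sketch of the proposed fallback behavior (both save* helpers 
below are hypothetical placeholders, not actual Spark or Hive APIs):

{code}
// Illustrative sketch of "try the Hive-compatible path first and fall back to
// the Spark SQL specific format instead of failing table creation". Both
// save* helpers are hypothetical placeholders, not real Spark APIs.
object MetadataPersistSketch {
  private def saveAsHiveCompatible(table: String): Unit =
    throw new UnsupportedOperationException("e.g. Hive 0.13 rejects decimal stored in parquet")

  private def saveAsSparkSqlSpecificFormat(table: String): Unit =
    println(s"saved metadata for $table in the Spark SQL specific format")

  def persistTableMetadata(table: String): Unit =
    try {
      saveAsHiveCompatible(table)
    } catch {
      case e: Exception =>
        println(s"WARN: could not persist $table in a Hive compatible way " +
          s"(${e.getMessage}); falling back to the Spark SQL specific format")
        saveAsSparkSqlSpecificFormat(table)
    }

  def main(args: Array[String]): Unit = persistTableMetadata("t")
}
{code}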



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10544) Serialization of Python namedtuple subclasses in functions / closures is broken

2015-09-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10544:

Fix Version/s: 1.5.1
   1.6.0

> Serialization of Python namedtuple subclasses in functions / closures is 
> broken
> ---
>
> Key: SPARK-10544
> URL: https://issues.apache.org/jira/browse/SPARK-10544
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.6.0, 1.5.1
>
>
> The following example works on Spark 1.4.1 but not in 1.5:
> {code}
> from collections import namedtuple
> Person = namedtuple("Person", "id firstName lastName")
> rdd = sc.parallelize([1]).map(lambda x: Person(1, "Jon", "Doe"))
> rdd.count()
> {code}
> In 1.5, this gives an "AttributeError: 'builtin_function_or_method' object 
> has no attribute '__code__'" error.
> Digging a bit deeper, it seems that the problem is the serialization of the 
> {{Person}} class itself, since serializing _instances_ of the class in the 
> closure seems to work properly:
> {code}
> from collections import namedtuple
> Person = namedtuple("Person", "id firstName lastName")
> jon = Person(1, "Jon", "Doe")
> rdd = sc.parallelize([1]).map(lambda x: jon)
> rdd.count()
> {code}
> It looks like PySpark has unit tests for serializing individual namedtuples 
> with cloudpickle.dumps and for serializing RDDs of namedtuples, but I don't 
> think that we have any tests for namedtuple classes in closures.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10720) Add a java wrapper to create dataframe from a local list of Java Beans.

2015-09-20 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900019#comment-14900019
 ] 

holdenk commented on SPARK-10720:
-

While it's not blocked on this, I'm going to wait for SPARK-10630 to go in 
first, and then I'll give this one a shot.

> Add a java wrapper to create dataframe from a local list of Java Beans.
> ---
>
> Key: SPARK-10720
> URL: https://issues.apache.org/jira/browse/SPARK-10720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: holdenk
>Priority: Minor
>
> Similar to SPARK-10630, it would be nice if Java users didn't have to 
> parallelize their data explicitly (as Scala users can already skip doing). The 
> issue came up in 
> http://stackoverflow.com/questions/32613413/apache-spark-machine-learning-cant-get-estimator-example-to-work?answertab=active#tab-top
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10720) Add a java wrapper to create dataframe from a local list of Java Beans.

2015-09-20 Thread holdenk (JIRA)
holdenk created SPARK-10720:
---

 Summary: Add a java wrapper to create dataframe from a local list 
of Java Beans.
 Key: SPARK-10720
 URL: https://issues.apache.org/jira/browse/SPARK-10720
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: holdenk
Priority: Minor


Similar to SPARK-10630, it would be nice if Java users didn't have to 
parallelize their data explicitly (as Scala users can already skip doing). The 
issue came up in 
http://stackoverflow.com/questions/32613413/apache-spark-machine-learning-cant-get-estimator-example-to-work?answertab=active#tab-top
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10630) createDataFrame from a Java List

2015-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10630:


Assignee: (was: Apache Spark)

> createDataFrame from a Java List
> -
>
> Key: SPARK-10630
> URL: https://issues.apache.org/jira/browse/SPARK-10630
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Minor
>
> It would be nice to support creating a DataFrame directly from a Java List of 
> Row:
> {code}
> def createDataFrame(data: java.util.List[Row], schema: StructType): DataFrame
> {code}
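
A minimal Scala sketch of how such an overload could delegate to the existing 
APIs (illustrative only; not necessarily how the actual pull request implements 
it):

{code}
import java.util.{List => JList}
import scala.collection.JavaConverters._
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Illustrative sketch only: convert the Java list to a Scala Seq and reuse the
// existing RDD-based createDataFrame. Not necessarily the PR's implementation.
object CreateDataFrameFromJavaListSketch {
  def createDataFrame(sqlContext: SQLContext,
                      data: JList[Row],
                      schema: StructType): DataFrame = {
    val rows = data.asScala.toSeq
    sqlContext.createDataFrame(sqlContext.sparkContext.parallelize(rows), schema)
  }
}
{code}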



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10630) createDataFrame from a Java List

2015-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900017#comment-14900017
 ] 

Apache Spark commented on SPARK-10630:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/8779

> createDataFrame from a Java List
> -
>
> Key: SPARK-10630
> URL: https://issues.apache.org/jira/browse/SPARK-10630
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Minor
>
> It would be nice to support creating a DataFrame directly from a Java List of 
> Row:
> {code}
> def createDataFrame(data: java.util.List[Row], schema: StructType): DataFrame
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10630) createDataFrame from a Java List

2015-09-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10630:


Assignee: Apache Spark

> createDataFrame from a Java List
> -
>
> Key: SPARK-10630
> URL: https://issues.apache.org/jira/browse/SPARK-10630
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>
> It would be nice to support creating a DataFrame directly from a Java List of 
> Row:
> {code}
> def createDataFrame(data: java.util.List[Row], schema: StructType): DataFrame
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10718) Check License should not verify conf files for license

2015-09-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1490#comment-1490
 ] 

Sean Owen commented on SPARK-10718:
---

There's no such file in the repo or release though: 
https://github.com/apache/spark/tree/master/conf

This is a check for the release artifacts and doesn't need to be run by 
developers on their own installations, which may indeed contain a bunch of 
other files.

> Check License should not verify conf files for license
> --
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>
> Check License should not verify conf files for license
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10719) SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10

2015-09-20 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1481#comment-1481
 ] 

Shixiong Zhu commented on SPARK-10719:
--

Actually, other places that use {{TypeTag}} as a context bound have this issue 
too, such as {{SQLImplicits.localSeqToDataFrameHolder}} and 
{{SQLContext.createDataFrame}}.

> SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10
> --
>
> Key: SPARK-10719
> URL: https://issues.apache.org/jira/browse/SPARK-10719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
> Environment: Scala 2.10
>Reporter: Shixiong Zhu
>
> Sometimes the following code fails:
> {code}
> val conf = new SparkConf().setAppName("sql-memory-leak")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> (1 to 1000).par.foreach { _ =>
>   sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
> }
> {code}
> The stack trace is 
> {code}
> Exception in thread "main" java.lang.UnsupportedOperationException: tail of 
> empty list
>   at scala.collection.immutable.Nil$.tail(List.scala:339)
>   at scala.collection.immutable.Nil$.tail(List.scala:334)
>   at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
>   at 
> scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
>   at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
>   at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
>   at 
> scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
>   at 
> scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
>   at 
> scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
>   at 
> scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
>   at SparkApp$$anonfun$main$1.apply$mcJI$sp(SparkApp.scala:16)
>   at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
>   at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
>   at scala.Function1$class.apply$mcVI$sp(Function1.scala:39)
>   at 
> scala.runtime.AbstractFunction1.apply$mcVI$sp(AbstractFunction1.scala:12)
>   at 
> scala.collection.parallel.immutable.ParRange$ParRangeIterator.foreach(ParRange.scala:91)
>   at 
> scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:975)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   at 
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>   at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>   at 
> scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:972)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:172)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
>   at 
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>   at 
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> Finally, I found the problem: the code generated by the Scala compiler to find 
> the implicit TypeTag is not thread safe because of an issue in Scala 2.10: 
> https://issues.scala-lang.org/browse/SI-6240
> This issue was fixed in Scala 2.11 but not backported to 2.10.
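
A hedged Scala sketch of a possible mitigation on Scala 2.10 (not an official 
fix): keep the implicit-TypeTag-resolving call out of the concurrently executed 
region, or guard it with a single lock so only one thread triggers the 
reflection at a time.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch of a workaround for the Scala 2.10 TypeTag race (SI-6240): serialize
// only the reflective toDF step behind a lock; everything else still runs in
// parallel. Illustrative mitigation only, not an official fix.
object ToDfWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-typetag-workaround"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val lock = new Object
    (1 to 1000).par.foreach { _ =>
      val df = lock.synchronized {
        sc.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b")
      }
      df.count()
    }
    sc.stop()
  }
}
{code}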



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10719) SQLImplicits.rddToDataFrameHolder is not thread safe when using Scala 2.10

2015-09-20 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-10719:


 Summary: SQLImplicits.rddToDataFrameHolder is not thread safe when 
using Scala 2.10
 Key: SPARK-10719
 URL: https://issues.apache.org/jira/browse/SPARK-10719
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0, 1.4.1, 1.3.1
 Environment: Scala 2.10
Reporter: Shixiong Zhu


Sometimes the following code fails:
{code}
val conf = new SparkConf().setAppName("sql-memory-leak")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
(1 to 1000).par.foreach { _ =>
  sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
}
{code}
The stack trace is 
{code}
Exception in thread "main" java.lang.UnsupportedOperationException: tail of 
empty list
at scala.collection.immutable.Nil$.tail(List.scala:339)
at scala.collection.immutable.Nil$.tail(List.scala:334)
at scala.reflect.internal.SymbolTable.popPhase(SymbolTable.scala:172)
at 
scala.reflect.internal.Symbols$Symbol.unsafeTypeParams(Symbols.scala:1477)
at scala.reflect.internal.Symbols$TypeSymbol.tpe(Symbols.scala:2777)
at scala.reflect.internal.Mirrors$RootsBase.init(Mirrors.scala:235)
at 
scala.reflect.runtime.JavaMirrors$class.createMirror(JavaMirrors.scala:34)
at 
scala.reflect.runtime.JavaMirrors$class.runtimeMirror(JavaMirrors.scala:61)
at 
scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
at 
scala.reflect.runtime.JavaUniverse.runtimeMirror(JavaUniverse.scala:12)
at SparkApp$$anonfun$main$1.apply$mcJI$sp(SparkApp.scala:16)
at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
at SparkApp$$anonfun$main$1.apply(SparkApp.scala:15)
at scala.Function1$class.apply$mcVI$sp(Function1.scala:39)
at 
scala.runtime.AbstractFunction1.apply$mcVI$sp(AbstractFunction1.scala:12)
at 
scala.collection.parallel.immutable.ParRange$ParRangeIterator.foreach(ParRange.scala:91)
at 
scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:975)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
at 
scala.collection.parallel.ParIterableLike$Foreach.tryLeaf(ParIterableLike.scala:972)
at 
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:172)
at 
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
at 
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
at 
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at 
scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

Finally, I found the problem: the code generated by the Scala compiler to find 
the implicit TypeTag is not thread safe because of an issue in Scala 2.10: 
https://issues.scala-lang.org/browse/SI-6240
This issue was fixed in Scala 2.11 but not backported to 2.10.







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10559) DataFrame schema ArrayType should accept ResultIterable

2015-09-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14899952#comment-14899952
 ] 

Maciej Bryński commented on SPARK-10559:


I did some performance tests with both solutions and I didn't find any 
differences, so I can agree with you.

> DataFrame schema ArrayType should accept ResultIterable
> ---
>
> Key: SPARK-10559
> URL: https://issues.apache.org/jira/browse/SPARK-10559
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> If I'm using RDD.groupBy, I'm getting a pyspark.resultiterable.ResultIterable.
> Right now I can't call createDataFrame on it.
> {code}
> TypeError: StructType(...) can not accept object in type <class 'pyspark.resultiterable.ResultIterable'>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4503) The history server is not compatible with HDFS HA

2015-09-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4503.
--
Resolution: Cannot Reproduce

> The history server is not compatible with HDFS HA
> -
>
> Key: SPARK-4503
> URL: https://issues.apache.org/jira/browse/SPARK-4503
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.1.0
>Reporter: MarsXu
>Priority: Minor
> Attachments: historyserver1.png
>
>
>   I use a high-availability (HA) HDFS cluster to store the history server data.
>   The event log can be written to HDFS, but the history server cannot be started.
>   
>   Error log when executing "sbin/start-history-server.sh":
> {quote}
> 
> 14/11/20 10:25:04 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root, ); users 
> with modify permissions: Set(root, )
> 14/11/20 10:25:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Exception in thread "main" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at 
> org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:187)
> at 
> org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
> Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: 
> appcluster
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
> 
> {quote}
> When I set SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://s161.zw.db.d:53310/spark_history"
>  in spark-env.sh, it can start, but there is no high availability.
> Environment
> {quote}
> spark-1.1.0-bin-hadoop2.4
> hadoop-2.5.1
> zookeeper-3.4.6
> {quote}
>   The config files are as follows:
> {quote}
> !### spark-defaults.conf ###
> spark.eventLog.dir                 hdfs://appcluster/history_server/
> spark.yarn.historyServer.address   s161.zw.db.d:18080
> !### spark-env.sh ###
> export 
> SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://appcluster/history_server"
> !### core-site.xml ###
> <property>
>   <name>fs.defaultFS</name>
>   <value>hdfs://appcluster</value>
> </property>
> !### hdfs-site.xml ###
> <property>
>   <name>dfs.nameservices</name>
>   <value>appcluster</value>
> </property>
> <property>
>   <name>dfs.ha.namenodes.appcluster</name>
>   <value>nn1,nn2</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.appcluster.nn1</name>
>   <value>s161.zw.db.d:8020</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.appcluster.nn2</name>
>   <value>s162.zw.db.d:8020</value>
> </property>
> <property>
>   <name>dfs.namenode.servicerpc-address.appcluster.nn1</name>
>   <value>s161.zw.db.d:53310</value>
> </property>
> <property>
>   <name>dfs.namenode.servicerpc-address.appcluster.nn2</name>
>   <value>s162.zw.db.d:53310</value>
> </property>
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10141) Number of tasks on executors still become negative after failures

2015-09-20 Thread Ohad Zadok (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877454#comment-14877454
 ] 

Ohad Zadok edited comment on SPARK-10141 at 9/20/15 7:57 AM:
-

Happens to me as well on 1.5.0. I'm running LDA on ~1.5 million documents with 
10 folds. When running with a lower number of folds (5), it sometimes manages to 
complete a full run.

Adding a stack trace:

15/09/18 02:17:33 WARN scheduler.TaskSetManager: Lost task 47.0 in stage 45.0 
(TID 31026, 172.31.15.3): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 79 in 
stage 45.0 failed 4 times, most recent failure: Lost task 79.3 in stage 45.0 
(TID 31021, 172.31.15.8): java.io.IOException: Failed to connect to 
/172.31.15.8:53373
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /172.31.15.8:53373
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1933)
at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1059)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.fold(RDD.scala:1053)
at 
org.apache.spark.mllib.clustering.EMLDAOptimizer.computeGlobalTopicTotals(LDAOptimizer.scala:205)
at 
org.apache.spark.
