[jira] [Commented] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-13 Thread Alberto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491993#comment-14491993
 ] 

Alberto commented on SPARK-4783:


Does this mean that you are going to create a PR with a fix/change proposal 
for this, or are you asking someone else to create that PR? If so, I am 
willing to create it.

> System.exit() calls in SparkContext disrupt applications embedding Spark
> 
>
> Key: SPARK-4783
> URL: https://issues.apache.org/jira/browse/SPARK-4783
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: David Semeria
>
> A common architectural choice for integrating Spark within a larger 
> application is to employ a gateway to handle Spark jobs. The gateway is a 
> server which contains one or more long-running SparkContexts.
> A typical server is created with the following pseudocode:
> var continue = true
> while (continue) {
>   try {
>     server.run()
>   } catch (e) {
>     continue = log_and_examine_error(e)
>   }
> }
> The problem is that SparkContext frequently calls System.exit when it 
> encounters a problem, which means the server can only be re-spawned at the 
> process level. That is much messier than the simple loop above.
> Therefore, I believe it makes sense to replace all System.exit calls in 
> SparkContext with the throwing of a fatal error. 
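
As a rough illustration of that proposal, the sketch below surfaces a catchable exception instead of killing the JVM; the exception name and the gateway loop are assumptions for illustration, not the actual patch.

{code}
// Illustrative sketch only: raise a catchable fatal error instead of calling System.exit
class SparkFatalException(message: String) extends RuntimeException(message)

object GatewayExample {
  // Stand-in for the gateway's long-running server loop
  def runServer(): Unit = throw new SparkFatalException("simulated fatal SparkContext error")

  def main(args: Array[String]): Unit = {
    var keepRunning = true
    while (keepRunning) {
      try {
        runServer()
      } catch {
        case e: SparkFatalException =>
          // The embedding application decides whether to respawn or give up
          println(s"Recoverable failure: ${e.getMessage}")
          keepRunning = false
      }
    }
  }
}
{code}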






[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-04-13 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-5888:
-

Assignee: Sandy Ryza

> Add OneHotEncoder as a Transformer
> --
>
> Key: SPARK-5888
> URL: https://issues.apache.org/jira/browse/SPARK-5888
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Sandy Ryza
>
> `OneHotEncoder` takes a categorical column and outputs a vector column, which 
> stores the category info as binary indicator values.
> {code}
> val ohe = new OneHotEncoder()
>   .setInputCol("countryIndex")
>   .setOutputCol("countries")
> {code}
> It should read the category info from the metadata and assign feature names 
> properly in the output column. We need to discuss the default naming scheme 
> and whether we should let it process multiple categorical columns at the same 
> time.
> One category (the most frequent one) should be removed from the output to 
> make the output columns linearly independent, or this could be an option 
> turned on by default.
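
As a plain-Scala sketch of the intended encoding (an illustration of the idea only, not the Spark ML API): a category index becomes a binary indicator vector, with one reference category dropped so the output columns stay linearly independent.

{code}
// Illustration of the encoding itself; categories are indexed 0..numCategories-1
// and the reference category (e.g. the most frequent one) is dropped
def oneHot(index: Int, numCategories: Int, dropIndex: Int): Array[Double] = {
  val vec = Array.fill(numCategories - 1)(0.0)
  if (index != dropIndex) {
    // positions after the dropped category shift down by one
    val pos = if (index < dropIndex) index else index - 1
    vec(pos) = 1.0
  }
  vec
}

println(oneHot(index = 0, numCategories = 3, dropIndex = 0).mkString(","))  // 0.0,0.0 (reference category)
println(oneHot(index = 2, numCategories = 3, dropIndex = 0).mkString(","))  // 0.0,1.0
{code}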






[jira] [Assigned] (SPARK-6877) Add code generation support for Min

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6877:
---

Assignee: Apache Spark

> Add code generation support for Min
> ---
>
> Key: SPARK-6877
> URL: https://issues.apache.org/jira/browse/SPARK-6877
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-6877) Add code generation support for Min

2015-04-13 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-6877:
--

 Summary: Add code generation support for Min
 Key: SPARK-6877
 URL: https://issues.apache.org/jira/browse/SPARK-6877
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Liang-Chi Hsieh









[jira] [Commented] (SPARK-6877) Add code generation support for Min

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492041#comment-14492041
 ] 

Apache Spark commented on SPARK-6877:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5487

> Add code generation support for Min
> ---
>
> Key: SPARK-6877
> URL: https://issues.apache.org/jira/browse/SPARK-6877
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>







[jira] [Assigned] (SPARK-6877) Add code generation support for Min

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6877:
---

Assignee: (was: Apache Spark)

> Add code generation support for Min
> ---
>
> Key: SPARK-6877
> URL: https://issues.apache.org/jira/browse/SPARK-6877
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>







[jira] [Resolved] (SPARK-6849) The constructor of GradientDescent should be public

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6849.
--
Resolution: Duplicate

Yes, I think this is a subset of "opening up optimization APIs".

> The constructor of GradientDescent should be public
> ---
>
> Key: SPARK-6849
> URL: https://issues.apache.org/jira/browse/SPARK-6849
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Guoqiang Li
>Priority: Trivial
>







[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492126#comment-14492126
 ] 

Sean Owen commented on SPARK-6847:
--

Can you provide (the top part of) the stack overflow trace, so we can see where 
it's occurring? I think something is building a very long object graph, but 
getting the trace is the first step to confirm that.

> Stack overflow on updateStateByKey which followed by a dstream with 
> checkpoint set
> --
>
> Key: SPARK-6847
> URL: https://issues.apache.org/jira/browse/SPARK-6847
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Jack Hu
>  Labels: StackOverflowError, Streaming
>
> The issue happens with the following sample code, which uses 
> {{updateStateByKey}} followed by a {{map}} with a checkpoint interval of 10 
> seconds:
> {code}
> val sparkConf = new SparkConf().setAppName("test")
> val streamingContext = new StreamingContext(sparkConf, Seconds(10))
> streamingContext.checkpoint("""checkpoint""")
> val source = streamingContext.socketTextStream("localhost", )
> val updatedResult = source.map(
> (1,_)).updateStateByKey(
> (newlist : Seq[String], oldstate : Option[String]) => 
> newlist.headOption.orElse(oldstate))
> updatedResult.map(_._2)
> .checkpoint(Seconds(10))
> .foreachRDD((rdd, t) => {
>   println("Deep: " + rdd.toDebugString.split("\n").length)
>   println(t.toString() + ": " + rdd.collect.length)
> })
> streamingContext.start()
> streamingContext.awaitTermination()
> {code}
> From the output, we can see that the dependency chain keeps growing over 
> time, the {{updateStateByKey}} stream never gets checkpointed, and eventually 
> the stack overflow happens.
> Note:
> * The RDD in {{updatedResult.map(_._2)}} gets checkpointed in this case, but 
> not the {{updateStateByKey}} stream.
> * If the {{checkpoint(Seconds(10))}} is removed from the map result 
> ({{updatedResult.map(_._2)}}), the stack overflow does not happen.
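
Building on the snippet above, a possible workaround (a sketch based on the diagnosis in this report, not a confirmed fix) is to checkpoint the stateful stream itself so that its lineage is truncated:

{code}
// Sketch: checkpoint the updateStateByKey stream directly, not only the downstream map result
val updatedResult = source.map((1, _)).updateStateByKey(
  (newlist: Seq[String], oldstate: Option[String]) => newlist.headOption.orElse(oldstate))
updatedResult.checkpoint(Seconds(10))   // truncate the state stream's lineage

updatedResult.map(_._2).foreachRDD { (rdd, t) =>
  println("Deep: " + rdd.toDebugString.split("\n").length)
  println(t.toString + ": " + rdd.collect().length)
}
{code}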






[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492215#comment-14492215
 ] 

Sean Owen commented on SPARK-1529:
--

(Sorry if this double-posts.)

Is there a good way to see the whole diff at the moment? I know there's a 
branch with individual commits. Maybe I am missing something basic.

This puts a new abstraction on top of a Hadoop FileSystem, on top of the 
underlying file system abstraction. That's getting heavy. If it's only 
abstracting access to an InputStream / OutputStream, why is it needed? That's 
already directly available from, say, Hadoop's FileSystem.

What would be the performance gain if this is the bit being swapped out? This 
is my original question: you shuffle to HDFS, then read it back to send it 
again via the existing shuffle? It kind of made sense when the idea was to 
swap out the whole shuffle and replace its transport.

> Support setting spark.local.dirs to a hadoop FileSystem 
> 
>
> Key: SPARK-1529
> URL: https://issues.apache.org/jira/browse/SPARK-1529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Kannan Rajah
> Attachments: Spark Shuffle using HDFS.pdf
>
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. We should allow setting spark.local.dir to a 
> Hadoop filesystem location. 






[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-04-13 Thread Yajun Dong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492246#comment-14492246
 ] 

Yajun Dong commented on SPARK-5281:
---

I also have this issue with Eclipse Luna and Spark 1.3.0. Any ideas?

> Registering table on RDD is giving MissingRequirementError
> --
>
> Key: SPARK-5281
> URL: https://issues.apache.org/jira/browse/SPARK-5281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: sarsol
>Priority: Critical
>
> The application crashes on the line {{rdd.registerTempTable("temp")}} in 
> version 1.2 when using sbt or the Eclipse Scala IDE.
> Stacktrace:
> {code}
> Exception in thread "main" scala.reflect.internal.MissingRequirementError: 
> class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
> primordial classloader with boot classpath 
> [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
>  Files\Java\jre7\lib\resources.jar;C:\Program 
> Files\Java\jre7\lib\rt.jar;C:\Program 
> Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
> Files\Java\jre7\lib\jsse.jar;C:\Program 
> Files\Java\jre7\lib\jce.jar;C:\Program 
> Files\Java\jre7\lib\charsets.jar;C:\Program 
> Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
>   at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
>   at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
>   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
>   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
>   at 
> com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
>   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>   at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>   at scala.App$class.main(App.scala:71)
> {code}






[jira] [Assigned] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6800:
---

Assignee: Apache Spark

> Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions 
> gives incorrect results.
> --
>
> Key: SPARK-6800
> URL: https://issues.apache.org/jira/browse/SPARK-6800
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, 
> Scala 2.10
>Reporter: Micael Capitão
>Assignee: Apache Spark
>
> Having a Derby table with people info (id, name, age) defined like this:
> {code}
> val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
> val conn = DriverManager.getConnection(jdbcUrl)
> val stmt = conn.createStatement()
> stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS 
> IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
> {code}
> If I try to read that table from Spark SQL with lower/upper bounds, like this:
> {code}
> val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
>   columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
> people.show()
> {code}
> I get this result:
> {noformat}
> PERSON_ID NAME AGE
> 3 Ana Rita Costa   12 
> 5 Miguel Costa 15 
> 6 Anabela Sintra   13 
> 2 Lurdes Pereira   23 
> 4 Armando Pereira  32 
> 1 Armando Carvalho 50 
> {noformat}
> Which is wrong, considering the defined upper bound has been ignored (I get a 
> person with age 50!).
> Digging the code, I've found that in {{JDBCRelation.columnPartition}} the 
> WHERE clauses it generates are the following:
> {code}
> (0) age < 4,0
> (1) age >= 4  AND age < 8,1
> (2) age >= 8  AND age < 12,2
> (3) age >= 12 AND age < 16,3
> (4) age >= 16 AND age < 20,4
> (5) age >= 20 AND age < 24,5
> (6) age >= 24 AND age < 28,6
> (7) age >= 28 AND age < 32,7
> (8) age >= 32 AND age < 36,8
> (9) age >= 36,9
> {code}
> The last condition ignores the upper bound and the other ones may result in 
> repeated rows being read.
> Using the JdbcRDD (and converting it to a DataFrame) I would have something 
> like this:
> {code}
> val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
>   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
>   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
> val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
> people.show()
> {code}
> Resulting in:
> {noformat}
> PERSON_ID NAMEAGE
> 3 Ana Rita Costa  12 
> 5 Miguel Costa15 
> 6 Anabela Sintra  13 
> 2 Lurdes Pereira  23 
> 4 Armando Pereira 32 
> {noformat}
> Which is correct!
> Confirming the WHERE clauses generated by the JdbcRDD in the 
> {{getPartitions}} I've found it generates the following:
> {code}
> (0) age >= 0  AND age <= 3
> (1) age >= 4  AND age <= 7
> (2) age >= 8  AND age <= 11
> (3) age >= 12 AND age <= 15
> (4) age >= 16 AND age <= 19
> (5) age >= 20 AND age <= 23
> (6) age >= 24 AND age <= 27
> (7) age >= 28 AND age <= 31
> (8) age >= 32 AND age <= 35
> (9) age >= 36 AND age <= 40
> {code}
> This is the behaviour I was expecting from the Spark SQL version. Is the 
> Spark SQL version buggy or is this some weird expected behaviour?
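
For reference, the clamped partitioning the reporter expected can be sketched like this (an illustration that reproduces the JdbcRDD clauses shown above, not the actual JDBCRelation code):

{code}
// Illustrative helper: generate bound-respecting WHERE clauses for numPartitions ranges
def partitionWhereClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  val stride = (upper - lower + 1) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    val hi = if (i == numPartitions - 1) upper else lo + stride - 1
    s"$column >= $lo AND $column <= $hi"
  }
}

partitionWhereClauses("age", 0, 40, 10).foreach(println)
// age >= 0 AND age <= 3
// age >= 4 AND age <= 7
// ...
// age >= 36 AND age <= 40
{code}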






[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492244#comment-14492244
 ] 

Apache Spark commented on SPARK-6800:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5488

> Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions 
> gives incorrect results.
> --
>
> Key: SPARK-6800
> URL: https://issues.apache.org/jira/browse/SPARK-6800
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, 
> Scala 2.10
>Reporter: Micael Capitão
>
> Having a Derby table with people info (id, name, age) defined like this:
> {code}
> val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
> val conn = DriverManager.getConnection(jdbcUrl)
> val stmt = conn.createStatement()
> stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS 
> IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
> {code}
> If I try to read that table from Spark SQL with lower/upper bounds, like this:
> {code}
> val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
>   columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
> people.show()
> {code}
> I get this result:
> {noformat}
> PERSON_ID NAME AGE
> 3 Ana Rita Costa   12 
> 5 Miguel Costa 15 
> 6 Anabela Sintra   13 
> 2 Lurdes Pereira   23 
> 4 Armando Pereira  32 
> 1 Armando Carvalho 50 
> {noformat}
> Which is wrong, considering the defined upper bound has been ignored (I get a 
> person with age 50!).
> Digging the code, I've found that in {{JDBCRelation.columnPartition}} the 
> WHERE clauses it generates are the following:
> {code}
> (0) age < 4,0
> (1) age >= 4  AND age < 8,1
> (2) age >= 8  AND age < 12,2
> (3) age >= 12 AND age < 16,3
> (4) age >= 16 AND age < 20,4
> (5) age >= 20 AND age < 24,5
> (6) age >= 24 AND age < 28,6
> (7) age >= 28 AND age < 32,7
> (8) age >= 32 AND age < 36,8
> (9) age >= 36,9
> {code}
> The last condition ignores the upper bound and the other ones may result in 
> repeated rows being read.
> Using the JdbcRDD (and converting it to a DataFrame) I would have something 
> like this:
> {code}
> val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
>   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
>   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
> val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
> people.show()
> {code}
> Resulting in:
> {noformat}
> PERSON_ID NAMEAGE
> 3 Ana Rita Costa  12 
> 5 Miguel Costa15 
> 6 Anabela Sintra  13 
> 2 Lurdes Pereira  23 
> 4 Armando Pereira 32 
> {noformat}
> Which is correct!
> Confirming the WHERE clauses generated by the JdbcRDD in the 
> {{getPartitions}} I've found it generates the following:
> {code}
> (0) age >= 0  AND age <= 3
> (1) age >= 4  AND age <= 7
> (2) age >= 8  AND age <= 11
> (3) age >= 12 AND age <= 15
> (4) age >= 16 AND age <= 19
> (5) age >= 20 AND age <= 23
> (6) age >= 24 AND age <= 27
> (7) age >= 28 AND age <= 31
> (8) age >= 32 AND age <= 35
> (9) age >= 36 AND age <= 40
> {code}
> This is the behaviour I was expecting from the Spark SQL version. Is the 
> Spark SQL version buggy or is this some weird expected behaviour?






[jira] [Assigned] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6800:
---

Assignee: (was: Apache Spark)

> Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions 
> gives incorrect results.
> --
>
> Key: SPARK-6800
> URL: https://issues.apache.org/jira/browse/SPARK-6800
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, 
> Scala 2.10
>Reporter: Micael Capitão
>
> Having a Derby table with people info (id, name, age) defined like this:
> {code}
> val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
> val conn = DriverManager.getConnection(jdbcUrl)
> val stmt = conn.createStatement()
> stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS 
> IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
> {code}
> If I try to read that table from Spark SQL with lower/upper bounds, like this:
> {code}
> val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
>   columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
> people.show()
> {code}
> I get this result:
> {noformat}
> PERSON_ID NAME AGE
> 3 Ana Rita Costa   12 
> 5 Miguel Costa 15 
> 6 Anabela Sintra   13 
> 2 Lurdes Pereira   23 
> 4 Armando Pereira  32 
> 1 Armando Carvalho 50 
> {noformat}
> Which is wrong, considering the defined upper bound has been ignored (I get a 
> person with age 50!).
> Digging the code, I've found that in {{JDBCRelation.columnPartition}} the 
> WHERE clauses it generates are the following:
> {code}
> (0) age < 4,0
> (1) age >= 4  AND age < 8,1
> (2) age >= 8  AND age < 12,2
> (3) age >= 12 AND age < 16,3
> (4) age >= 16 AND age < 20,4
> (5) age >= 20 AND age < 24,5
> (6) age >= 24 AND age < 28,6
> (7) age >= 28 AND age < 32,7
> (8) age >= 32 AND age < 36,8
> (9) age >= 36,9
> {code}
> The last condition ignores the upper bound and the other ones may result in 
> repeated rows being read.
> Using the JdbcRDD (and converting it to a DataFrame) I would have something 
> like this:
> {code}
> val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
>   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
>   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
> val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
> people.show()
> {code}
> Resulting in:
> {noformat}
> PERSON_ID NAMEAGE
> 3 Ana Rita Costa  12 
> 5 Miguel Costa15 
> 6 Anabela Sintra  13 
> 2 Lurdes Pereira  23 
> 4 Armando Pereira 32 
> {noformat}
> Which is correct!
> Confirming the WHERE clauses generated by the JdbcRDD in the 
> {{getPartitions}} I've found it generates the following:
> {code}
> (0) age >= 0  AND age <= 3
> (1) age >= 4  AND age <= 7
> (2) age >= 8  AND age <= 11
> (3) age >= 12 AND age <= 15
> (4) age >= 16 AND age <= 19
> (5) age >= 20 AND age <= 23
> (6) age >= 24 AND age <= 27
> (7) age >= 28 AND age <= 31
> (8) age >= 32 AND age <= 35
> (9) age >= 36 AND age <= 40
> {code}
> This is the behaviour I was expecting from the Spark SQL version. Is the 
> Spark SQL version buggy or is this some weird expected behaviour?






[jira] [Created] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)
Erik van Oosten created SPARK-6878:
--

 Summary: Sum on empty RDD fails with exception
 Key: SPARK-6878
 URL: https://issues.apache.org/jira/browse/SPARK-6878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Erik van Oosten
Priority: Minor


{{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.

A simple fix is to replace

{noformat}
class DoubleRDDFunctions {
  def sum(): Double = self.reduce(_ + _)
{noformat} 

with:

{noformat}
class DoubleRDDFunctions {
  def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
{noformat}







[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492271#comment-14492271
 ] 

Sean Owen commented on SPARK-6878:
--

Interesting question -- what's the expected sum of nothing at all? Although I 
can see the argument both ways, 0 is probably the better result, since 
{{Array[Double]().sum}} is 0, so {{sc.parallelize(Array[Double]()).sum}} should 
be 0 as well. Want to make a PR?

> Sum on empty RDD fails with exception
> -
>
> Key: SPARK-6878
> URL: https://issues.apache.org/jira/browse/SPARK-6878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Erik van Oosten
>Priority: Minor
>
> {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
> A simple fix is to replace
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.reduce(_ + _)
> {noformat} 
> with:
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
> {noformat}






[jira] [Resolved] (SPARK-6738) EstimateSize is difference with spill file size

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6738.
--
Resolution: Not A Problem

We can reopen if there is more detail, but the problem report is focusing on 
the size of one spill file when there are lots of them. The in-memory size is 
also not necessarily the on-disk size. I haven't seen a report of a problem 
here either, like something that then fails.

> EstimateSize  is difference with spill file size
> 
>
> Key: SPARK-6738
> URL: https://issues.apache.org/jira/browse/SPARK-6738
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>
> ExternalAppendOnlyMap spills 2.2 GB of data to disk:
> {code}
> 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling 
> in-memory map of 2.2 GB to disk (61 times so far)
> 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: 
> /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
> {code}
> But the file size is only 2.2M.
> {code}
> ll -h 
> /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/
> total 2.2M
> -rw-r- 1 spark users 2.2M Apr  7 20:27 
> temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812
> {code}
> The GC log shows that the JVM memory usage is less than 1 GB.
> {code}
> 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs]
> 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs]
> 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs]
> 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs]
> 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs]
> 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs]
> 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs]
> {code}
> The estimateSize is hugely different from the spill file size.






[jira] [Resolved] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6868.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.2
 Assignee: Dean Chen

> Container link broken on Spark UI Executors page when YARN is set to 
> HTTPS_ONLY
> ---
>
> Key: SPARK-6868
> URL: https://issues.apache.org/jira/browse/SPARK-6868
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0
>Reporter: Dean Chen
>Assignee: Dean Chen
> Fix For: 1.3.2, 1.4.0
>
> Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png
>
>
> The stdout and stderr log links on the executor page will use the http:// 
> prefix even if the node manager does not support http and only supports 
> https (via setting yarn.http.policy=HTTPS_ONLY).
> Unfortunately the unencrypted http link in that case does not return a 404 
> but a binary file containing random binary chars. This causes a lot of 
> confusion for the end user since it seems like the log file exists and is 
> just filled with garbage. (see attached screenshot)
> The fix is to prefix container log links with https:// instead of http:// if 
> yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic as seen 
> here: 
> https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108
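
The scheme selection described above can be sketched as follows (the helper names are made up for illustration; only the yarn.http.policy key comes from the description):

{code}
import org.apache.hadoop.conf.Configuration

// Illustrative helper: choose the link prefix from YARN's HTTP policy
def logUriScheme(yarnConf: Configuration): String =
  if ("HTTPS_ONLY".equalsIgnoreCase(yarnConf.get("yarn.http.policy", "HTTP_ONLY"))) "https://"
  else "http://"

def containerLogUrl(yarnConf: Configuration, nodeHttpAddress: String, path: String): String =
  logUriScheme(yarnConf) + nodeHttpAddress + path
{code}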






[jira] [Updated] (SPARK-6868) Container link broken on Spark UI Executors page when YARN is set to HTTPS_ONLY

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6868:
-
Priority: Minor  (was: Major)

> Container link broken on Spark UI Executors page when YARN is set to 
> HTTPS_ONLY
> ---
>
> Key: SPARK-6868
> URL: https://issues.apache.org/jira/browse/SPARK-6868
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0
>Reporter: Dean Chen
>Assignee: Dean Chen
>Priority: Minor
> Fix For: 1.3.2, 1.4.0
>
> Attachments: Screen Shot 2015-04-11 at 11.49.21 PM.png
>
>
> The stdout and stderr log links on the executor page will use the http:// 
> prefix even if the node manager does not support http and only supports 
> https (via setting yarn.http.policy=HTTPS_ONLY).
> Unfortunately the unencrypted http link in that case does not return a 404 
> but a binary file containing random binary chars. This causes a lot of 
> confusion for the end user since it seems like the log file exists and is 
> just filled with garbage. (see attached screenshot)
> The fix is to prefix container log links with https:// instead of http:// if 
> yarn.http.policy=HTTPS_ONLY. YARN's job page has this exact logic as seen 
> here: 
> https://github.com/apache/hadoop/blob/e1109fb65608a668cd53dc324dadc6f63a74eeb9/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/webapp/JobBlock.java#L108






[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492282#comment-14492282
 ] 

Erik van Oosten commented on SPARK-6878:


The answer is only defined because the RDD is an {{RDD[Double]}} :)

Sure, I'll make a PR.

> Sum on empty RDD fails with exception
> -
>
> Key: SPARK-6878
> URL: https://issues.apache.org/jira/browse/SPARK-6878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Erik van Oosten
>Priority: Minor
>
> {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
> A simple fix is to replace
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.reduce(_ + _)
> {noformat} 
> with:
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
> {noformat}






[jira] [Comment Edited] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492282#comment-14492282
 ] 

Erik van Oosten edited comment on SPARK-6878 at 4/13/15 11:16 AM:
--

The answer is only defined because the RDD is an {{RDD[Double]}} :)

Sure, I'll make a PR.
Is the proposed solution acceptable?


was (Author: erikvanoosten):
The answer is only defined because the RDD is an {{RDD[Double]}} :)

Sure, I'll make a PR.

> Sum on empty RDD fails with exception
> -
>
> Key: SPARK-6878
> URL: https://issues.apache.org/jira/browse/SPARK-6878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Erik van Oosten
>Priority: Minor
>
> {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
> A simple fix is to replace
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.reduce(_ + _)
> {noformat} 
> with:
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
> {noformat}






[jira] [Updated] (SPARK-6762) Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6762:
-
Assignee: zhichao-li

> Fix potential resource leaks in CheckPoint CheckpointWriter and 
> CheckpointReader
> 
>
> Key: SPARK-6762
> URL: https://issues.apache.org/jira/browse/SPARK-6762
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: zhichao-li
>Assignee: zhichao-li
>Priority: Minor
> Fix For: 1.4.0
>
>
> The close action should be placed within a finally block to avoid potential 
> resource leaks.
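
The pattern being asked for is the usual try/finally around the close call; a generic sketch (stream and method names are placeholders):

{code}
import java.io.OutputStream

// Generic pattern: close the stream whether or not the write succeeds
def writeSafely(out: OutputStream, bytes: Array[Byte]): Unit = {
  try {
    out.write(bytes)
    out.flush()
  } finally {
    out.close()   // runs on both success and failure, avoiding the leak
  }
}
{code}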






[jira] [Resolved] (SPARK-6762) Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6762.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5407
[https://github.com/apache/spark/pull/5407]

> Fix potential resource leaks in CheckPoint CheckpointWriter and 
> CheckpointReader
> 
>
> Key: SPARK-6762
> URL: https://issues.apache.org/jira/browse/SPARK-6762
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: zhichao-li
>Priority: Minor
> Fix For: 1.4.0
>
>
> The close action should be placed within a finally block to avoid potential 
> resource leaks.






[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492284#comment-14492284
 ] 

Sean Owen commented on SPARK-6878:
--

Yes, and I think it could even be a little simpler by calling {{fold(0.0)(_ + 
_)}} ?
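
A quick local sketch comparing the two variants (the SparkContext setup here is illustrative; both return 0.0 on an empty RDD, while reduce throws):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("empty-sum"))
val empty = sc.parallelize(Seq.empty[Double])

// empty.reduce(_ + _)                       // throws: empty collection
println(empty.aggregate(0.0)(_ + _, _ + _))  // 0.0 -- the fix proposed in the description
println(empty.fold(0.0)(_ + _))              // 0.0 -- the slightly simpler variant
sc.stop()
{code}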

> Sum on empty RDD fails with exception
> -
>
> Key: SPARK-6878
> URL: https://issues.apache.org/jira/browse/SPARK-6878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Erik van Oosten
>Priority: Minor
>
> {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
> A simple fix is to replace
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.reduce(_ + _)
> {noformat} 
> with:
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
> {noformat}






[jira] [Resolved] (SPARK-6860) Fix the possible inconsistency of StreamingPage

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6860.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5470
[https://github.com/apache/spark/pull/5470]

> Fix the possible inconsistency of StreamingPage
> ---
>
> Key: SPARK-6860
> URL: https://issues.apache.org/jira/browse/SPARK-6860
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Reporter: Shixiong Zhu
> Fix For: 1.4.0
>
>
> Because "StreamingPage.render" doesn't hold the "listener" lock when 
> generating the content, the different parts of the content may have 
> inconsistent values if the "listener" updates its status at the same time, 
> which will confuse people.
> We should add "listener.synchronized" to make sure we have a consistent view 
> of StreamingJobProgressListener when creating the content.
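
Schematically, the idea is to take one consistent snapshot under the listener's lock before rendering (class and field names below are placeholders, not the real StreamingJobProgressListener API):

{code}
// Schematic sketch of rendering from a single locked snapshot
class ProgressListener {
  private var received = 0L
  private var processed = 0L

  def update(r: Long, p: Long): Unit = synchronized { received = r; processed = p }

  // One snapshot under the lock instead of reading fields piecemeal while they change
  def snapshot(): (Long, Long) = synchronized { (received, processed) }
}

def renderPage(listener: ProgressListener): String = {
  val (received, processed) = listener.snapshot()
  s"received=$received, processed=$processed"
}
{code}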






[jira] [Updated] (SPARK-6860) Fix the possible inconsistency of StreamingPage

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6860:
-
Priority: Minor  (was: Major)
Assignee: Shixiong Zhu

> Fix the possible inconsistency of StreamingPage
> ---
>
> Key: SPARK-6860
> URL: https://issues.apache.org/jira/browse/SPARK-6860
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, Web UI
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 1.4.0
>
>
> Because "StreamingPage.render" doesn't hold the "listener" lock when 
> generating the content, the different parts of the content may have 
> inconsistent values if the "listener" updates its status at the same time, 
> which will confuse people.
> We should add "listener.synchronized" to make sure we have a consistent view 
> of StreamingJobProgressListener when creating the content.






[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492302#comment-14492302
 ] 

Erik van Oosten commented on SPARK-6878:


Ah, yes. I now see that fold also first reduces per partition.

> Sum on empty RDD fails with exception
> -
>
> Key: SPARK-6878
> URL: https://issues.apache.org/jira/browse/SPARK-6878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Erik van Oosten
>Priority: Minor
>
> {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
> A simple fix is to replace
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.reduce(_ + _)
> {noformat} 
> with:
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
> {noformat}






[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2015-04-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492303#comment-14492303
 ] 

Steve Loughran commented on SPARK-1537:
---

HADOOP-11826 patches the Hadoop compatibility document to add the timeline 
server to the list of stable APIs.

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.






[jira] [Resolved] (SPARK-6440) ipv6 URI for HttpServer

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6440.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5424
[https://github.com/apache/spark/pull/5424]

> ipv6 URI for HttpServer
> ---
>
> Key: SPARK-6440
> URL: https://issues.apache.org/jira/browse/SPARK-6440
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
> Environment: java 7 hotspot, spark 1.3.0, ipv6 only cluster
>Reporter: Arsenii Krasikov
>Priority: Minor
> Fix For: 1.4.0
>
>
> In {{org.apache.spark.HttpServer}} the URI is generated as {code:java}"spark://" 
> + localHostname + ":" + masterPort{code}, where {{localHostname}} is 
> {code:java} org.apache.spark.util.Utils.localHostName() = 
> customHostname.getOrElse(localIpAddressHostname){code}. If the host has an 
> IPv6 address, it is interpolated into an invalid URI: 
> {{spark://fe80:0:0:0:200:f8ff:fe21:67cf:42}} instead of 
> {{spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42}}.
> The solution is to handle the URI and the hostname separately.
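
The kind of host formatting the fix implies can be sketched like this (the helper name is made up; only the bracketing rule comes from the description):

{code}
// Illustrative helper: bracket raw IPv6 literals so the host:port URI stays valid
def hostPortUri(host: String, port: Int): String = {
  val safeHost = if (host.contains(":") && !host.startsWith("[")) s"[$host]" else host
  s"spark://$safeHost:$port"
}

println(hostPortUri("fe80:0:0:0:200:f8ff:fe21:67cf", 42))  // spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42
println(hostPortUri("example.org", 7077))                  // spark://example.org:7077
{code}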






[jira] [Updated] (SPARK-6440) ipv6 URI for HttpServer

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6440:
-
Assignee: Arsenii Krasikov

> ipv6 URI for HttpServer
> ---
>
> Key: SPARK-6440
> URL: https://issues.apache.org/jira/browse/SPARK-6440
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
> Environment: java 7 hotspot, spark 1.3.0, ipv6 only cluster
>Reporter: Arsenii Krasikov
>Assignee: Arsenii Krasikov
>Priority: Minor
> Fix For: 1.4.0
>
>
> In {{org.apache.spark.HttpServer}} the URI is generated as {code:java}"spark://" 
> + localHostname + ":" + masterPort{code}, where {{localHostname}} is 
> {code:java} org.apache.spark.util.Utils.localHostName() = 
> customHostname.getOrElse(localIpAddressHostname){code}. If the host has an 
> IPv6 address, it is interpolated into an invalid URI: 
> {{spark://fe80:0:0:0:200:f8ff:fe21:67cf:42}} instead of 
> {{spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42}}.
> The solution is to handle the URI and the hostname separately.






[jira] [Resolved] (SPARK-6671) Add status command for spark daemons

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6671.
--
Resolution: Fixed

Issue resolved by pull request 5327
[https://github.com/apache/spark/pull/5327]

> Add status command for spark daemons
> 
>
> Key: SPARK-6671
> URL: https://issues.apache.org/jira/browse/SPARK-6671
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: PRADEEP CHANUMOLU
>  Labels: easyfix
> Fix For: 1.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, using the spark-daemon.sh script we can start and stop the Spark 
> daemons, but we cannot get their status. It would be nice to include a status 
> command in the spark-daemon.sh script, through which we can know whether a 
> Spark daemon is alive or not.






[jira] [Created] (SPARK-6879) Check if the app is completed before clean it up

2015-04-13 Thread Tao Wang (JIRA)
Tao Wang created SPARK-6879:
---

 Summary: Check if the app is completed before clean it up
 Key: SPARK-6879
 URL: https://issues.apache.org/jira/browse/SPARK-6879
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Tao Wang


Currently the history server deletes a directory that expires according to its 
modification time. This is not good for long-running applications, as they 
might be deleted before they finish.
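
A minimal sketch of the intended guard (field and method names are illustrative, not the actual history server code): only logs of completed applications are considered for expiry.

{code}
// Illustrative only: expire a log directory by age, but never while the app is still running
case class AppLogInfo(path: String, lastModified: Long, completed: Boolean)

def expiredLogs(logs: Seq[AppLogInfo], now: Long, maxAgeMs: Long): Seq[AppLogInfo] =
  logs.filter(log => log.completed && now - log.lastModified > maxAgeMs)
{code}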






[jira] [Updated] (SPARK-6671) Add status command for spark daemons

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6671:
-
Priority: Minor  (was: Major)
Assignee: PRADEEP CHANUMOLU

> Add status command for spark daemons
> 
>
> Key: SPARK-6671
> URL: https://issues.apache.org/jira/browse/SPARK-6671
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: PRADEEP CHANUMOLU
>Assignee: PRADEEP CHANUMOLU
>Priority: Minor
>  Labels: easyfix
> Fix For: 1.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, using the spark-daemon.sh script we can start and stop the Spark 
> daemons, but we cannot get their status. It would be nice to include a status 
> command in the spark-daemon.sh script, through which we can know whether a 
> Spark daemon is alive or not.






[jira] [Resolved] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6870.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5479
[https://github.com/apache/spark/pull/5479]

> Catch InterruptedException when yarn application state monitor thread been 
> interrupted
> --
>
> Key: SPARK-6870
> URL: https://issues.apache.org/jira/browse/SPARK-6870
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Weizhong
>Priority: Minor
> Fix For: 1.4.0
>
>
> In PR #5305 we interrupt the monitor thread but forget to catch the 
> InterruptedException, so the stack trace gets printed in the log; we need 
> to catch it.
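
A generic sketch of the pattern (the thread body is a placeholder, not the actual monitor code): treat interruption as a normal shutdown signal instead of letting the stack trace reach the log.

{code}
// Generic pattern: swallow the expected InterruptedException on shutdown
val monitorThread = new Thread(new Runnable {
  override def run(): Unit = {
    try {
      while (!Thread.currentThread().isInterrupted) {
        // ... poll the application state here ...
        Thread.sleep(1000)
      }
    } catch {
      case _: InterruptedException => // expected when the thread is interrupted; exit quietly
    }
  }
})
monitorThread.setDaemon(true)
{code}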






[jira] [Updated] (SPARK-6870) Catch InterruptedException when yarn application state monitor thread been interrupted

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6870:
-
Priority: Trivial  (was: Minor)
Assignee: Weizhong

> Catch InterruptedException when yarn application state monitor thread been 
> interrupted
> --
>
> Key: SPARK-6870
> URL: https://issues.apache.org/jira/browse/SPARK-6870
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Weizhong
>Assignee: Weizhong
>Priority: Trivial
> Fix For: 1.4.0
>
>
> In PR #5305 we interrupt the monitor thread but forget to catch the 
> InterruptedException, so the stack trace gets printed in the log; we need 
> to catch it.






[jira] [Assigned] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6878:
---

Assignee: Apache Spark

> Sum on empty RDD fails with exception
> -
>
> Key: SPARK-6878
> URL: https://issues.apache.org/jira/browse/SPARK-6878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Erik van Oosten
>Assignee: Apache Spark
>Priority: Minor
>
> {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
> A simple fix is to replace
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.reduce(_ + _)
> {noformat} 
> with:
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
> {noformat}






[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492336#comment-14492336
 ] 

Erik van Oosten commented on SPARK-6878:


Pull request: https://github.com/apache/spark/pull/5489

> Sum on empty RDD fails with exception
> -
>
> Key: SPARK-6878
> URL: https://issues.apache.org/jira/browse/SPARK-6878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Erik van Oosten
>Priority: Minor
>
> {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
> A simple fix is to replace
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.reduce(_ + _)
> {noformat} 
> with:
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
> {noformat}






[jira] [Assigned] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6878:
---

Assignee: (was: Apache Spark)

> Sum on empty RDD fails with exception
> -
>
> Key: SPARK-6878
> URL: https://issues.apache.org/jira/browse/SPARK-6878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Erik van Oosten
>Priority: Minor
>
> {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
> A simple fix is to replace
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.reduce(_ + _)
> {noformat} 
> with:
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492335#comment-14492335
 ] 

Apache Spark commented on SPARK-6878:
-

User 'erikvanoosten' has created a pull request for this issue:
https://github.com/apache/spark/pull/5489

> Sum on empty RDD fails with exception
> -
>
> Key: SPARK-6878
> URL: https://issues.apache.org/jira/browse/SPARK-6878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Erik van Oosten
>Priority: Minor
>
> {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
> A simple fix is to replace
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.reduce(_ + _)
> {noformat} 
> with:
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6878) Sum on empty RDD fails with exception

2015-04-13 Thread Erik van Oosten (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik van Oosten updated SPARK-6878:
---
Flags: Patch

> Sum on empty RDD fails with exception
> -
>
> Key: SPARK-6878
> URL: https://issues.apache.org/jira/browse/SPARK-6878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Erik van Oosten
>Priority: Minor
>
> {{Sum}} on an empty RDD throws an exception. Expected result is {{0}}.
> A simple fix is to replace
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.reduce(_ + _)
> {noformat} 
> with:
> {noformat}
> class DoubleRDDFunctions {
>   def sum(): Double = self.aggregate(0.0)(_ + _, _ + _)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492346#comment-14492346
 ] 

Sean Owen commented on SPARK-4783:
--

I have a PR ready, but am testing it. I am seeing test failures but am not sure 
if they're related. You are also welcome to go ahead with a PR if you think you 
have a handle on it and I can chime in with what I know.

> System.exit() calls in SparkContext disrupt applications embedding Spark
> 
>
> Key: SPARK-4783
> URL: https://issues.apache.org/jira/browse/SPARK-4783
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: David Semeria
>
> A common architectural choice for integrating Spark within a larger 
> application is to employ a gateway to handle Spark jobs. The gateway is a 
> server which contains one or more long-running sparkcontexts.
> A typical server is created with the following pseudo code:
> var continue = true
> while (continue){
>  try {
> server.run() 
>   } catch (e) {
>   continue = log_and_examine_error(e)
> }
> The problem is that sparkcontext frequently calls System.exit when it 
> encounters a problem which means the server can only be re-spawned at the 
> process level, which is much more messy than the simple code above.
> Therefore, I believe it makes sense to replace all System.exit calls in 
> sparkcontext with the throwing of a fatal error. 
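For illustration, a hedged, slightly fleshed-out version of the reporter's gateway loop; {{server.run()}} and {{logAndExamineError}} are placeholders for application code, and the loop only recovers if SparkContext throws instead of calling System.exit():

{code}
// Sketch only: 'server' and 'logAndExamineError' are hypothetical application code.
var keepRunning = true
while (keepRunning) {
  try {
    server.run()   // embeds one or more long-running SparkContexts
  } catch {
    // A thrown error can be examined and the server re-created in-process;
    // a System.exit() inside SparkContext kills the whole gateway instead.
    case e: Throwable => keepRunning = logAndExamineError(e)
  }
}
{code}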



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6879) Check if the app is completed before clean it up

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492348#comment-14492348
 ] 

Apache Spark commented on SPARK-6879:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/5491

> Check if the app is completed before clean it up
> 
>
> Key: SPARK-6879
> URL: https://issues.apache.org/jira/browse/SPARK-6879
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Tao Wang
>
> Currently the history server deletes directories that expire according to 
> their modification time. This is not good for long-running applications, as 
> they might be deleted before they finish.
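A hedged sketch of the intended check; the {{AppEntry}} fields and helper names below are illustrative, not the actual history server internals:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: skip applications that are still running when cleaning up.
case class AppEntry(logPath: String, completed: Boolean, lastUpdatedMs: Long)

def cleanExpired(fs: FileSystem, apps: Seq[AppEntry], maxAgeMs: Long): Unit = {
  val now = System.currentTimeMillis()
  apps.filter(a => a.completed && now - a.lastUpdatedMs > maxAgeMs)
      .foreach(a => fs.delete(new Path(a.logPath), true))   // recursive delete
}
{code}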



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6879) Check if the app is completed before clean it up

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6879:
---

Assignee: Apache Spark

> Check if the app is completed before clean it up
> 
>
> Key: SPARK-6879
> URL: https://issues.apache.org/jira/browse/SPARK-6879
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Tao Wang
>Assignee: Apache Spark
>
> Currently the history server deletes directories that expire according to 
> their modification time. This is not good for long-running applications, as 
> they might be deleted before they finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6879) Check if the app is completed before clean it up

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6879:
---

Assignee: (was: Apache Spark)

> Check if the app is completed before clean it up
> 
>
> Key: SPARK-6879
> URL: https://issues.apache.org/jira/browse/SPARK-6879
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Reporter: Tao Wang
>
> Currently the history server deletes directories that expire according to 
> their modification time. This is not good for long-running applications, as 
> they might be deleted before they finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5689) Document what can be run in different YARN modes

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5689:
---

Assignee: Apache Spark

> Document what can be run in different YARN modes
> 
>
> Key: SPARK-5689
> URL: https://issues.apache.org/jira/browse/SPARK-5689
> Project: Spark
>  Issue Type: Documentation
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> We should document what can be run in the different YARN modes. For 
> instance, the interactive shell only works in yarn-client mode; recently, with 
> https://github.com/apache/spark/pull/3976, users can run Python scripts in 
> cluster mode, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5689) Document what can be run in different YARN modes

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5689:
---

Assignee: (was: Apache Spark)

> Document what can be run in different YARN modes
> 
>
> Key: SPARK-5689
> URL: https://issues.apache.org/jira/browse/SPARK-5689
> Project: Spark
>  Issue Type: Documentation
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>
> We should document what can be run in the different YARN modes. For 
> instance, the interactive shell only works in yarn-client mode; recently, with 
> https://github.com/apache/spark/pull/3976, users can run Python scripts in 
> cluster mode, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5689) Document what can be run in different YARN modes

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492349#comment-14492349
 ] 

Apache Spark commented on SPARK-5689:
-

User 'Sephiroth-Lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5490

> Document what can be run in different YARN modes
> 
>
> Key: SPARK-5689
> URL: https://issues.apache.org/jira/browse/SPARK-5689
> Project: Spark
>  Issue Type: Documentation
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>
> We should document what can be run in the different YARN modes. For 
> instance, the interactive shell only works in yarn-client mode; recently, with 
> https://github.com/apache/spark/pull/3976, users can run Python scripts in 
> cluster mode, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

2015-04-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492364#comment-14492364
 ] 

Micael Capitão commented on SPARK-6800:
---

The above pull request seems to fix only the upper and lower bounds issue. There 
is still the intermediate-queries issue, which may result in repeated rows being 
fetched from the DB.

> Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions 
> gives incorrect results.
> --
>
> Key: SPARK-6800
> URL: https://issues.apache.org/jira/browse/SPARK-6800
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
> Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, 
> Scala 2.10
>Reporter: Micael Capitão
>
> Having a Derby table with people info (id, name, age) defined like this:
> {code}
> val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
> val conn = DriverManager.getConnection(jdbcUrl)
> val stmt = conn.createStatement()
> stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS 
> IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
> {code}
> If I try to read that table from Spark SQL with lower/upper bounds, like this:
> {code}
> val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
>   columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
> people.show()
> {code}
> I get this result:
> {noformat}
> PERSON_ID NAME AGE
> 3 Ana Rita Costa   12 
> 5 Miguel Costa 15 
> 6 Anabela Sintra   13 
> 2 Lurdes Pereira   23 
> 4 Armando Pereira  32 
> 1 Armando Carvalho 50 
> {noformat}
> Which is wrong, considering the defined upper bound has been ignored (I get a 
> person with age 50!).
> Digging the code, I've found that in {{JDBCRelation.columnPartition}} the 
> WHERE clauses it generates are the following:
> {code}
> (0) age < 4,0
> (1) age >= 4  AND age < 8,1
> (2) age >= 8  AND age < 12,2
> (3) age >= 12 AND age < 16,3
> (4) age >= 16 AND age < 20,4
> (5) age >= 20 AND age < 24,5
> (6) age >= 24 AND age < 28,6
> (7) age >= 28 AND age < 32,7
> (8) age >= 32 AND age < 36,8
> (9) age >= 36,9
> {code}
> The last condition ignores the upper bound and the other ones may result in 
> repeated rows being read.
> Using the JdbcRDD (and converting it to a DataFrame) I would have something 
> like this:
> {code}
> val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
>   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
>   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
> val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
> people.show()
> {code}
> Resulting in:
> {noformat}
> PERSON_ID NAMEAGE
> 3 Ana Rita Costa  12 
> 5 Miguel Costa15 
> 6 Anabela Sintra  13 
> 2 Lurdes Pereira  23 
> 4 Armando Pereira 32 
> {noformat}
> Which is correct!
> Confirming the WHERE clauses generated by the JdbcRDD in the 
> {{getPartitions}} I've found it generates the following:
> {code}
> (0) age >= 0  AND age <= 3
> (1) age >= 4  AND age <= 7
> (2) age >= 8  AND age <= 11
> (3) age >= 12 AND age <= 15
> (4) age >= 16 AND age <= 19
> (5) age >= 20 AND age <= 23
> (6) age >= 24 AND age <= 27
> (7) age >= 28 AND age <= 31
> (8) age >= 32 AND age <= 35
> (9) age >= 36 AND age <= 40
> {code}
> This is the behaviour I was expecting from the Spark SQL version. Is the 
> Spark SQL version buggy or is this some weird expected behaviour?
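For illustration, a hedged sketch of partition-predicate generation that honours both bounds and keeps ranges inclusive and non-overlapping; it is simplified to Long bounds and is not the actual JDBCRelation.columnPartition code:

{code}
// Sketch only; mirrors the JdbcRDD behaviour shown above for ("age", 0, 40, 10).
def partitionPredicates(col: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  val stride = (upper - lower + 1) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    // the last partition absorbs the remainder and still honours the upper bound
    val hi = if (i == numPartitions - 1) upper else lo + stride - 1
    s"$col >= $lo AND $col <= $hi"
  }
}
{code}

Applied to the example above, this yields inclusive, non-overlapping ranges ending in {{age >= 36 AND age <= 40}}, matching the JdbcRDD output: no row is read twice and the upper bound is respected.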



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6875) Add support for Joda-time types

2015-04-13 Thread Patrick Grandjean (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Grandjean updated SPARK-6875:
-
Description: 
The need comes from the following use case:

val objs: RDD[MyClass] = [...]

val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._

objs.saveAsParquetFile("parquet")

MyClass contains joda-time fields. When saving to parquet file, an exception is 
thrown (matchError in ScalaReflection.scala).

Spark SQL supports java SQL date/time types. This request is to add support for 
Joda-time types. 

Another alternative would be, in addition to annotations, to be able to 
programmatically and dynamically add UDTs.

  was:
The need comes from the following use case:

val objs: RDD[MyClass] = [...]

val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._

objs.saveAsParquetFile("parquet")

MyClass contains joda-time fields. When saving to parquet file, an exception is 
thrown (matchError in ScalaReflection.scala).

Spark SQL supports java SQL date/time types. This request is to add support for 
Joda-time types.


> Add support for Joda-time types
> ---
>
> Key: SPARK-6875
> URL: https://issues.apache.org/jira/browse/SPARK-6875
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Patrick Grandjean
>
> The need comes from the following use case:
> val objs: RDD[MyClass] = [...]
> val sqlC = new org.apache.spark.sql.SQLContext(sc)
> import sqlC._
> objs.saveAsParquetFile("parquet")
> MyClass contains joda-time fields. When saving to parquet file, an exception 
> is thrown (matchError in ScalaReflection.scala).
> Spark SQL supports java SQL date/time types. This request is to add support 
> for Joda-time types. 
> Another alternative would be, in addition to annotations, to be able to 
> programmatically and dynamically add UDTs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4783:
---

Assignee: (was: Apache Spark)

> System.exit() calls in SparkContext disrupt applications embedding Spark
> 
>
> Key: SPARK-4783
> URL: https://issues.apache.org/jira/browse/SPARK-4783
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: David Semeria
>
> A common architectural choice for integrating Spark within a larger 
> application is to employ a gateway to handle Spark jobs. The gateway is a 
> server which contains one or more long-running sparkcontexts.
> A typical server is created with the following pseudo code:
> var continue = true
> while (continue){
>  try {
> server.run() 
>   } catch (e) {
>   continue = log_and_examine_error(e)
> }
> The problem is that sparkcontext frequently calls System.exit when it 
> encounters a problem which means the server can only be re-spawned at the 
> process level, which is much more messy than the simple code above.
> Therefore, I believe it makes sense to replace all System.exit calls in 
> sparkcontext with the throwing of a fatal error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492393#comment-14492393
 ] 

Apache Spark commented on SPARK-4783:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5492

> System.exit() calls in SparkContext disrupt applications embedding Spark
> 
>
> Key: SPARK-4783
> URL: https://issues.apache.org/jira/browse/SPARK-4783
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: David Semeria
>
> A common architectural choice for integrating Spark within a larger 
> application is to employ a gateway to handle Spark jobs. The gateway is a 
> server which contains one or more long-running sparkcontexts.
> A typical server is created with the following pseudo code:
> var continue = true
> while (continue){
>  try {
> server.run() 
>   } catch (e) {
>   continue = log_and_examine_error(e)
> }
> The problem is that sparkcontext frequently calls System.exit when it 
> encounters a problem which means the server can only be re-spawned at the 
> process level, which is much more messy than the simple code above.
> Therefore, I believe it makes sense to replace all System.exit calls in 
> sparkcontext with the throwing of a fatal error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4783) System.exit() calls in SparkContext disrupt applications embedding Spark

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4783:
---

Assignee: Apache Spark

> System.exit() calls in SparkContext disrupt applications embedding Spark
> 
>
> Key: SPARK-4783
> URL: https://issues.apache.org/jira/browse/SPARK-4783
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: David Semeria
>Assignee: Apache Spark
>
> A common architectural choice for integrating Spark within a larger 
> application is to employ a gateway to handle Spark jobs. The gateway is a 
> server which contains one or more long-running sparkcontexts.
> A typical server is created with the following pseudo code:
> var continue = true
> while (continue){
>  try {
> server.run() 
>   } catch (e) {
>   continue = log_and_examine_error(e)
> }
> The problem is that sparkcontext frequently calls System.exit when it 
> encounters a problem which means the server can only be re-spawned at the 
> process level, which is much more messy than the simple code above.
> Therefore, I believe it makes sense to replace all System.exit calls in 
> sparkcontext with the throwing of a fatal error. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6875) Add support for Joda-time types

2015-04-13 Thread Patrick Grandjean (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Grandjean updated SPARK-6875:
-
Description: 
The need comes from the following use case:

val objs: RDD[MyClass] = [...]

val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._

objs.saveAsParquetFile("parquet")

MyClass contains joda-time fields. When saving to parquet file, an exception is 
thrown (matchError in ScalaReflection.scala).

Spark SQL supports java SQL date/time types. This request is to add support for 
Joda-time types. 

It is possible to define UDT's using the @SQLUserDefinedType annotation. 
However, in addition to annotations, it would be nice to be able to 
programmatically/dynamically add UDTs.

  was:
The need comes from the following use case:

val objs: RDD[MyClass] = [...]

val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._

objs.saveAsParquetFile("parquet")

MyClass contains joda-time fields. When saving to parquet file, an exception is 
thrown (matchError in ScalaReflection.scala).

Spark SQL supports java SQL date/time types. This request is to add support for 
Joda-time types. 

Another alternative would be, in addition to annotations, to be able to 
programmatically and dynamically add UDTs.


> Add support for Joda-time types
> ---
>
> Key: SPARK-6875
> URL: https://issues.apache.org/jira/browse/SPARK-6875
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Patrick Grandjean
>
> The need comes from the following use case:
> val objs: RDD[MyClass] = [...]
> val sqlC = new org.apache.spark.sql.SQLContext(sc)
> import sqlC._
> objs.saveAsParquetFile("parquet")
> MyClass contains joda-time fields. When saving to parquet file, an exception 
> is thrown (matchError in ScalaReflection.scala).
> Spark SQL supports java SQL date/time types. This request is to add support 
> for Joda-time types. 
> It is possible to define UDT's using the @SQLUserDefinedType annotation. 
> However, in addition to annotations, it would be nice to be able to 
> programmatically/dynamically add UDTs.
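As an illustration of the annotation route, a hedged sketch of a Joda UDT. The {{UserDefinedType}} developer API may differ between Spark versions, and because joda's {{DateTime}} cannot itself be annotated, a wrapper class is needed here, which is exactly why programmatic/dynamic registration would help:

{code}
import org.apache.spark.sql.types._
import org.joda.time.DateTime

// Hypothetical wrapper: the annotation has to live on a class we own.
@SQLUserDefinedType(udt = classOf[JodaDateTimeUDT])
class JodaDateTimeValue(val dt: DateTime)

class JodaDateTimeUDT extends UserDefinedType[JodaDateTimeValue] {
  override def sqlType: DataType = LongType
  override def serialize(obj: Any): Any = obj match {
    case v: JodaDateTimeValue => v.dt.getMillis            // store as epoch millis
  }
  override def deserialize(datum: Any): JodaDateTimeValue = datum match {
    case millis: Long => new JodaDateTimeValue(new DateTime(millis))
  }
  override def userClass: Class[JodaDateTimeValue] = classOf[JodaDateTimeValue]
}
{code}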



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile

2015-04-13 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6352.
---
  Resolution: Fixed
   Fix Version/s: 1.4.0
Target Version/s: 1.4.0

Resolved by https://github.com/apache/spark/pull/5042

[~pwendell] I tried to assign this ticket to [~pllee], but couldn't put his name 
in the Assignee field. Do we need to set up some privileges?

> Supporting non-default OutputCommitter when using saveAsParquetFile
> ---
>
> Key: SPARK-6352
> URL: https://issues.apache.org/jira/browse/SPARK-6352
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.1, 1.2.1, 1.3.0
>Reporter: Pei-Lun Lee
> Fix For: 1.4.0
>
>
> SPARK-3595 only handles a custom OutputCommitter for saveAsHadoopFile; it would 
> be nice to have similar behavior in saveAsParquetFile. It may be difficult to 
> have a fully customizable OutputCommitter solution, but at least adding something 
> like a DirectParquetOutputCommitter and letting users choose between this and 
> the default should be enough.
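For reference, a hedged sketch of how a user might opt in once this is exposed, assuming an existing {{sqlContext}} and DataFrame {{df}}; the config key and committer class name below are assumptions based on the PR, not confirmed API:

{code}
// Assumption: the committer is selectable through a SQLConf key; check the merged
// PR for the exact key and class names.
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
df.saveAsParquetFile("/tmp/people.parquet")   // commits without a _temporary rename step
{code}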



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6207) YARN secure cluster mode doesn't obtain a hive-metastore token

2015-04-13 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-6207.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

> YARN secure cluster mode doesn't obtain a hive-metastore token 
> ---
>
> Key: SPARK-6207
> URL: https://issues.apache.org/jira/browse/SPARK-6207
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL, YARN
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
> Environment: YARN
>Reporter: Doug Balog
> Fix For: 1.4.0
>
>
> When running a spark job, on YARN in secure mode, with "--deploy-mode 
> cluster",  org.apache.spark.deploy.yarn.Client() does not obtain a delegation 
> token to the hive-metastore. Therefore any attempts to talk to the 
> hive-metastore fail with a "GSSException: No valid credentials provided..."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread pankaj arora (JIRA)
pankaj arora created SPARK-6880:
---

 Summary: Spark Shutdowns with NoSuchElementException when running 
parallel collect on cachedRDD
 Key: SPARK-6880
 URL: https://issues.apache.org/jira/browse/SPARK-6880
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: CentOs6.0, java7
Reporter: pankaj arora
 Fix For: 1.3.2


Spark Shutdowns with NoSuchElementException when running parallel collect on 
cachedRDDs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint

2015-04-13 Thread Hao (JIRA)
Hao created SPARK-6881:
--

 Summary: Change the checkpoint directory name from checkpoints to 
checkpoint
 Key: SPARK-6881
 URL: https://issues.apache.org/jira/browse/SPARK-6881
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Hao
Priority: Trivial


Name "checkpoint" instead of "checkpoints" is included in .gitignore



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6881:
---

Assignee: Apache Spark

> Change the checkpoint directory name from checkpoints to checkpoint
> ---
>
> Key: SPARK-6881
> URL: https://issues.apache.org/jira/browse/SPARK-6881
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Hao
>Assignee: Apache Spark
>Priority: Trivial
>
> Name "checkpoint" instead of "checkpoints" is included in .gitignore



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492548#comment-14492548
 ] 

Apache Spark commented on SPARK-6881:
-

User 'hlin09' has created a pull request for this issue:
https://github.com/apache/spark/pull/5493

> Change the checkpoint directory name from checkpoints to checkpoint
> ---
>
> Key: SPARK-6881
> URL: https://issues.apache.org/jira/browse/SPARK-6881
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Hao
>Priority: Trivial
>
> Name "checkpoint" instead of "checkpoints" is included in .gitignore



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6881:
---

Assignee: (was: Apache Spark)

> Change the checkpoint directory name from checkpoints to checkpoint
> ---
>
> Key: SPARK-6881
> URL: https://issues.apache.org/jira/browse/SPARK-6881
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Hao
>Priority: Trivial
>
> Name "checkpoint" instead of "checkpoints" is included in .gitignore



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6880:
-
Target Version/s:   (was: 1.3.2)
   Fix Version/s: (was: 1.3.2)

(Don't assign Target / Fix Version)

This is not a valid JIRA, as there is no detail. If you intend to add detail 
later, OK, but please next time wait until you have all of that information 
ready before opening a JIRA. Otherwise I'm going to close this.

> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDD
> --
>
> Key: SPARK-6880
> URL: https://issues.apache.org/jira/browse/SPARK-6880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: CentOs6.0, java7
>Reporter: pankaj arora
>
> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDDs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6765) Turn scalastyle on for test code

2015-04-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6765.

   Resolution: Fixed
Fix Version/s: 1.4.0

> Turn scalastyle on for test code
> 
>
> Key: SPARK-6765
> URL: https://issues.apache.org/jira/browse/SPARK-6765
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, Tests
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.4.0
>
>
> We should turn scalastyle on for test code. Test code should be as important 
> as main code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-04-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492618#comment-14492618
 ] 

Yin Huai commented on SPARK-5791:
-

[~jameszhouyi] Thank you for the update :) For Hive, it also used Parquet in 
your last run, right?

> [Spark SQL] show poor performance when multiple table do join operation
> ---
>
> Key: SPARK-5791
> URL: https://issues.apache.org/jira/browse/SPARK-5791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Yi Zhou
> Attachments: Physcial_Plan_Hive.txt, 
> Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt
>
>
> Spark SQL shows poor performance when multiple tables do join operation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-04-13 Thread Andrew Lee (JIRA)
Andrew Lee created SPARK-6882:
-

 Summary: Spark ThriftServer2 Kerberos failed encountering 
java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: 
[auth-int, auth-conf, auth]
 Key: SPARK-6882
 URL: https://issues.apache.org/jira/browse/SPARK-6882
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0, 1.2.1
 Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled
* Apache Hive 0.13.1
* Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
* Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
Reporter: Andrew Lee


When Kerberos is enabled, I get the following exceptions. 
{code}
2015-03-13 18:26:05,363 ERROR 
org.apache.hive.service.cli.thrift.ThriftCLIService 
(ThriftBinaryCLIService.java:run(93)) - Error: 
java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: 
[auth-int, auth-conf, auth]
{code}

I tried it in
* Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
* Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851

with
* Apache Hive 0.13.1
* Apache Hadoop 2.4.1

Build command
{code}
mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver 
-Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests 
install
{code}

When starting Spark ThriftServer in {{yarn-client}} mode, the command to start 
thriftserver looks like this

{code}
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf 
hive.server2.thrift.bind.host=$(hostname) --master yarn-client
{code}

{{hostname}} points to the current hostname of the machine I'm using.

Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1)
{code}
2015-03-13 18:26:05,363 ERROR 
org.apache.hive.service.cli.thrift.ThriftCLIService 
(ThriftBinaryCLIService.java:run(93)) - Error: 
java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: 
[auth-int, auth-conf, auth]
at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
at 
org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
at 
org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
at 
org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
at java.lang.Thread.run(Thread.java:744)
{code}

I'm wondering if this is the same problem described in HIVE-8154 and HIVE-7620, 
due to an older code base for the Spark ThriftServer?

Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to run 
against a Kerberos cluster (Apache 2.4.1).

My hive-site.xml looks like the following for spark/conf.
The kerberos keytab and tgt are configured correctly, I'm able to connect to 
metastore, but the subsequent steps failed due to the exception.
{code}
<property>
  <name>hive.semantic.analyzer.factory.impl</name>
  <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
</property>
<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
</property>
<property>
  <name>hive.stats.autogather</name>
  <value>false</value>
</property>
<property>
  <name>hive.session.history.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/tmp/home/hive/log/${user.name}</value>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive/scratch/${user.name}</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://somehostname:9083</value>
</property>

<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>***</value>
</property>
<property>
  <name>hive.server2.thrift.sasl.qop</name>
  <value>auth</value>
  <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>

<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.cache.pinobjtypes</name>
  <value>Table,Database,Type,FieldSchema,Order</value>
</property>
<property>
  <name>hdfs_sentinel_file</name>
  <value>***</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/hive</value>
</property>
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>600</value>
</property>
<property>
  <name>hive.warehouse.subdir.inherit.perms</name>
  <value>true</value>
</property>
{code}

Here, I'm attaching more detailed logs from Spark 1.3 rc1.
{code}
2015-04-13 16:37:20,688 INFO  org.apache.hadoop.security.UserGroupInformation 
(UserGroupInformation.java:loginUserFromKeytab(893)) - Login successful for 
user hiveserver/alee-vpc2-dt.test.testhost@test.testhost.com using keytab 
file /etc/testhost/secrets/hiveserver.keytab
2015-04-13 16:37:20,689 INFO  org.apache.hive.service.AbstractService 
(SparkSQLSessionManager.scala:init(43)) - HiveServer2: Async execution pool 
size 100
2015-04-13 16:37:20,691 INFO  org.apache.hive.service.AbstractService 
(AbstractService.java:init(89)) - Service:OperationManager is inited.
2015-04-13 16:37:20,691 INFO  org.apache.hive.service.AbstractService 
(SparkSQLCLIService.scala:initCompositeService(85)) - Service: SessionManager 
is inited.
2015-04-13 16:37:20,692 INFO  org.apache.hive.service.AbstractService 
(SparkSQLCLIService.scala:initCompositeService(85

[jira] [Assigned] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6880:
---

Assignee: Apache Spark

> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDD
> --
>
> Key: SPARK-6880
> URL: https://issues.apache.org/jira/browse/SPARK-6880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: CentOs6.0, java7
>Reporter: pankaj arora
>Assignee: Apache Spark
>
> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDDs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492622#comment-14492622
 ] 

Apache Spark commented on SPARK-6880:
-

User 'pankajarora12' has created a pull request for this issue:
https://github.com/apache/spark/pull/5494

> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDD
> --
>
> Key: SPARK-6880
> URL: https://issues.apache.org/jira/browse/SPARK-6880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: CentOs6.0, java7
>Reporter: pankaj arora
>
> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDDs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6880:
---

Assignee: (was: Apache Spark)

> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDD
> --
>
> Key: SPARK-6880
> URL: https://issues.apache.org/jira/browse/SPARK-6880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: CentOs6.0, java7
>Reporter: pankaj arora
>
> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDDs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread pankaj arora (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pankaj arora updated SPARK-6880:

Description: 
Spark Shutdowns with NoSuchElementException when running parallel collect on 
cachedRDDs

Below is the stack trace

15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
failed; shutting down SparkContext
java.util.NoSuchElementException: key not found: 28
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)


  was:Spark Shutdowns with NoSuchElementException when running parallel collect 
on cachedRDDs


> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDD
> --
>
> Key: SPARK-6880
> URL: https://issues.apache.org/jira/browse/SPARK-6880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: CentOs6.0, java7
>Reporter: pankaj arora
>
> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDDs
> Below is the stack trace
> 15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
> failed; shutting down SparkContext
> java.util.NoSuchElementException: key not found: 28
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$an

[jira] [Commented] (SPARK-6880) Spark Shutdowns with NoSuchElementException when running parallel collect on cachedRDD

2015-04-13 Thread pankaj arora (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492642#comment-14492642
 ] 

pankaj arora commented on SPARK-6880:
-

Sean,
Sorry for the missing stack trace. I have added it to the description.

> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDD
> --
>
> Key: SPARK-6880
> URL: https://issues.apache.org/jira/browse/SPARK-6880
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1
> Environment: CentOs6.0, java7
>Reporter: pankaj arora
>
> Spark Shutdowns with NoSuchElementException when running parallel collect on 
> cachedRDDs
> Below is the stack trace
> 15/03/27 11:12:43 ERROR DAGSchedulerActorSupervisor: eventProcesserActor 
> failed; shutting down SparkContext
> java.util.NoSuchElementException: key not found: 28
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:58)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:808)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:778)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:762)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1389)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6823) Add a model.matrix like capability to DataFrames (modelDataFrame)

2015-04-13 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492661#comment-14492661
 ] 

Shivaram Venkataraman commented on SPARK-6823:
--

I think the goal of the original JIRA on SparkR was to have a high-level API 
that'll allow users to express this. We could have this higher-level API in a 
DataFrame, or just provide a wrapper around OneHotEncoder + VectorAssembler in 
the SparkR ML integration work. The second option sounds better to me, but 
[~cafreeman] and Dan Putler have been looking at this and might be able to add 
more.

> Add a model.matrix like capability to DataFrames (modelDataFrame)
> -
>
> Key: SPARK-6823
> URL: https://issues.apache.org/jira/browse/SPARK-6823
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Shivaram Venkataraman
>
> Currently MLlib modeling tools work only with double data. However, data 
> tables in practice often have a set of categorical fields (factors in R), 
> that need to be converted to a set of 0/1 indicator variables (making the 
> data actually used in a modeling algorithm completely numeric). In R, this is 
> handled in modeling functions using the model.matrix function. Similar 
> functionality needs to be available within Spark.
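A hedged sketch of the wrapper option discussed in the comment above, using the spark.ml feature transformers; the column names and example schema are illustrative:

{code}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// df is a hypothetical DataFrame with columns: country (string), age, income (numeric)
val indexer = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")      // factor -> numeric index
val encoder = new OneHotEncoder()
  .setInputCol("countryIndex")
  .setOutputCol("countryVec")        // index -> 0/1 indicator vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income", "countryVec"))
  .setOutputCol("features")          // fully numeric "design matrix" column

// val designMatrix = assembler.transform(
//   encoder.transform(indexer.fit(df).transform(df)))
{code}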



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6859) Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-13 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492667#comment-14492667
 ] 

Cheng Lian commented on SPARK-6859:
---

[~rdblue] pointed out one fact that I missed in PARQUET-251: we need to work out 
a way to ignore (binary) min/max stats for all existing data.

So from the Spark SQL side, we have to disable filter push-down for binary columns.

> Parquet File Binary column statistics error when reuse byte[] among rows
> 
>
> Key: SPARK-6859
> URL: https://issues.apache.org/jira/browse/SPARK-6859
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0, 1.3.0, 1.4.0
>Reporter: Yijie Shen
>Priority: Minor
>
> Suppose I create a dataRDD which extends RDD\[Row\], and each row is a 
> GenericMutableRow(Array(Int, Array\[Byte\])). The same Array\[Byte\] object is 
> reused among rows but has different content each time. When I convert it to a 
> DataFrame and save it as a Parquet file, the file's row group statistics (max & 
> min) for the Binary column are wrong.
> \\
> \\
> Here is the reason: In Parquet, BinaryStatistic just keep max & min as 
> parquet.io.api.Binary references, Spark sql would generate a new Binary 
> backed by the same Array\[Byte\] passed from row.
>   
> max: Binary --(reference)--> ByteArrayBackedBinary --(backed by)--> Array\[Byte\]
> Therefore, each time parquet updating row group's statistic, max & min would 
> always refer to the same Array\[Byte\], which has new content each time. When 
> parquet decides to save it into file, the last row's content would be saved 
> as both max & min.
> \\
> \\
> It seems it is a parquet bug because it's parquet's responsibility to update 
> statistics correctly.
> But not quite sure. Should I report it as a bug in parquet JIRA? 
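For context, a minimal sketch of the reuse pattern that triggers the bad statistics; the schema, sizes, and names are illustrative, and {{sc}} is an existing SparkContext:

{code}
import org.apache.spark.sql.Row

// One shared buffer, mutated for every row within a partition.
val buf = new Array[Byte](8)
val rows = sc.parallelize(0 until 1000).map { i =>
  java.util.Arrays.fill(buf, i.toByte)   // same Array[Byte], new content each row
  Row(i, buf)
}
// After attaching a schema (IntegerType, BinaryType) and saving as Parquet,
// the row group's binary min/max both end up holding the last row's bytes.
{code}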



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6882) Spark ThriftServer2 Kerberos failed encountering java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6882:
-
Component/s: SQL

> Spark ThriftServer2 Kerberos failed encountering 
> java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
> are: [auth-int, auth-conf, auth]
> 
>
> Key: SPARK-6882
> URL: https://issues.apache.org/jira/browse/SPARK-6882
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1, 1.3.0
> Environment: * Apache Hadoop 2.4.1 with Kerberos Enabled
> * Apache Hive 0.13.1
> * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
> * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
>Reporter: Andrew Lee
>
> When Kerberos is enabled, I get the following exceptions. 
> {code}
> 2015-03-13 18:26:05,363 ERROR 
> org.apache.hive.service.cli.thrift.ThriftCLIService 
> (ThriftBinaryCLIService.java:run(93)) - Error: 
> java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
> are: [auth-int, auth-conf, auth]
> {code}
> I tried it in
> * Spark 1.2.1 git commit b6eaf77d4332bfb0a698849b1f5f917d20d70e97
> * Spark 1.3.0 rc1 commit label 0dcb5d9f31b713ed90bcec63ebc4e530cbb69851
> with
> * Apache Hive 0.13.1
> * Apache Hadoop 2.4.1
> Build command
> {code}
> mvn -U -X -Phadoop-2.4 -Pyarn -Phive -Phive-0.13.1 -Phive-thriftserver 
> -Dhadoop.version=2.4.1 -Dyarn.version=2.4.1 -Dhive.version=0.13.1 -DskipTests 
> install
> {code}
> When starting Spark ThriftServer in {{yarn-client}} mode, the command to 
> start thriftserver looks like this
> {code}
> ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 --hiveconf 
> hive.server2.thrift.bind.host=$(hostname) --master yarn-client
> {code}
> {{hostname}} points to the current hostname of the machine I'm using.
> Error message in {{spark.log}} from Spark 1.2.1 (1.2 rc1)
> {code}
> 2015-03-13 18:26:05,363 ERROR 
> org.apache.hive.service.cli.thrift.ThriftCLIService 
> (ThriftBinaryCLIService.java:run(93)) - Error: 
> java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
> are: [auth-int, auth-conf, auth]
> at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
> at 
> org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
> at 
> org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
> at 
> org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
> at java.lang.Thread.run(Thread.java:744)
> {code}
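
For readers wondering where the "null" comes from: the exception is thrown when the SASL QOP string read from the configuration is null, i.e. the hive.server2.thrift.sasl.qop setting never reaches the code doing the lookup. A rough Scala sketch of that kind of lookup (not the actual Hive implementation):

{code}
object SaslQopSketch {
  private val allowed = Set("auth", "auth-int", "auth-conf")

  // Mirrors the shape of an enum lookup that only accepts the three QOP values.
  def fromString(value: String): String =
    if (value != null && allowed(value.toLowerCase)) value.toLowerCase
    else throw new IllegalArgumentException(
      s"Unknown auth type: $value Allowed values are: [auth-int, auth-conf, auth]")

  def main(args: Array[String]): Unit = {
    fromString("auth")  // ok
    fromString(null)    // reproduces the "Unknown auth type: null" message
  }
}
{code}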
> I'm wondering if this is the same problem described in HIVE-8154 and HIVE-7620, 
> due to an older code base for the Spark ThriftServer?
> Any insights are appreciated. Currently, I can't get Spark ThriftServer2 to 
> run against a Kerberos cluster (Apache 2.4.1).
> My hive-site.xml looks like the following for spark/conf.
> The kerberos keytab and tgt are configured correctly, I'm able to connect to 
> metastore, but the subsequent steps failed due to the exception.
> {code}
> <property>
>   <name>hive.semantic.analyzer.factory.impl</name>
>   <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
> </property>
> <property>
>   <name>hive.metastore.execute.setugi</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.stats.autogather</name>
>   <value>false</value>
> </property>
> <property>
>   <name>hive.session.history.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.querylog.location</name>
>   <value>/tmp/home/hive/log/${user.name}</value>
> </property>
> <property>
>   <name>hive.exec.local.scratchdir</name>
>   <value>/tmp/hive/scratch/${user.name}</value>
> </property>
> <property>
>   <name>hive.metastore.uris</name>
>   <value>thrift://somehostname:9083</value>
> </property>
>
> <property>
>   <name>hive.server2.authentication</name>
>   <value>KERBEROS</value>
> </property>
> <property>
>   <name>hive.server2.authentication.kerberos.principal</name>
>   <value>***</value>
> </property>
> <property>
>   <name>hive.server2.authentication.kerberos.keytab</name>
>   <value>***</value>
> </property>
> <property>
>   <name>hive.server2.thrift.sasl.qop</name>
>   <value>auth</value>
>   <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
> </property>
> <property>
>   <name>hive.server2.enable.impersonation</name>
>   <description>Enable user impersonation for HiveServer2</description>
>   <value>true</value>
> </property>
>
> <property>
>   <name>hive.metastore.sasl.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.kerberos.keytab.file</name>
>   <value>***</value>
> </property>
> <property>
>   <name>hive.metastore.kerberos.principal</name>
>   <value>***</value>
> </property>
> <property>
>   <name>hive.metastore.cache.pinobjtypes</name>
>   <value>Table,Database,Type,FieldSchema,Order</value>
> </property>
> <property>
>   <name>hdfs_sentinel_file</name>
>   <value>***</value>
> </property>
> <property>
>   <name>hive.metastore.warehouse.dir</name>
>   <value>/hive</value>
> </property>
> <property>
>   <name>hive.metastore.client.socket.timeout</name>
>   <value>600</value>
> </property>
> <property>
>   <name>hive.warehouse.subdir.inherit.perms</name>
>   <value>true</value>
> </property>
> {code}
> Here, I'm attaching more detailed logs from Spark 1.3 rc1.
> {code}
> 2015-04-13 16:37:20,688 INFO  org.apache.hadoop.security.UserGroupInformation 
> (UserGroupInformation.java:loginUserFromKeytab(893)) - Login successful for 
> user hiveserver/alee-vpc2-dt.test.testhost@test.testhost.com using keytab 
> file /etc/testhost/s

[jira] [Created] (SPARK-6883) Fork pyspark's cloudpickle as a separate dependency

2015-04-13 Thread Kyle Kelley (JIRA)
Kyle Kelley created SPARK-6883:
--

 Summary: Fork pyspark's cloudpickle as a separate dependency
 Key: SPARK-6883
 URL: https://issues.apache.org/jira/browse/SPARK-6883
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Kyle Kelley


IPython, pyspark, and picloud/multyvac/cloudpipe all rely on cloudpickle from 
various sources (cloud, pyspark, and multyvac respectively). It would be 
great to have this as a separately maintained project that can:

* Work with Python3
* Add tests!
* Use higher order pickling (when on Python3)
* Be installed with pip

We're starting this off at the PyCon sprints under 
https://github.com/cloudpipe/cloudpickle. We'd like to coordinate with PySpark 
to make it work across all of the above-mentioned projects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile

2015-04-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6352:
--
Assignee: Pei-Lun Lee

> Supporting non-default OutputCommitter when using saveAsParquetFile
> ---
>
> Key: SPARK-6352
> URL: https://issues.apache.org/jira/browse/SPARK-6352
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.1, 1.2.1, 1.3.0
>Reporter: Pei-Lun Lee
>Assignee: Pei-Lun Lee
> Fix For: 1.4.0
>
>
> SPARK-3595 only handles custom OutputCommitter for saveAsHadoopFile; it would 
> be nice to have similar behavior in saveAsParquetFile. It may be difficult to 
> have a fully customizable OutputCommitter solution, but at least adding something 
> like a DirectParquetOutputCommitter and letting users choose between it and 
> the default should be enough.
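
For context, here is a rough Scala sketch of how SPARK-3595 lets callers pick a committer for saveAsHadoopFile by setting it on the job configuration; the proposal above is to offer something analogous (e.g. a DirectParquetOutputCommitter) for saveAsParquetFile. This assumes the Hadoop mapred classes are on the classpath, and the committer used here is just the stock FileOutputCommitter for illustration:

{code}
import org.apache.hadoop.mapred.{FileOutputCommitter, JobConf}

object CommitterConfigSketch {
  def main(args: Array[String]): Unit = {
    val jobConf = new JobConf()
    // saveAsHadoopFile (old mapred API) reads the committer class from the job conf,
    // so a custom committer can be swapped in like this:
    jobConf.setOutputCommitter(classOf[FileOutputCommitter])
    // ... then pass jobConf to rdd.saveAsHadoopFile(path, keyClass, valueClass, outputFormat, jobConf)
  }
}
{code}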



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6352) Supporting non-default OutputCommitter when using saveAsParquetFile

2015-04-13 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492699#comment-14492699
 ] 

Josh Rosen commented on SPARK-6352:
---

[~lian cheng], we can only assign tickets to users who have the proper role in 
Spark's JIRA permissions.  I've added [~pllee] to the "Contributors" role and 
will assign this ticket to them. 

> Supporting non-default OutputCommitter when using saveAsParquetFile
> ---
>
> Key: SPARK-6352
> URL: https://issues.apache.org/jira/browse/SPARK-6352
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.1, 1.2.1, 1.3.0
>Reporter: Pei-Lun Lee
> Fix For: 1.4.0
>
>
> SPARK-3595 only handles custom OutputCommitter for saveAsHadoopFile; it would 
> be nice to have similar behavior in saveAsParquetFile. It may be difficult to 
> have a fully customizable OutputCommitter solution, but at least adding something 
> like a DirectParquetOutputCommitter and letting users choose between it and 
> the default should be enough.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address

2015-04-13 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-6662.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Cheolsoo Park

> Allow variable substitution in spark.yarn.historyServer.address
> ---
>
> Key: SPARK-6662
> URL: https://issues.apache.org/jira/browse/SPARK-6662
> Project: Spark
>  Issue Type: Wish
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
>  Labels: yarn
> Fix For: 1.4.0
>
>
> In Spark on YARN, an explicit hostname and port number need to be set for 
> "spark.yarn.historyServer.address" in SparkConf to make the HISTORY link work. If 
> the history server address is known and static, this is usually not a problem.
> But in the cloud, that is usually not true. Particularly in EMR, the history 
> server always runs on the same node as the RM. So I could simply set it to 
> {{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution were 
> allowed.
> In fact, Hadoop configuration already implements variable substitution, so if 
> this property is read via YarnConf, this is easily achievable.
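
To illustrate the variable substitution being referred to, here is a small sketch (assuming hadoop-common on the classpath; the hostname is made up):

{code}
import org.apache.hadoop.conf.Configuration

object SubstitutionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration(false)
    conf.set("yarn.resourcemanager.hostname", "rm-node.example.com")
    // The value the ticket would like to be able to put in SparkConf:
    conf.set("spark.yarn.historyServer.address", "${yarn.resourcemanager.hostname}:18080")
    // Hadoop's Configuration expands ${...} references against other properties on read:
    println(conf.get("spark.yarn.historyServer.address")) // rm-node.example.com:18080
  }
}
{code}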



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492839#comment-14492839
 ] 

Max Kaznady commented on SPARK-3727:


I implemented the same thing but for PySpark. Since there is no existing 
function, should I just call the function "predict_proba" like in sklearn? 

Also, does it make sense to open a new ticket for this, since it's so specific?

Thanks,
Max

> DecisionTree, RandomForest: More prediction functionality
> -
>
> Key: SPARK-3727
> URL: https://issues.apache.org/jira/browse/SPARK-3727
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> DecisionTree and RandomForest currently predict the most likely label for 
> classification and the mean for regression.  Other info about predictions 
> would be useful.
> For classification: estimated probability of each possible label
> For regression: variance of estimate
> RandomForest could also create aggregate predictions in multiple ways:
> * Predict mean or median value for regression.
> * Compute variance of estimates (across all trees) for both classification 
> and regression.
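
As a concrete illustration of the last two bullets, here is a small Scala sketch (not tied to the MLlib API; the per-tree predictions are plain numbers here) of aggregating per-tree regression predictions into a mean, a median, and a variance:

{code}
object EnsembleAggregationSketch {
  // Per-tree predictions for one example; in MLlib these would come from the ensemble's trees.
  def aggregate(treePredictions: Seq[Double]): (Double, Double, Double) = {
    val n      = treePredictions.size
    val mean   = treePredictions.sum / n
    val sorted = treePredictions.sorted
    val median = if (n % 2 == 1) sorted(n / 2) else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
    val variance = treePredictions.map(p => (p - mean) * (p - mean)).sum / n
    (mean, median, variance)
  }

  def main(args: Array[String]): Unit = {
    println(aggregate(Seq(1.0, 2.0, 2.0, 5.0)))  // (2.5, 2.0, 2.25)
  }
}
{code}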



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5988) Model import/export for PowerIterationClusteringModel

2015-04-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5988.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5450
[https://github.com/apache/spark/pull/5450]

> Model import/export for PowerIterationClusteringModel
> -
>
> Key: SPARK-5988
> URL: https://issues.apache.org/jira/browse/SPARK-5988
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
> Fix For: 1.4.0
>
>
> Add save/load for PowerIterationClusteringModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Max Kaznady (JIRA)
Max Kaznady created SPARK-6884:
--

 Summary: random forest predict probabilities functionality (like 
in sklearn)
 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
 Environment: cross-platform
Reporter: Max Kaznady


Currently, there is no way to extract the class probabilities from the 
RandomForest classifier. I implemented a probability predictor by counting 
votes from individual trees and adding up their votes for "1" and then dividing 
by the total number of votes.

I opened this ticket to keep track of changes. Will update once I push my code 
to master.
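
A minimal Scala sketch of the vote-counting idea described above (illustrative only; it assumes you can obtain one class vote per tree for a given example, and is not the eventual Spark API):

{code}
object VoteProbabilitySketch {
  // Each tree votes a class label for the example; the probability of class "1.0"
  // is its share of the votes, as described in the ticket.
  def probabilityOfOne(treeVotes: Seq[Double]): Double =
    treeVotes.count(_ == 1.0).toDouble / treeVotes.size

  def main(args: Array[String]): Unit = {
    val votes = Seq(1.0, 0.0, 1.0, 1.0, 0.0)  // votes from 5 trees
    println(probabilityOfOne(votes))          // 0.6
  }
}
{code}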



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492867#comment-14492867
 ] 

Joseph K. Bradley commented on SPARK-6682:
--

Do you mean (a) tests to make sure the examples work, or (b) treating the 
examples as tests themselves?  We should not do (b) since it mixes tests and 
examples.

For (a), we don't have a great solution currently, although I think we should 
(at some point) add a script for running all of the examples to make sure they 
run.  I don't think we need performance tests for examples since they are meant 
to be short usage examples, not end solutions or applications.

> Deprecate static train and use builder instead for Scala/Java
> -
>
> Key: SPARK-6682
> URL: https://issues.apache.org/jira/browse/SPARK-6682
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> In MLlib, we have for some time been unofficially moving away from the old 
> static train() methods and moving towards builder patterns.  This JIRA is to 
> discuss this move and (hopefully) make it official.
> "Old static train()" API:
> {code}
> val myModel = NaiveBayes.train(myData, ...)
> {code}
> "New builder pattern" API:
> {code}
> val nb = new NaiveBayes().setLambda(0.1)
> val myModel = nb.train(myData)
> {code}
> Pros of the builder pattern:
> * Much less code when algorithms have many parameters.  Since Java does not 
> support default arguments, we required *many* duplicated static train() 
> methods (for each prefix set of arguments).
> * Helps to enforce default parameters.  Users should ideally not have to even 
> think about setting parameters if they just want to try an algorithm quickly.
> * Matches spark.ml API
> Cons of the builder pattern:
> * In Python APIs, static train methods are more "Pythonic."
> Proposal:
> * Scala/Java: We should start deprecating the old static train() methods.  We 
> must keep them for API stability, but deprecating will help with API 
> consistency, making it clear that everyone should use the builder pattern.  
> As we deprecate them, we should make sure that the builder pattern supports 
> all parameters.
> * Python: Keep static train methods.
> CC: [~mengxr]
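
A tiny sketch of what the Scala/Java side of the proposal could look like in practice, i.e. keeping the static entry point for compatibility while steering users to the builder (a hypothetical class, not the actual MLlib code):

{code}
class NaiveBayesLike {                      // hypothetical builder-style estimator
  private var lambda: Double = 1.0
  def setLambda(value: Double): this.type = { lambda = value; this }
  def train(data: Seq[Double]): String = s"model(lambda=$lambda, n=${data.size})"
}

object NaiveBayesLike {
  @deprecated("Use new NaiveBayesLike().setLambda(...).train(data) instead", "1.4.0")
  def train(data: Seq[Double], lambda: Double = 1.0): String =
    new NaiveBayesLike().setLambda(lambda).train(data)
}
{code}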



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492871#comment-14492871
 ] 

Max Kaznady commented on SPARK-3727:


I thought it would be more fitting to separate this: 
https://issues.apache.org/jira/browse/SPARK-6884

> DecisionTree, RandomForest: More prediction functionality
> -
>
> Key: SPARK-3727
> URL: https://issues.apache.org/jira/browse/SPARK-3727
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> DecisionTree and RandomForest currently predict the most likely label for 
> classification and the mean for regression.  Other info about predictions 
> would be useful.
> For classification: estimated probability of each possible label
> For regression: variance of estimate
> RandomForest could also create aggregate predictions in multiple ways:
> * Predict mean or median value for regression.
> * Compute variance of estimates (across all trees) for both classification 
> and regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492868#comment-14492868
 ] 

Max Kaznady commented on SPARK-6884:


Implemented a prototype, testing mapReduce code.

> random forest predict probabilities functionality (like in sklearn)
> ---
>
> Key: SPARK-6884
> URL: https://issues.apache.org/jira/browse/SPARK-6884
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.4.0
> Environment: cross-platform
>Reporter: Max Kaznady
>  Labels: prediction, probability, randomforest, tree
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Currently, there is no way to extract the class probabilities from the 
> RandomForest classifier. I implemented a probability predictor by counting 
> votes from individual trees and adding up their votes for "1" and then 
> dividing by the total number of votes.
> I opened this ticket to keep track of changes. Will update once I push my 
> code to master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6884:
-
Affects Version/s: (was: 1.4.0)
   1.3.0

> random forest predict probabilities functionality (like in sklearn)
> ---
>
> Key: SPARK-6884
> URL: https://issues.apache.org/jira/browse/SPARK-6884
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: cross-platform
>Reporter: Max Kaznady
>  Labels: prediction, probability, randomforest, tree
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Currently, there is no way to extract the class probabilities from the 
> RandomForest classifier. I implemented a probability predictor by counting 
> votes from individual trees and adding up their votes for "1" and then 
> dividing by the total number of votes.
> I opened this ticket to keep track of changes. Will update once I push my 
> code to master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492887#comment-14492887
 ] 

Joseph K. Bradley commented on SPARK-3727:
--

Thanks for your initial works on this ticket!  The main issue with this 
extension is API stability: Modifying the existing classes will also make us 
have to update model save/load versioning, default constructors to ensure 
binary compatibility, etc.

I just linked a JIRA which discusses updating the tree and ensemble APIs under 
the spark.ml package, which will permit us to redesign the APIs (and make it 
easier to specify class probabilities or stats for regression).  What I'd like 
to do is get the tree API updates in (this week), and then we could work 
together to make the class probabilities available under the new API.

Does that sound good?

Also, if you're new to contributing to Spark, please make sure to check out: 
[https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark]

Thanks!

> DecisionTree, RandomForest: More prediction functionality
> -
>
> Key: SPARK-3727
> URL: https://issues.apache.org/jira/browse/SPARK-3727
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> DecisionTree and RandomForest currently predict the most likely label for 
> classification and the mean for regression.  Other info about predictions 
> would be useful.
> For classification: estimated probability of each possible label
> For regression: variance of estimate
> RandomForest could also create aggregate predictions in multiple ways:
> * Predict mean or median value for regression.
> * Compute variance of estimates (across all trees) for both classification 
> and regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6703:
---
Priority: Critical  (was: Major)

> Provide a way to discover existing SparkContext's
> -
>
> Key: SPARK-6703
> URL: https://issues.apache.org/jira/browse/SPARK-6703
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Ilya Ganelin
>Priority: Critical
>
> Right now it is difficult to write a Spark application in a way that can be 
> run independently and also be composed with other Spark applications in an 
> environment such as the JobServer, notebook servers, etc where there is a 
> shared SparkContext.
> It would be nice to provide a rendez-vous point so that applications can 
> learn whether an existing SparkContext already exists before creating one.
> The simplest, most surgical way I see to do this is to have an optional static 
> SparkContext singleton that people can retrieve as follows:
> {code}
> val sc = SparkContext.getOrCreate(conf = new SparkConf())
> {code}
> And you could also have a setter where some outer framework/server can set it 
> for use by multiple downstream applications.
> A more advanced version of this would have some named registry or something, 
> but since we only support a single SparkContext in one JVM at this point 
> anyways, this seems sufficient and much simpler. Another advanced option 
> would be to allow plugging in some other notion of configuration you'd pass 
> when retrieving an existing context.
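
A rough sketch of the kind of rendez-vous point being proposed: a singleton with a setter for the outer framework and a getOrCreate for applications. This is illustrative only and deliberately generic (AnyRef stands in for SparkContext), not Spark's actual implementation:

{code}
object ContextRegistrySketch {
  private var active: Option[AnyRef] = None   // would hold the shared SparkContext

  // Outer framework/server (JobServer, notebook server, ...) registers the context it owns.
  def setActive(ctx: AnyRef): Unit = synchronized { active = Some(ctx) }

  // Applications reuse the shared context if one exists, otherwise create and register their own.
  def getOrCreate(create: () => AnyRef): AnyRef = synchronized {
    active.getOrElse { val ctx = create(); active = Some(ctx); ctx }
  }
}
{code}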



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492888#comment-14492888
 ] 

Patrick Wendell commented on SPARK-6703:


Hey [~ilganeli] - sure thing. I've pinged a couple of people to provide 
feedback on the design. Overall I think it won't be a complicated feature to 
implement. I've added you as the assignee. One note, if it gets very close to 
the 1.4 code freeze I may need to help take it across the finish line. But for 
now why don't you go ahead, I think we'll be fine.

> Provide a way to discover existing SparkContext's
> -
>
> Key: SPARK-6703
> URL: https://issues.apache.org/jira/browse/SPARK-6703
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>
> Right now it is difficult to write a Spark application in a way that can be 
> run independently and also be composed with other Spark applications in an 
> environment such as the JobServer, notebook servers, etc where there is a 
> shared SparkContext.
> It would be nice to provide a rendez-vous point so that applications can 
> learn whether an existing SparkContext already exists before creating one.
> The simplest, most surgical way I see to do this is to have an optional static 
> SparkContext singleton that people can retrieve as follows:
> {code}
> val sc = SparkContext.getOrCreate(conf = new SparkConf())
> {code}
> And you could also have a setter where some outer framework/server can set it 
> for use by multiple downstream applications.
> A more advanced version of this would have some named registry or something, 
> but since we only support a single SparkContext in one JVM at this point 
> anyways, this seems sufficient and much simpler. Another advanced option 
> would be to allow plugging in some other notion of configuration you'd pass 
> when retrieving an existing context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6703:
---
Assignee: Ilya Ganelin

> Provide a way to discover existing SparkContext's
> -
>
> Key: SPARK-6703
> URL: https://issues.apache.org/jira/browse/SPARK-6703
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Ilya Ganelin
>
> Right now it is difficult to write a Spark application in a way that can be 
> run independently and also be composed with other Spark applications in an 
> environment such as the JobServer, notebook servers, etc where there is a 
> shared SparkContext.
> It would be nice to provide a rendez-vous point so that applications can 
> learn whether an existing SparkContext already exists before creating one.
> The simplest, most surgical way I see to do this is to have an optional static 
> SparkContext singleton that people can retrieve as follows:
> {code}
> val sc = SparkContext.getOrCreate(conf = new SparkConf())
> {code}
> And you could also have a setter where some outer framework/server can set it 
> for use by multiple downstream applications.
> A more advanced version of this would have some named registry or something, 
> but since we only support a single SparkContext in one JVM at this point 
> anyways, this seems sufficient and much simpler. Another advanced option 
> would be to allow plugging in some other notion of configuration you'd pass 
> when retrieving an existing context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492892#comment-14492892
 ] 

Joseph K. Bradley commented on SPARK-6884:
--

Is this not a duplicate of [SPARK-3727]?  Perhaps the best way to split up the 
work will be to make a subtask for trees, and a separate subtask for ensembles. 
 I'll go ahead and do that.

> random forest predict probabilities functionality (like in sklearn)
> ---
>
> Key: SPARK-6884
> URL: https://issues.apache.org/jira/browse/SPARK-6884
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: cross-platform
>Reporter: Max Kaznady
>  Labels: prediction, probability, randomforest, tree
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Currently, there is no way to extract the class probabilities from the 
> RandomForest classifier. I implemented a probability predictor by counting 
> votes from individual trees and adding up their votes for "1" and then 
> dividing by the total number of votes.
> I opened this ticket to keep track of changes. Will update once I push my 
> code to master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6885) Decision trees: predict class probabilities

2015-04-13 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6885:


 Summary: Decision trees: predict class probabilities
 Key: SPARK-6885
 URL: https://issues.apache.org/jira/browse/SPARK-6885
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


Under spark.ml, have DecisionTreeClassifier (currently being added) extend 
ProbabilisticClassifier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6884) Random forest: predict class probabilities

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6884:
-
Summary: Random forest: predict class probabilities  (was: random forest 
predict probabilities functionality (like in sklearn))

> Random forest: predict class probabilities
> --
>
> Key: SPARK-6884
> URL: https://issues.apache.org/jira/browse/SPARK-6884
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: cross-platform
>Reporter: Max Kaznady
>  Labels: prediction, probability, randomforest, tree
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Currently, there is no way to extract the class probabilities from the 
> RandomForest classifier. I implemented a probability predictor by counting 
> votes from individual trees and adding up their votes for "1" and then 
> dividing by the total number of votes.
> I opened this ticket to keep track of changes. Will update once I push my 
> code to master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6884:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-3727

> random forest predict probabilities functionality (like in sklearn)
> ---
>
> Key: SPARK-6884
> URL: https://issues.apache.org/jira/browse/SPARK-6884
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: cross-platform
>Reporter: Max Kaznady
>  Labels: prediction, probability, randomforest, tree
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Currently, there is no way to extract the class probabilities from the 
> RandomForest classifier. I implemented a probability predictor by counting 
> votes from individual trees and adding up their votes for "1" and then 
> dividing by the total number of votes.
> I opened this ticket to keep track of changes. Will update once I push my 
> code to master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492898#comment-14492898
 ] 

Patrick Wendell commented on SPARK-6703:


/cc [~velvia]

> Provide a way to discover existing SparkContext's
> -
>
> Key: SPARK-6703
> URL: https://issues.apache.org/jira/browse/SPARK-6703
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Ilya Ganelin
>Priority: Critical
>
> Right now it is difficult to write a Spark application in a way that can be 
> run independently and also be composed with other Spark applications in an 
> environment such as the JobServer, notebook servers, etc where there is a 
> shared SparkContext.
> It would be nice to provide a rendez-vous point so that applications can 
> learn whether an existing SparkContext already exists before creating one.
> The simplest, most surgical way I see to do this is to have an optional static 
> SparkContext singleton that people can retrieve as follows:
> {code}
> val sc = SparkContext.getOrCreate(conf = new SparkConf())
> {code}
> And you could also have a setter where some outer framework/server can set it 
> for use by multiple downstream applications.
> A more advanced version of this would have some named registry or something, 
> but since we only support a single SparkContext in one JVM at this point 
> anyways, this seems sufficient and much simpler. Another advanced option 
> would be to allow plugging in some other notion of configuration you'd pass 
> when retrieving an existing context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492906#comment-14492906
 ] 

Max Kaznady commented on SPARK-3727:


Yes, probabilities have to be added to other models too, like 
LogisticRegression. Right now they are hardcoded in two places but not 
outputted in PySpark.

I think it makes sense to split into PySpark, then classification, then 
probabilities, and then group different types of algorithms, all of which 
output probabilities: Logistic Regression, Random Forest, etc.

Can also add probabilities for trees by counting the number of leaf 1's and 0's.

What do you think?

> DecisionTree, RandomForest: More prediction functionality
> -
>
> Key: SPARK-3727
> URL: https://issues.apache.org/jira/browse/SPARK-3727
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> DecisionTree and RandomForest currently predict the most likely label for 
> classification and the mean for regression.  Other info about predictions 
> would be useful.
> For classification: estimated probability of each possible label
> For regression: variance of estimate
> RandomForest could also create aggregate predictions in multiple ways:
> * Predict mean or median value for regression.
> * Compute variance of estimates (across all trees) for both classification 
> and regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6884) Random forest: predict class probabilities

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492904#comment-14492904
 ] 

Joseph K. Bradley commented on SPARK-6884:
--

I'd recommend: Under spark.ml, have RandomForestClassifier (currently being 
added) extend ProbabilisticClassifier.

> Random forest: predict class probabilities
> --
>
> Key: SPARK-6884
> URL: https://issues.apache.org/jira/browse/SPARK-6884
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: cross-platform
>Reporter: Max Kaznady
>  Labels: prediction, probability, randomforest, tree
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Currently, there is no way to extract the class probabilities from the 
> RandomForest classifier. I implemented a probability predictor by counting 
> votes from individual trees and adding up their votes for "1" and then 
> dividing by the total number of votes.
> I opened this ticket to keep track of changes. Will update once I push my 
> code to master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3727) Trees and ensembles: More prediction functionality

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3727:
-
Summary: Trees and ensembles: More prediction functionality  (was: 
DecisionTree, RandomForest: More prediction functionality)

> Trees and ensembles: More prediction functionality
> --
>
> Key: SPARK-3727
> URL: https://issues.apache.org/jira/browse/SPARK-3727
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> DecisionTree and RandomForest currently predict the most likely label for 
> classification and the mean for regression.  Other info about predictions 
> would be useful.
> For classification: estimated probability of each possible label
> For regression: variance of estimate
> RandomForest could also create aggregate predictions in multiple ways:
> * Predict mean or median value for regression.
> * Compute variance of estimates (across all trees) for both classification 
> and regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492909#comment-14492909
 ] 

Ilya Ganelin commented on SPARK-6703:
-

Patrick - what's the timeline for the 1.4 release? Just want to have a 
sense for it so I can schedule accordingly.

Thank you, 
Ilya Ganelin



> Provide a way to discover existing SparkContext's
> -
>
> Key: SPARK-6703
> URL: https://issues.apache.org/jira/browse/SPARK-6703
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Patrick Wendell
>Assignee: Ilya Ganelin
>Priority: Critical
>
> Right now it is difficult to write a Spark application in a way that can be 
> run independently and also be composed with other Spark applications in an 
> environment such as the JobServer, notebook servers, etc where there is a 
> shared SparkContext.
> It would be nice to provide a rendez-vous point so that applications can 
> learn whether an existing SparkContext already exists before creating one.
> The simplest, most surgical way I see to do this is to have an optional static 
> SparkContext singleton that people can retrieve as follows:
> {code}
> val sc = SparkContext.getOrCreate(conf = new SparkConf())
> {code}
> And you could also have a setter where some outer framework/server can set it 
> for use by multiple downstream applications.
> A more advanced version of this would have some named registry or something, 
> but since we only support a single SparkContext in one JVM at this point 
> anyways, this seems sufficient and much simpler. Another advanced option 
> would be to allow plugging in some other notion of configuration you'd pass 
> when retrieving an existing context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6884) Random forest: predict class probabilities

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6884:
-
Assignee: (was: Max Kaznady)

> Random forest: predict class probabilities
> --
>
> Key: SPARK-6884
> URL: https://issues.apache.org/jira/browse/SPARK-6884
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: cross-platform
>Reporter: Max Kaznady
>  Labels: prediction, probability, randomforest, tree
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Currently, there is no way to extract the class probabilities from the 
> RandomForest classifier. I implemented a probability predictor by counting 
> votes from individual trees and adding up their votes for "1" and then 
> dividing by the total number of votes.
> I opened this ticket to keep track of changes. Will update once I push my 
> code to master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6884) Random forest: predict class probabilities

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6884:
-
Assignee: Max Kaznady

> Random forest: predict class probabilities
> --
>
> Key: SPARK-6884
> URL: https://issues.apache.org/jira/browse/SPARK-6884
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
> Environment: cross-platform
>Reporter: Max Kaznady
>Assignee: Max Kaznady
>  Labels: prediction, probability, randomforest, tree
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Currently, there is no way to extract the class probabilities from the 
> RandomForest classifier. I implemented a probability predictor by counting 
> votes from individual trees and adding up their votes for "1" and then 
> dividing by the total number of votes.
> I opened this ticket to keep track of changes. Will update once I push my 
> code to master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3727) Trees and ensembles: More prediction functionality

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492928#comment-14492928
 ] 

Joseph K. Bradley commented on SPARK-3727:
--

[~maxkaznady] [~mqk] I split this into some subtasks, and we can add others 
later (for boosted trees, for regression, etc.).  It will be great if you can 
follow the spark.ml tree API JIRA (linked above) and take a look at it once 
it's posted.  That (and the ProbabilisticClassifier class) will give you an 
idea of what's entailed in adding these under the Pipelines API.

Do you have preferences on how to split up these tasks?  If you can figure that 
out, I'll be happy to assign them.  Thanks!

> Trees and ensembles: More prediction functionality
> --
>
> Key: SPARK-3727
> URL: https://issues.apache.org/jira/browse/SPARK-3727
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> DecisionTree and RandomForest currently predict the most likely label for 
> classification and the mean for regression.  Other info about predictions 
> would be useful.
> For classification: estimated probability of each possible label
> For regression: variance of estimate
> RandomForest could also create aggregate predictions in multiple ways:
> * Predict mean or median value for regression.
> * Compute variance of estimates (across all trees) for both classification 
> and regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3727) Trees and ensembles: More prediction functionality

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492931#comment-14492931
 ] 

Joseph K. Bradley commented on SPARK-3727:
--

[~maxkaznady] Implementations should be done in Scala; the PySpark API will be 
a wrapper.  The API update JIRA I'm referencing should clear up some of the 
other questions.

> Trees and ensembles: More prediction functionality
> --
>
> Key: SPARK-3727
> URL: https://issues.apache.org/jira/browse/SPARK-3727
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> DecisionTree and RandomForest currently predict the most likely label for 
> classification and the mean for regression.  Other info about predictions 
> would be useful.
> For classification: estimated probability of each possible label
> For regression: variance of estimate
> RandomForest could also create aggregate predictions in multiple ways:
> * Predict mean or median value for regression.
> * Compute variance of estimates (across all trees) for both classification 
> and regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492959#comment-14492959
 ] 

Max Kaznady commented on SPARK-6113:


[~josephkb] Is it possible to host the API Design doc on something other than 
Google Docs? My company's (and many others') policies forbid access to Google 
Docs, so I cannot download the file.

> Stabilize DecisionTree and ensembles APIs
> -
>
> Key: SPARK-6113
> URL: https://issues.apache.org/jira/browse/SPARK-6113
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> *Issue*: The APIs for DecisionTree and ensembles (RandomForests and 
> GradientBoostedTrees) have been experimental for a long time.  The API has 
> become very convoluted because trees and ensembles have many, many variants, 
> some of which we have added incrementally without a long-term design.
> *Proposal*: This JIRA is for discussing changes required to finalize the 
> APIs.  After we discuss, I will make a PR to update the APIs and make them 
> non-Experimental.  This will require making many breaking changes; see the 
> design doc for details.
> [Design doc | 
> https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]:
>  This outlines current issues and the proposed API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


