[jira] [Assigned] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-1503: Assignee: Xiangrui Meng Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166465#comment-14166465 ] Xiangrui Meng commented on SPARK-1503: -- [~staple] Thanks for picking up this JIRA! TFOCS is a good place to start. We can support AT (Auslender and Teboulle) update, line search, and restart in the first version. It would be nice to take generic composite objective functions. Please note that this could become a big task. We definitely need to go through the design first. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
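At its core, Nesterov's method takes each gradient step from an extrapolated (momentum) point rather than from the current iterate. A minimal single-machine sketch in Scala with Breeze is shown below; the object name, the fixed step size, and the FISTA-style momentum schedule are illustrative assumptions only, not MLlib or TFOCS API. A distributed version would compute each gradient with an aggregation over the data RDD, much as the existing SGD optimizer does.
{code}
import breeze.linalg.{DenseVector => BDV}

// Minimal sketch of Nesterov's accelerated gradient method for a smooth,
// unconstrained objective. All names and the fixed step size are assumptions
// made for illustration.
object NesterovSketch {
  def minimize(
      gradient: BDV[Double] => BDV[Double], // gradient of the objective
      init: BDV[Double],
      stepSize: Double,
      numIterations: Int): BDV[Double] = {
    var x = init.copy // current iterate
    var y = init.copy // extrapolated (momentum) point
    var t = 1.0       // momentum coefficient
    for (_ <- 0 until numIterations) {
      val xNext = y - gradient(y) * stepSize        // gradient step at the extrapolated point
      val tNext = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
      y = xNext + (xNext - x) * ((t - 1.0) / tNext) // Nesterov extrapolation
      x = xNext
      t = tNext
    }
    x
  }
}
{code}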
[jira] [Updated] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1503: - Assignee: Aaron Staple (was: Xiangrui Meng) Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3897) Scala style: format example code
sjk created SPARK-3897: -- Summary: Scala style: format example code Key: SPARK-3897 URL: https://issues.apache.org/jira/browse/SPARK-3897 Project: Spark Issue Type: Sub-task Reporter: sjk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3897) Scala style: format example code
[ https://issues.apache.org/jira/browse/SPARK-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sjk updated SPARK-3897: --- https://github.com/apache/spark/pull/2754 Scala style: format example code Key: SPARK-3897 URL: https://issues.apache.org/jira/browse/SPARK-3897 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: sjk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3897) Scala style: format example code
[ https://issues.apache.org/jira/browse/SPARK-3897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sjk updated SPARK-3897: --- Description: https://github.com/apache/spark/pull/2754 Scala style: format example code Key: SPARK-3897 URL: https://issues.apache.org/jira/browse/SPARK-3897 Project: Spark Issue Type: Sub-task Components: Project Infra Reporter: sjk https://github.com/apache/spark/pull/2754 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3421) StructField.toString should quote the name field to allow arbitrary character as struct field name
[ https://issues.apache.org/jira/browse/SPARK-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-3421. --- Resolution: Fixed StructField.toString should quote the name field to allow arbitrary character as struct field name -- Key: SPARK-3421 URL: https://issues.apache.org/jira/browse/SPARK-3421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian The original use case is something like this: {code} // JSON snippet with illegal characters in field names val json = """{"a(b)": {"c(d)": "hello"}}""" :: """{"a(b)": {"c(d)": "world"}}""" :: Nil val jsonSchemaRdd = sqlContext.jsonRDD(sparkContext.makeRDD(json)) jsonSchemaRdd.saveAsParquetFile("/tmp/file.parquet") java.lang.Exception: java.lang.RuntimeException: Unsupported dataType: StructType(ArrayBuffer(StructField(a(b),StructType(ArrayBuffer(StructField(c(d),StringType,true))),true))), [1.37] failure: `,' expected but `(' found {code} The reason is that the {{DataType}} parser only allows {{\[a-zA-Z0-9_\]*}} as a struct field name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2805) Update akka to version 2.3.4
[ https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2805. Resolution: Fixed We've merged again with some modifications and we'll see if it works well in the maven builds. Update akka to version 2.3.4 Key: SPARK-2805 URL: https://issues.apache.org/jira/browse/SPARK-2805 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati Assignee: Anand Avati Fix For: 1.2.0 akka-2.3 is the lowest version available in Scala 2.11 akka-2.3 depends on protobuf 2.5. Hadoop-1 requires protobuf 2.4.1. In order to reconcile the conflicting dependencies, need to release akka-2.3.x-shaded-protobuf artifact which has protobuf 2.5 within. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason
[ https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166555#comment-14166555 ] Denis Serduik commented on SPARK-2019: -- I have noticed the same problem with worker behavior. My installation: Spark 1.0.2-hadoop2.0.0-mr1-cdh4.2.0 on Mesos 0.13. In my case, workers fail when there is an error while serializing the closure. Spark workers die/disappear when job fails for nearly any reason Key: SPARK-2019 URL: https://issues.apache.org/jira/browse/SPARK-2019 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: sam We either have to reboot all the nodes, or run 'sudo service spark-worker restart' across our cluster. I don't think this should happen - the job failures are often not even that bad. There is a 5-upvote SO question here: http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails We shouldn't be giving restart privileges to our devs, and therefore our sysadm has to frequently restart the workers. When the sysadm is not around, there is nothing our devs can do. Many thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason
[ https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166555#comment-14166555 ] Denis Serduik edited comment on SPARK-2019 at 10/10/14 8:39 AM: I have noticed the same problem with workers behavior. My installation: Spark 1.0.2-hadoop2.0.0-mr1-cdh4.2.0 on Mesos 0.13. In my case, workers fail when there was an error while serialization the closure. Also please notice that we run Spark in coarse-grained mode was (Author: dmaverick): I have noticed the same problem with workers behavior. My installation: Spark 1.0.2-hadoop2.0.0-mr1-cdh4.2.0 on Mesos 0.13. In my case, workers fail when there was an error while serialization the closure. Spark workers die/disappear when job fails for nearly any reason Key: SPARK-2019 URL: https://issues.apache.org/jira/browse/SPARK-2019 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: sam We either have to reboot all the nodes, or run 'sudo service spark-worker restart' across our cluster. I don't think this should happen - the job failures are often not even that bad. There is a 5 upvoted SO question here: http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails We shouldn't be giving restart privileges to our devs, and therefore our sysadm has to frequently restart the workers. When the sysadm is not around, there is nothing our devs can do. Many thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2019) Spark workers die/disappear when job fails for nearly any reason
[ https://issues.apache.org/jira/browse/SPARK-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166555#comment-14166555 ] Denis Serduik edited comment on SPARK-2019 at 10/10/14 8:40 AM: I have noticed the same problem with workers behavior. My installation: Spark 1.0.2-hadoop2.0.0-mr1-cdh4.2.0 on Mesos 0.13. In my case, workers fail when there was an error while serialization the closure. Also please note, we run Spark in coarse-grained mode was (Author: dmaverick): I have noticed the same problem with workers behavior. My installation: Spark 1.0.2-hadoop2.0.0-mr1-cdh4.2.0 on Mesos 0.13. In my case, workers fail when there was an error while serialization the closure. Also please notice that we run Spark in coarse-grained mode Spark workers die/disappear when job fails for nearly any reason Key: SPARK-2019 URL: https://issues.apache.org/jira/browse/SPARK-2019 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: sam We either have to reboot all the nodes, or run 'sudo service spark-worker restart' across our cluster. I don't think this should happen - the job failures are often not even that bad. There is a 5 upvoted SO question here: http://stackoverflow.com/questions/22031006/spark-0-9-0-worker-keeps-dying-in-standalone-mode-when-job-fails We shouldn't be giving restart privileges to our devs, and therefore our sysadm has to frequently restart the workers. When the sysadm is not around, there is nothing our devs can do. Many thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2348) In Windows having a enviorinment variable named 'classpath' gives error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166559#comment-14166559 ] AaronLin commented on SPARK-2348: - Why hasn't this issue been solved yet? Can anyone help? In Windows having a enviorinment variable named 'classpath' gives error --- Key: SPARK-2348 URL: https://issues.apache.org/jira/browse/SPARK-2348 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Windows 7 Enterprise Reporter: Chirag Todarka Assignee: Chirag Todarka Operating System: Windows 7 Enterprise. If there is an environment variable named 'classpath', then starting 'spark-shell' gives the error below: mydir\spark\bin>spark-shell Failed to initialize compiler: object scala.runtime in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler accessed before init set up. Assuming no postInit code. Failed to initialize compiler: object scala.runtime in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. Exception in thread "main" java.lang.AssertionError: assertion failed: null at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:202) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:929) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
[ https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3889. Resolution: Fixed Fix Version/s: 1.2.0 JVM dies with SIGBUS, resulting in ConnectionManager failed ACK --- Key: SPARK-3889 URL: https://issues.apache.org/jira/browse/SPARK-3889 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Fix For: 1.2.0 Here's the first part of the core dump, possibly caused by a job which shuffles a lot of very small partitions. {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread Executor task launch worker-170 daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multipleid=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2348) In Windows having a enviorinment variable named 'classpath' gives error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166566#comment-14166566 ] AaronLin commented on SPARK-2348: - I encounter this issue in sbt building spark1.1.0 (windows7 os), i solved this issue by changing one line in spark-class2.cmd --old- set JAVA_OPTS=-XX:MaxPermSize=128m %OUR_JAVA_OPTS% -Xms%OUR_JAVA_MEM% -Xmx%OUR_JAVA_MEM% -new-- set JAVA_OPTS=%OUR_JAVA_OPTS% -Djava.library.path=%SPARK_LIBRARY_PATH% -Dscala.usejavacp=true -Xms%OUR_JAVA_MEM% -Xmx%OUR_JAVA_MEM% --end it works. In Windows having a enviorinment variable named 'classpath' gives error --- Key: SPARK-2348 URL: https://issues.apache.org/jira/browse/SPARK-2348 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Windows 7 Enterprise Reporter: Chirag Todarka Assignee: Chirag Todarka Operating System:: Windows 7 Enterprise If having enviorinment variable named 'classpath' gives then starting 'spark-shell' gives below error:: mydir\spark\binspark-shell Failed to initialize compiler: object scala.runtime in compiler mirror not found . ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler acces sed before init set up. Assuming no postInit code. Failed to initialize compiler: object scala.runtime in compiler mirror not found . ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. Exception in thread main java.lang.AssertionError: assertion failed: null at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca la:202) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar kILoop.scala:929) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop. scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop. scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass Loader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3898) History Web UI display incorrectly.
zzc created SPARK-3898: -- Summary: History Web UI display incorrectly. Key: SPARK-3898 URL: https://issues.apache.org/jira/browse/SPARK-3898 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Environment: Spark 1.2.0-snapshot On Yarn Reporter: zzc After successfully running a Spark application, the history web UI displays incorrectly: App Name: Not Started Started: 1970/01/01 07:59:59 Spark User: Not Started Last Updated: 2014/10/10 14:50:39 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3898) History Web UI display incorrectly.
[ https://issues.apache.org/jira/browse/SPARK-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zzc updated SPARK-3898: --- Fix Version/s: 1.2.0 History Web UI display incorrectly. --- Key: SPARK-3898 URL: https://issues.apache.org/jira/browse/SPARK-3898 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Environment: Spark 1.2.0-snapshot On Yarn Reporter: zzc Fix For: 1.1.1, 1.2.0 After successfully running a Spark application, the history web UI displays incorrectly: App Name: Not Started Started: 1970/01/01 07:59:59 Spark User: Not Started Last Updated: 2014/10/10 14:50:39 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3898) History Web UI display incorrectly.
[ https://issues.apache.org/jira/browse/SPARK-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zzc updated SPARK-3898: --- Fix Version/s: 1.1.1 History Web UI display incorrectly. --- Key: SPARK-3898 URL: https://issues.apache.org/jira/browse/SPARK-3898 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Environment: Spark 1.2.0-snapshot On Yarn Reporter: zzc Fix For: 1.1.1, 1.2.0 After successfully running a Spark application, the history web UI displays incorrectly: App Name: Not Started Started: 1970/01/01 07:59:59 Spark User: Not Started Last Updated: 2014/10/10 14:50:39 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3826) enable hive-thriftserver support hive-0.13.1
[ https://issues.apache.org/jira/browse/SPARK-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-3826: --- Affects Version/s: (was: 1.1.1) 1.1.0 enable hive-thriftserver support hive-0.13.1 Key: SPARK-3826 URL: https://issues.apache.org/jira/browse/SPARK-3826 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Currently hive-thriftserver does not support hive-0.13; make it support both 0.12 and 0.13. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3899) wrong links in streaming doc
wangfei created SPARK-3899: -- Summary: wrong links in streaming doc Key: SPARK-3899 URL: https://issues.apache.org/jira/browse/SPARK-3899 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0 Reporter: wangfei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166593#comment-14166593 ] Yu Ishikawa commented on SPARK-2429: Hi [~rnowling], Thank you for your comments and advice. {quote} Ok, first off, let me make sure I understand what you're doing. You start with 2 centers. You assign all the points. You then apply KMeans recursively to each cluster, splitting each center into 2 centers. Each instance of KMeans stops when the error is below a certain value or a fixed number of iterations have been run. {quote} You are right. The algorithm runs as you said. {quote} I think your analysis of the overall run time is good and probably what we expect. Can you break down the timing to see which parts are the most expensive? Maybe we can figure out where to optimize it. {quote} OK. I will measure the execution time of the parts of the implementation. {quote} 1. It might be good to convert everything to Breeze vectors before you do any operations – you need to convert the same vectors over and over again. KMeans converts them at the beginning and converts the vectors for the centers back at the end. {quote} I agree with you. I am struggling with this problem. After training the model, the user will likely want to select the data in a cluster, which is a subset of the whole input data. I think there are three approaches to realize this, as below. # We extract the centers and their `RDD \[Vector\]` data in a cluster during the training, like my implementation. # We extract the centers and their `RDD\[BV\[Double\]\]` data, and then convert the data into `RDD\[Vector\]` at the end. Converting from Breeze vectors to Spark vectors is very slow; that's why we didn't implement it. # We only extract the centers through the training, not their data. And then we apply the trained model to the input data with the `predict` method, like scikit-learn, in order to extract the part of the data in each cluster. This seems to be good. We would have to save the `RDD\[BV\[Double\]\]` data of each cluster throughout the clustering. Because we extract the `RDD\[Vector\]` data of each cluster after the training, I am worried that keeping the `RDD\[BV\[Double\]\]` data throughout the clustering is wasteful. And I am unsure how to elegantly save the data during the clustering. {quote} 2. Instead of passing the centers as part of the EuclideanClosestCenterFinder, look into using a broadcast variable. See the latest KMeans implementation. This could improve performance by 10%+. 3. You may want to look into using reduceByKey or similar RDD operations – they will enable parallel reductions which will be faster than a loop on the master. {quote} I will give it a try. Thanks! Hierarchical Implementation of KMeans - Key: SPARK-2429 URL: https://issues.apache.org/jira/browse/SPARK-2429 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: Yu Ishikawa Priority: Minor Attachments: The Result of Benchmarking a Hierarchical Clustering.pdf Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment.
Discussion on the dev list suggested the following possible approaches: * Top down, recursive application of KMeans * Reuse DecisionTree implementation with different objective function * Hierarchical SVD It was also suggested that support for distance metrics other than Euclidean, such as negative dot product or cosine, is necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
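To make suggestions 2 and 3 in the comment above concrete, the assignment step can broadcast the centers and use reduceByKey for per-cluster partial sums. The sketch below is illustrative only; the names and the Breeze-based point representation are assumptions, not the actual patch.
{code}
import breeze.linalg.{squaredDistance, DenseVector => BDV}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits for reduceByKey
import org.apache.spark.rdd.RDD

// Sketch: assign each point to its closest center and compute per-cluster
// sums and counts in parallel, broadcasting the centers instead of capturing
// them in a closest-center-finder object.
object HierarchicalKMeansSketch {
  def assignAndSum(
      sc: SparkContext,
      data: RDD[BDV[Double]],
      centers: Array[BDV[Double]]): Map[Int, (BDV[Double], Long)] = {
    val bcCenters = sc.broadcast(centers) // shipped once per executor
    data.map { point =>
      val cs = bcCenters.value
      val closest = cs.indices.minBy(i => squaredDistance(point, cs(i)))
      (closest, (point, 1L))
    }.reduceByKey { case ((sum1, n1), (sum2, n2)) =>
      (sum1 + sum2, n1 + n2) // parallel partial reduction, no loop on the driver
    }.collectAsMap().toMap
  }
}
{code}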
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166705#comment-14166705 ] Oleg Zhurakousky commented on SPARK-3561: - Patrick, I think there is a misunderstanding about the mechanics of this proposal, so I'd like to clarify. The proposal here is certainly not to introduce any new dependencies to Spark Core, and the existing pull request (https://github.com/apache/spark/pull/2422) clearly shows it. What I am proposing is to expose an integration point in Spark by means of extracting *existing* Spark operations into a *configurable and @Experimental* strategy, allowing Spark not only to integrate with other execution environments, but also to be easier to unit-test, since it would provide a clear separation between the _assembly_ and _execution_ layers, allowing them to be tested in isolation. I think this feature would benefit Spark tremendously, particularly given that several folks have already expressed their interest in this feature/direction. I appreciate your help and advice in getting this contribution into Spark. Thanks! Allow for pluggable execution contexts in Spark --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL such as execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will now have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166705#comment-14166705 ] Oleg Zhurakousky edited comment on SPARK-3561 at 10/10/14 12:10 PM: Patrick, I think there is misunderstanding about the mechanics of this proposal, so I'd like to clarify. The proposal here is certainly not to introduce any new dependencies to Spark Core and existing pull request (https://github.com/apache/spark/pull/2422) clearly shows it. What I am proposing is to expose an integration point in Spark by means of extracting *existing* Spark operations into a *configurable and @Experimental* strategy, allowing Spark not only to integrate with other execution contexts, but it would also be very useful in unit-testing as it would provide a clear separation between _assembly_ and _execution_ layer allowing them to be tested in isolation. I think this feature would benefit Spark tremendously; particularly given how several folks have already expressed their interest in this feature/direction. Appreciate your help and advise in helping to get this contribution into Spark. Thanks! was (Author: ozhurakousky): Patrick, I think there is misunderstanding about the mechanics of this proposal, so I'd like to clarify. The proposal here is certainly not to introduce any new dependencies to Spark Core and existing pull request (https://github.com/apache/spark/pull/2422) clearly shows it. What I am proposing is to expose an integration point in Spark by means of extracting *existing* Spark operations into a *configurable and @Experimental* strategy, allowing Spark not only to integrate with other execution environments, but it would also be very useful in unit-testing as it would provide a clear separation between _assembly_ and _execution_ layer allowing them to be tested in isolation. I think this feature would benefit Spark tremendously; particularly given how several folks have already expressed their interest in this feature/direction. Appreciate your help and advise in helping to get this contribution into Spark. Thanks! Allow for pluggable execution contexts in Spark --- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define 4 only operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding methods in current version of SparkContext. JobExecutionContext implementation will be accessed by SparkContext via master URL as execution-context:foo.bar.MyJobExecutionContext with default implementation containing the existing code from SparkContext, thus allowing current (corresponding) methods of SparkContext to delegate to such implementation. 
An integrator will now have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
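As a rough illustration of the trait described in this proposal, a simplified sketch is below; the signatures are assumptions reduced to their essence, and the real definitions live in the linked pull request and design doc.
{code}
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Sketch of a pluggable execution strategy mirroring the four SparkContext
// operations named in the proposal (signatures simplified for illustration).
// A default implementation would delegate to the existing SparkContext code
// paths; an integrator could supply another one via a master URL of the form
// execution-context:foo.bar.MyJobExecutionContext.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T], func: Iterator[T] => U): Array[U]
}
{code}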
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166724#comment-14166724 ] Venkata Ramana G commented on SPARK-3892: - Can you explain in detail? Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166754#comment-14166754 ] Adrian Wang commented on SPARK-3892: Of course. We are using the `.typeName` method to build the formatted string and the JSON serialization, but in MapType it turns out to be `simpleName`; I assume it is a typo. The `simpleName` function is never used. [~lian cheng] Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166812#comment-14166812 ] Cheng Lian commented on SPARK-3892: --- Actually {{MapType.simpleName}} can simply be removed; it's not used anywhere. I forgot to remove it while refactoring. {{DataType.typeName}} is defined as: {code} def typeName: String = this.getClass.getSimpleName.stripSuffix("$").dropRight(4).toLowerCase {code} So concrete {{DataType}} classes don't need to override {{typeName}} as long as their name ends with {{Type}}. Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
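In other words, the default derivation strips a trailing {{$}} (for companion objects), drops the {{Type}} suffix, and lower-cases the rest; a tiny illustration of the string manipulation:
{code}
// Hypothetical REPL session showing how the default typeName is derived.
"MapType$".stripSuffix("$").dropRight(4).toLowerCase   // => "map"
"StructType".stripSuffix("$").dropRight(4).toLowerCase // => "struct"
{code}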
[jira] [Commented] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166822#comment-14166822 ] Cheng Lian commented on SPARK-3892: --- [~adrian-wang] You're right, it's a typo. So would you mind changing the priority of this ticket to Minor? Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-3892: --- Priority: Minor (was: Major) Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3892) Map type should have typeName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-3892: --- Issue Type: Improvement (was: Bug) Map type should have typeName - Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Wang updated SPARK-3892: --- Summary: Map type do not need simpleName (was: Map type should have typeName) Map type do not need simpleName --- Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3876) Doing a RDD map/reduce within a DStream map fails with a high enough input rate
[ https://issues.apache.org/jira/browse/SPARK-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166832#comment-14166832 ] Saisai Shao commented on SPARK-3876: Hi [~afilip], is there any specific reason you need to do an RDD's map and reduce operations inside a DStream's map function? I don't think this code can work and be executed correctly on the remote side. This code translates into an RDD transformation in each batch duration, like: rdd.map { r => rdd1.map(c => op(c, r)).reduce(...) }.foreach(...) Since an RDD's transformations have to be divided into stages on the driver side and executed on the executor side, using an RDD remotely inside a closure will produce an error. If you want to use this RDD as a lookup table, you can build a local hashmap and broadcast it to the remote side for lookups. So maybe this is not a bug. Doing a RDD map/reduce within a DStream map fails with a high enough input rate --- Key: SPARK-3876 URL: https://issues.apache.org/jira/browse/SPARK-3876 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: Andrei Filip Having a custom receiver that generates random strings at custom rates: JavaRandomSentenceReceiver A class that does work on a received string: class LengthGetter implements Serializable{ public int getStrLength(String s){ return s.length(); } } The following code: List<LengthGetter> objList = Arrays.asList(new LengthGetter(), new LengthGetter(), new LengthGetter()); final JavaRDD<LengthGetter> objRdd = sc.parallelize(objList); JavaInputDStream<String> sentences = jssc.receiverStream(new JavaRandomSentenceReceiver(frequency)); sentences.map(new Function<String, Integer>() { @Override public Integer call(final String input) throws Exception { Integer res = objRdd.map(new Function<LengthGetter, Integer>() { @Override public Integer call(LengthGetter lg) throws Exception { return lg.getStrLength(input); } }).reduce(new Function2<Integer, Integer, Integer>() { @Override public Integer call(Integer left, Integer right) throws Exception { return left + right; } }); return res; } }).print(); fails for high enough frequencies with the following stack trace: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3.0:0 failed 1 times, most recent failure: Exception failure in TID 3 on host localhost: java.lang.NullPointerException org.apache.spark.rdd.RDD.map(RDD.scala:270) org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:72) org.apache.spark.api.java.JavaRDD.map(JavaRDD.scala:29) Other information that might be useful is that my current batch duration is set to 1sec and the frequencies for JavaRandomSentenceReceiver at which the application fails are as low as 2Hz (1Hz for example works) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
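For illustration, the suggested workaround looks roughly like the sketch below: the small lookup data is broadcast, and only the broadcast value is referenced inside the DStream's map closure. The names are placeholders standing in for the reporter's LengthGetter objects, not API from this ticket.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream

// Sketch: instead of calling map/reduce on another RDD inside the DStream's
// map function (RDD transformations must be planned on the driver), broadcast
// the small driver-local collection and use it in the closure.
object BroadcastLookupSketch {
  def sumLengths(
      sc: SparkContext,
      sentences: DStream[String],
      lengthGetters: Seq[String => Int]): Unit = {
    val bcGetters = sc.broadcast(lengthGetters) // small, driver-local data
    sentences.map { input =>
      bcGetters.value.map(getter => getter(input)).sum // no nested RDD operations
    }.print()
  }
}
{code}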
[jira] [Commented] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166838#comment-14166838 ] Ravindra Pesala commented on SPARK-3880: There is already some work going on in the direction of adding foreign data sources to Spark SQL: https://github.com/apache/spark/pull/2475. So I guess HBase is also a foreign data source, and it should fit into this design. Adding a new project/context for each data source may be cumbersome to maintain. Can we improve on the current PR to add DDL support? HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166840#comment-14166840 ] Adrian Wang commented on SPARK-3892: Yeah. Actually the original method is called simpleString, and now we have typeName. Map type do not need simpleName --- Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166848#comment-14166848 ] Cheng Lian commented on SPARK-3892: --- Ah, while working on the {{DataType}} JSON ser/de PR ([#2563|https://github.com/apache/spark/pull/2563]), I had at one point refactored {{simpleString}} to {{simpleName}}, and eventually arrived at the current version and removed all overrides from sub-classes. {{MapType.simpleName}} was not removed partly because it's a member of {{object MapType}}, which is not a subclass of {{DataType}}. Sorry for the trouble and confusion. Map type do not need simpleName --- Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166856#comment-14166856 ] Adrian Wang commented on SPARK-3892: Thanks for the explanation! I have created PR #2747 to change simpleName to typeName. Maybe it is also useful, since we defined this in object MapType; for class MapType, we already have the default one... Did I do anything wrong here? Map type do not need simpleName --- Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3892) Map type do not need simpleName
[ https://issues.apache.org/jira/browse/SPARK-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166886#comment-14166886 ] Cheng Lian commented on SPARK-3892: --- Please see my comments in the PR :) Map type do not need simpleName --- Key: SPARK-3892 URL: https://issues.apache.org/jira/browse/SPARK-3892 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3880) HBase as data source to SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167006#comment-14167006 ] Yan commented on SPARK-3880: The new context is intended to be very lightweight. We noticed that SparkSQL is a very active project and there have been talks/JIRAs about SQLContext and data sources. As mentioned in the design, we are aware of the PR and the need to have a universal mechanism to access different types of data stores; we will keep a close watch on the latest movements and will definitely fit our efforts to those latest features and interfaces when they are ready and reasonably stable. In the meantime, the design is intended to be heavy on the HBase-specific data model, data access mechanisms and query optimizations, and to keep the integration part lightweight so it can be easily adjusted to future changes. The point is that we need to find some compromise between a rapidly changing project and the need to have a more or less stable context to base a new feature on. Chasing a constantly moving target is never easy, I guess. HBase as data source to SparkSQL Key: SPARK-3880 URL: https://issues.apache.org/jira/browse/SPARK-3880 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yan Assignee: Yan Attachments: HBaseOnSpark.docx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3900) ApplicationMaster's shutdown hook fails and IllegalStateException is thrown.
[ https://issues.apache.org/jira/browse/SPARK-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3900: -- Summary: ApplicationMaster's shutdown hook fails and IllegalStateException is thrown. (was: ApplicationMaster's shutdown hook fails to cleanup staging directory.) ApplicationMaster's shutdown hook fails and IllegalStateException is thrown. Key: SPARK-3900 URL: https://issues.apache.org/jira/browse/SPARK-3900 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Environment: Hadoop 0.23 Reporter: Kousuke Saruta Priority: Critical ApplicationMaster registers a shutdown hook and it calls ApplicationMaster#cleanupStagingDir. cleanupStagingDir invokes FileSystem.get(yarnConf) and it invokes FileSystem.getInternal. FileSystem.getInternal also registers shutdown hook. In FileSystem of hadoop 0.23, the shutdown hook registration does not consider whether shutdown is in progress or not (In 2.2, it's considered). {code} // 0.23 if (map.isEmpty() ) { ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY); } {code} {code} // 2.2 if (map.isEmpty() !ShutdownHookManager.get().isShutdownInProgress()) { ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY); } {code} Thus, in 0.23, another shutdown hook can be registered when ApplicationMaster's shutdown hook run. This issue cause IllegalStateException as follows. {code} java.lang.IllegalStateException: Shutdown in progress, cannot add a shutdownHook at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:152) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2306) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162) at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:307) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:118) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3900) ApplicationMaster's shutdown hook fails to cleanup staging directory.
Kousuke Saruta created SPARK-3900: - Summary: ApplicationMaster's shutdown hook fails to cleanup staging directory. Key: SPARK-3900 URL: https://issues.apache.org/jira/browse/SPARK-3900 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Environment: Hadoop 0.23 Reporter: Kousuke Saruta Priority: Critical ApplicationMaster registers a shutdown hook and it calls ApplicationMaster#cleanupStagingDir. cleanupStagingDir invokes FileSystem.get(yarnConf) and it invokes FileSystem.getInternal. FileSystem.getInternal also registers shutdown hook. In FileSystem of hadoop 0.23, the shutdown hook registration does not consider whether shutdown is in progress or not (In 2.2, it's considered). {code} // 0.23 if (map.isEmpty() ) { ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY); } {code} {code} // 2.2 if (map.isEmpty() !ShutdownHookManager.get().isShutdownInProgress()) { ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY); } {code} Thus, in 0.23, another shutdown hook can be registered when ApplicationMaster's shutdown hook run. This issue cause IllegalStateException as follows. {code} java.lang.IllegalStateException: Shutdown in progress, cannot add a shutdownHook at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:152) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2306) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162) at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:307) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:118) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
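One way to sidestep the failure, assuming the staging directory path is known before the hook is registered, is to resolve the FileSystem eagerly so that FileSystem.get (and its own shutdown-hook registration) never runs while shutdown is already in progress. The sketch below only illustrates the idea and is not the actual patch.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: obtain the FileSystem before any shutdown hook fires, so the hook
// itself never calls FileSystem.get (which on Hadoop 0.23 tries to register
// another hook and throws IllegalStateException during shutdown).
class StagingDirCleaner(yarnConf: Configuration, stagingDir: Path) {
  private val fs: FileSystem = FileSystem.get(yarnConf) // resolved eagerly

  def registerShutdownHook(): Unit = {
    Runtime.getRuntime.addShutdownHook(new Thread {
      override def run(): Unit = {
        fs.delete(stagingDir, true) // no FileSystem.get inside the hook
      }
    })
  }
}
{code}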
[jira] [Commented] (SPARK-3795) Add scheduler hooks/heuristics for adding and removing executors
[ https://issues.apache.org/jira/browse/SPARK-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167131#comment-14167131 ] Nan Zhu commented on SPARK-3795: Is this for YARN or standalone? Add scheduler hooks/heuristics for adding and removing executors Key: SPARK-3795 URL: https://issues.apache.org/jira/browse/SPARK-3795 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Wendell Assignee: Andrew Or To support dynamic scaling of a Spark application, Spark's scheduler will need to have hooks around explicitly decommissioning executors. We'll also need basic heuristics governing when to start/stop executors based on load. An initial goal is to keep this very simple. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3845) SQLContext(...) should inherit configurations from SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jianshi Huang resolved SPARK-3845. -- Resolution: Fixed Fix Version/s: 1.2.0 SQLContext(...) should inherit configurations from SparkContext --- Key: SPARK-3845 URL: https://issues.apache.org/jira/browse/SPARK-3845 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Jianshi Huang Fix For: 1.2.0 It's very confusing that Spark configurations (e.g. spark.serializer, spark.speculation, etc.) can be set in the spark-defaults.conf file, while SparkSQL configurations (e.g. spark.sql.inMemoryColumnarStorage.compressed, spark.sql.codegen, etc.) have to be set either via sqlContext.setConf or sql("SET ..."). When I do: val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext) I would expect sqlContext to recognize all the SQL configurations that come with sparkContext. Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
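For illustration, the requested behavior is that spark.sql.* settings placed on the SparkConf (or in spark-defaults.conf) become visible through the SQLContext built from that SparkContext. The keys and values in the sketch below are examples only.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch of the expected behavior: a spark.sql.* key set alongside core Spark
// options should be picked up by the SQLContext created from the SparkContext.
object SqlConfInheritanceExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sql-conf-inheritance")
      .setMaster("local[2]")
      .set("spark.sql.codegen", "true") // SQL option set like any other Spark option
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // With the fix, this reflects the value from the SparkConf, not the default.
    println(sqlContext.getConf("spark.sql.codegen", "false"))
    sc.stop()
  }
}
{code}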
[jira] [Commented] (SPARK-2962) Suboptimal scheduling in spark
[ https://issues.apache.org/jira/browse/SPARK-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167144#comment-14167144 ] Nan Zhu commented on SPARK-2962: Hi, [~mrid...@yahoo-inc.com] I think this has been fixed in https://github.com/apache/spark/pull/1313/files, {code:title=TaskSetManager.scala|borderStyle=solid} if (tasks(index).preferredLocations == Nil) { addTo(pendingTasksWithNoPrefs) } {code} Now, only tasks without explicit preference is added to pendingTasksWithNoPrefs, and NO_PREF tasks are always scheduled after NODE_LOCAL Suboptimal scheduling in spark -- Key: SPARK-2962 URL: https://issues.apache.org/jira/browse/SPARK-2962 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: All Reporter: Mridul Muralidharan In findTask, irrespective of 'locality' specified, pendingTasksWithNoPrefs are always scheduled with PROCESS_LOCAL pendingTasksWithNoPrefs contains tasks which currently do not have any alive locations - but which could come in 'later' : particularly relevant when spark app is just coming up and containers are still being added. This causes a large number of non node local tasks to be scheduled incurring significant network transfers in the cluster when running with non trivial datasets. The comment // Look for no-pref tasks after rack-local tasks since they can run anywhere. is misleading in the method code : locality levels start from process_local down to any, and so no prefs get scheduled much before rack. Also note that, currentLocalityIndex is reset to the taskLocality returned by this method - so returning PROCESS_LOCAL as the level will trigger wait times again. (Was relevant before recent change to scheduler, and might be again based on resolution of this issue). Found as part of writing test for SPARK-2931 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167152#comment-14167152 ] Burak Yavuz commented on SPARK-3434: [~ConcreteVitamin], any updates? Anything I can help out with? Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices or many RDDs with each contains only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
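To make the "single RDD of sub-matrices" option in question 1 concrete, a rough sketch of what such a layout could look like. The types, block size, and keying scheme are assumptions for illustration (assuming an existing SparkContext sc), not a proposed design.
{code}
// Rough sketch of the single-RDD option: each element is
// ((blockRowIndex, blockColIndex), localSubMatrix).
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

val blocksPerSide = 4
val blockSize = 100

// Keyed by block coordinates so a custom Partitioner could later keep the
// blocks needed for multiplication or factorization co-located.
val blocks: RDD[((Int, Int), Matrix)] = sc.parallelize(
  for (i <- 0 until blocksPerSide; j <- 0 until blocksPerSide) yield
    ((i, j), Matrices.dense(blockSize, blockSize, new Array[Double](blockSize * blockSize)))
)
{code}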
[jira] [Commented] (SPARK-3823) Spark Hive SQL readColumn is not reset each time for a new query
[ https://issues.apache.org/jira/browse/SPARK-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167239#comment-14167239 ] Ravindra Pesala commented on SPARK-3823: It seems this issue is duplicate of https://issues.apache.org/jira/browse/SPARK-3559 Spark Hive SQL readColumn is not reset each time for a new query Key: SPARK-3823 URL: https://issues.apache.org/jira/browse/SPARK-3823 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Alex Liu After a few queries running in the same hiveContext, hive.io.file.readcolumn.ids and hive.io.file.readcolumn.names values are added on by pre-running queries. e.g. running the following querys {code} hql(use sql_integration_ks) val container = hql(select * from double_table as aa JOIN boolean_table as bb on aa.type_id = bb.type_id) container.collect().foreach(println) val container = hql(select * from ascii_table ORDER BY type_id) container.collect().foreach(println) val container = hql(select shippers.shippername, COUNT(orders.orderid) AS numorders FROM orders LEFT JOIN shippers ON orders.shipperid=shippers.shipperid GROUP BY shippername) container.collect().foreach(println) val container = hql(select * from ascii_table where type_id 126) container.collect().length {code} The read column ids for the last query are [2, 0, 3, 1] read column names are : type_id,value,type_id,value,type_id,value,orderid,shipperid,shipper name, shipperid The source code is at https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala#L80 hiveContext has a shared hiveconf which add readColumns for each query. It should be reset each time for a new hive query or remove the duplicate readColumn Ids -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
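A minimal sketch of the reset suggested at the end of the description. The two property names come from the report; clearing them to an empty string before each query is an assumption about the fix direction, not the actual patch.
{code}
import org.apache.hadoop.hive.conf.HiveConf

// Sketch only: clear the accumulated column-pruning state on the shared
// HiveConf before planning a new query, so ids/names do not leak across queries.
def resetReadColumns(hiveconf: HiveConf): Unit = {
  hiveconf.set("hive.io.file.readcolumn.ids", "")
  hiveconf.set("hive.io.file.readcolumn.names", "")
}
{code}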
[jira] [Commented] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
[ https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167303#comment-14167303 ] Mridul Muralidharan commented on SPARK-3889: The status says fixed - what was done to resolve this ? I did not see a PR ... JVM dies with SIGBUS, resulting in ConnectionManager failed ACK --- Key: SPARK-3889 URL: https://issues.apache.org/jira/browse/SPARK-3889 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Fix For: 1.2.0 Here's the first part of the core dump, possibly caused by a job which shuffles a lot of very small partitions. {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread Executor task launch worker-170 daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multipleid=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-3568: -- Description: Include common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[T], Array[T]]*. The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics[T](predictionAndLabels: RDD[(Array[T], Array[T])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /** * @param k the position to compute the truncated precision * @return the average precision at the first k ranking positions */ def precision(k: Int): Double /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcgAtK: RDD[Array[Double]] /** * @param k the position to compute the truncated ndcg * @return the average ndcg at the first k ranking positions */ def ndcg(k: Int): Double } {code} was: Include common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[Double], Array[Double]]*. The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics(predictionAndLabels: RDD[(Array[Double], Array[Double])]) { /* Returns the precsion@k for each query */ lazy val precAtK: RDD[Array[Double]] /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /*Returns the mean average precision (MAP) of all the queries*/ lazy val meanAvePrec: Double /*Returns the normalized discounted cumulative gain for each query */ lazy val ndcg: RDD[Double] /* Returns the mean NDCG of all the queries */ lazy val meanNdcg: Double } {code} Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Include common metrics for ranking algorithms (http://www-nlp.stanford.edu/IR-book/), including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[Array[T], Array[T]]*. 
The following methods will be implemented: {code:title=RankingMetrics.scala|borderStyle=solid} class RankingMetrics[T](predictionAndLabels: RDD[(Array[T], Array[T])]) { /* Returns the precision@k for each query */ lazy val precAtK: RDD[Array[Double]] /** * @param k the position to compute the truncated precision * @return the average precision at the first k ranking positions */ def precision(k: Int): Double /* Returns the average precision for each query */ lazy val avePrec: RDD[Double] /* Returns the mean average precision (MAP) of all the queries */ lazy val meanAvePrec: Double /* Returns the normalized discounted cumulative gain for each query */ lazy val ndcgAtK: RDD[Array[Double]] /** * @param k the position to compute the truncated ndcg * @return the average ndcg at the first k ranking positions */ def ndcg(k: Int): Double } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
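A short usage sketch of the proposed API above (assuming an existing SparkContext sc). The class is still a proposal, so the names may change; the data is made up.
{code}
// Hypothetical usage of the proposed RankingMetrics class; the input is an
// RDD of (predicted ranking, ground-truth relevant items) pairs.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 6, 2, 7, 8), Array(1, 2, 3, 4, 5)),
  (Array(4, 1, 5, 6, 2), Array(1, 2, 3))
))

val metrics = new RankingMetrics(predictionAndLabels)
println(metrics.precision(5))   // precision@5 averaged over the queries
println(metrics.ndcg(5))        // NDCG@5 averaged over the queries
println(metrics.meanAvePrec)    // mean average precision (MAP)
{code}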
[jira] [Commented] (SPARK-3795) Add scheduler hooks/heuristics for adding and removing executors
[ https://issues.apache.org/jira/browse/SPARK-3795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167457#comment-14167457 ] Andrew Or commented on SPARK-3795: -- It's agnostic to the cluster manager, but for now we will focus on Yarn (SPARK-3822). Later we will do the same for standalone and mesos. Add scheduler hooks/heuristics for adding and removing executors Key: SPARK-3795 URL: https://issues.apache.org/jira/browse/SPARK-3795 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 1.1.0 Reporter: Patrick Wendell Assignee: Andrew Or To support dynamic scaling of a Spark application, Spark's scheduler will need to have hooks around explicitly decommissioning executors. We'll also need basic heuristics governing when to start/stop executors based on load. An initial goal is to keep this very simple. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167478#comment-14167478 ] Shivaram Venkataraman commented on SPARK-3434: -- ~brkyvz -- We are just adding a few more test cases to classes to make sure our interfaces look fine. I'll also create a simple design doc and post it here. Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices or many RDDs with each contains only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167478#comment-14167478 ] Shivaram Venkataraman edited comment on SPARK-3434 at 10/10/14 8:45 PM: [~brkyvz] -- We are just adding a few more test cases to classes to make sure our interfaces look fine. I'll also create a simple design doc and post it here. was (Author: shivaram): ~brkyvz -- We are just adding a few more test cases to classes to make sure our interfaces look fine. I'll also create a simple design doc and post it here. Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored in block sub-matrices. The main challenge is the partitioning scheme to allow adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices or many RDDs with each contains only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3886) Choose the batch size of serializer based on size of object
[ https://issues.apache.org/jira/browse/SPARK-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3886. --- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2740 [https://github.com/apache/spark/pull/2740] Choose the batch size of serializer based on size of object --- Key: SPARK-3886 URL: https://issues.apache.org/jira/browse/SPARK-3886 Project: Spark Issue Type: Improvement Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 The default batch size (1024) may not work for huge objects, so it's better to choose the proper size based on the size of objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
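To illustrate the heuristic being described (this is a generic sketch of the idea, not the actual change in pull request 2740), batching can adapt to the observed serialized size instead of always using a fixed 1024 objects. The target size below is an arbitrary assumption.
{code}
// Illustrative sketch: grow or shrink the number of objects per serialized
// batch based on how many bytes the previous batch produced.
def nextBatchSize(currentBatchSize: Int, lastBatchBytes: Long,
                  targetBytes: Long = 64 * 1024): Int = {
  if (lastBatchBytes > 2 * targetBytes) math.max(1, currentBatchSize / 2)
  else if (lastBatchBytes < targetBytes / 2) currentBatchSize * 2
  else currentBatchSize
}

// Huge objects quickly drive the batch size down; tiny objects let it grow.
println(nextBatchSize(1024, lastBatchBytes = 16L * 1024 * 1024)) // 512
println(nextBatchSize(1024, lastBatchBytes = 8L * 1024))         // 2048
{code}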
[jira] [Updated] (SPARK-3886) Choose the batch size of serializer based on size of object
[ https://issues.apache.org/jira/browse/SPARK-3886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3886: -- Affects Version/s: 1.0.2 1.1.0 Choose the batch size of serializer based on size of object --- Key: SPARK-3886 URL: https://issues.apache.org/jira/browse/SPARK-3886 Project: Spark Issue Type: Improvement Affects Versions: 1.0.2, 1.1.0 Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.2.0 The default batch size (1024) maybe will not work for huge objects, so it's better to choose the proper size based on the size of objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3901) Add SocketSink capability for Spark metrics
Sreepathi Prasanna created SPARK-3901: - Summary: Add SocketSink capability for Spark metrics Key: SPARK-3901 URL: https://issues.apache.org/jira/browse/SPARK-3901 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.0.0 Reporter: Sreepathi Prasanna Priority: Minor Fix For: 1.1.1 Spark depends on the Coda Hale metrics library to collect metrics. Today we can send metrics to console, CSV and JMX. We use Chukwa as a monitoring framework to monitor the Hadoop services. To extend the framework to collect Spark metrics, we need an additional SocketSink capability, which is not there at the moment in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3901) Add SocketSink capability for Spark metrics
[ https://issues.apache.org/jira/browse/SPARK-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167541#comment-14167541 ] Sreepathi Prasanna commented on SPARK-3901: --- For this, we need a SocketReporter class in Coda Hale, which I have submitted a request for: https://github.com/dropwizard/metrics/pull/685 Once this is reviewed and merged into Coda Hale, we can use a SocketSink class to send the metrics over a socket. Add SocketSink capability for Spark metrics --- Key: SPARK-3901 URL: https://issues.apache.org/jira/browse/SPARK-3901 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Sreepathi Prasanna Priority: Minor Fix For: 1.1.1 Original Estimate: 48h Remaining Estimate: 48h Spark depends on the Coda Hale metrics library to collect metrics. Today we can send metrics to console, CSV and JMX. We use Chukwa as a monitoring framework to monitor the Hadoop services. To extend the framework to collect Spark metrics, we need an additional SocketSink capability, which is not there at the moment in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
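As a rough illustration of what "sending metrics over a socket" means here, a self-contained sketch follows. It is generic code, not the pending SocketReporter/SocketSink patch; the record format, host, and port are placeholders.
{code}
import java.io.PrintWriter
import java.net.Socket

// Generic sketch: push newline-delimited "name value timestamp" records to a
// collector (e.g. a Chukwa adaptor listening on a TCP port).
def reportOnce(host: String, port: Int, metrics: Map[String, Double]): Unit = {
  val socket = new Socket(host, port)
  val out = new PrintWriter(socket.getOutputStream, true)
  try {
    val now = System.currentTimeMillis()
    metrics.foreach { case (name, value) => out.println(s"$name $value $now") }
  } finally {
    out.close()
    socket.close()
  }
}
{code}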
[jira] [Commented] (SPARK-3901) Add SocketSink capability for Spark metrics
[ https://issues.apache.org/jira/browse/SPARK-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167550#comment-14167550 ] Sreepathi Prasanna commented on SPARK-3901: --- I have the patch ready, but it will not work unless we have the SocketReporter in Coda Hale. Add SocketSink capability for Spark metrics --- Key: SPARK-3901 URL: https://issues.apache.org/jira/browse/SPARK-3901 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0, 1.1.0 Reporter: Sreepathi Prasanna Priority: Minor Fix For: 1.1.1 Original Estimate: 48h Remaining Estimate: 48h Spark depends on the Coda Hale metrics library to collect metrics. Today we can send metrics to console, CSV and JMX. We use Chukwa as a monitoring framework to monitor the Hadoop services. To extend the framework to collect Spark metrics, we need an additional SocketSink capability, which is not there at the moment in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3902) Stabilize AsyncRDDActions and expose its methods in Java API
Josh Rosen created SPARK-3902: - Summary: Stabilize AsyncRDDActions and expose its methods in Java API Key: SPARK-3902 URL: https://issues.apache.org/jira/browse/SPARK-3902 Project: Spark Issue Type: New Feature Components: Java API, Spark Core Reporter: Josh Rosen The AsyncRDDActions methods are currently the easiest way to determine Spark jobs' ids for use in progress-monitoring code (see SPARK-2636). AsyncRDDActions is currently marked as {{@Experimental}}; for 1.2, I think that we should stabilize this API and expose it in Java, too. One concern is whether there's a better async API design that we should prefer over this one as our stable API; I had some ideas for a more general API in SPARK-3626 (discussed in much greater detail on GitHub: https://github.com/apache/spark/pull/2482) but decided against the more general API due to its confusing cancellation semantics. Given this, I'd be comfortable stabilizing our current API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
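For context, a small sketch of the existing (still {{@Experimental}}) async actions this ticket proposes to stabilize, assuming an existing SparkContext sc. The import brings in the implicit conversion to AsyncRDDActions in current Spark versions.
{code}
// Sketch of the current experimental async API.
import org.apache.spark.SparkContext._   // implicit conversion to AsyncRDDActions
import scala.concurrent.Await
import scala.concurrent.duration._

val rdd = sc.parallelize(1 to 1000000, 100)

// countAsync returns a FutureAction, which can be cancelled; this is one
// reason these methods are the easiest current hook for monitoring code.
val future = rdd.countAsync()
// future.cancel()

println(Await.result(future, 10.minutes))
{code}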
[jira] [Commented] (SPARK-3626) Replace AsyncRDDActions with a more general async. API
[ https://issues.apache.org/jira/browse/SPARK-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167571#comment-14167571 ] Josh Rosen commented on SPARK-3626: --- I've opened SPARK-3902 to discuss stabilizing our current AsyncRDDActions APIs. Replace AsyncRDDActions with a more general async. API -- Key: SPARK-3626 URL: https://issues.apache.org/jira/browse/SPARK-3626 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen The experimental AsyncRDDActions APIs seem to only exist in order to enable job cancellation. We've been considering extending these APIs to support progress monitoring, but this would require stabilizing them so they're no longer {{@Experimental}}. Instead, I propose to replace all of the AsyncRDDActions with a mechanism based on job groups which allows arbitrary computations to be run in job groups and supports cancellation / monitoring of Spark jobs launched from those computations. (full design pending; see my GitHub PR for more details). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
[ https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167647#comment-14167647 ] Aaron Davidson commented on SPARK-3889: --- Sorry, it was not linked: https://github.com/apache/spark/pull/2742 JVM dies with SIGBUS, resulting in ConnectionManager failed ACK --- Key: SPARK-3889 URL: https://issues.apache.org/jira/browse/SPARK-3889 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Fix For: 1.2.0 Here's the first part of the core dump, possibly caused by a job which shuffles a lot of very small partitions. {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread Executor task launch worker-170 daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multipleid=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3855) Binding Exception when running PythonUDFs
[ https://issues.apache.org/jira/browse/SPARK-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3855: --- Component/s: PySpark Binding Exception when running PythonUDFs - Key: SPARK-3855 URL: https://issues.apache.org/jira/browse/SPARK-3855 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Assignee: Michael Armbrust {code} from pyspark import * from pyspark.sql import * sc = SparkContext() sqlContext = SQLContext(sc) sqlContext.registerFunction(strlen, lambda string: len(string)) sqlContext.inferSchema(sc.parallelize([Row(a=test)])).registerTempTable(test) srdd = sqlContext.sql(SELECT strlen(a) FROM test WHERE strlen(a) 1) print srdd._jschema_rdd.baseSchemaRDD().queryExecution().toString() print srdd.collect() {code} output: {code} == Parsed Logical Plan == Project ['strlen('a) AS c0#1] Filter ('strlen('a) 1) UnresolvedRelation None, test, None == Analyzed Logical Plan == Project [c0#1] Project [pythonUDF#2 AS c0#1] EvaluatePython PythonUDF#strlen(a#0) Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) CAST(1, DoubleType)) EvaluatePython PythonUDF#strlen(a#0) SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525) == Optimized Logical Plan == Project [pythonUDF#2 AS c0#1] EvaluatePython PythonUDF#strlen(a#0) Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) 1.0) EvaluatePython PythonUDF#strlen(a#0) SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525) == Physical Plan == Project [pythonUDF#2 AS c0#1] BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#5] Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) 1.0) BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#3] ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525 Code Generation: false == RDD == 14/10/08 15:03:00 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 9) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF#2 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:46) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:46) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at
[jira] [Updated] (SPARK-3855) Binding Exception when running PythonUDFs
[ https://issues.apache.org/jira/browse/SPARK-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3855: --- Component/s: SQL Binding Exception when running PythonUDFs - Key: SPARK-3855 URL: https://issues.apache.org/jira/browse/SPARK-3855 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Assignee: Michael Armbrust {code} from pyspark import * from pyspark.sql import * sc = SparkContext() sqlContext = SQLContext(sc) sqlContext.registerFunction(strlen, lambda string: len(string)) sqlContext.inferSchema(sc.parallelize([Row(a=test)])).registerTempTable(test) srdd = sqlContext.sql(SELECT strlen(a) FROM test WHERE strlen(a) 1) print srdd._jschema_rdd.baseSchemaRDD().queryExecution().toString() print srdd.collect() {code} output: {code} == Parsed Logical Plan == Project ['strlen('a) AS c0#1] Filter ('strlen('a) 1) UnresolvedRelation None, test, None == Analyzed Logical Plan == Project [c0#1] Project [pythonUDF#2 AS c0#1] EvaluatePython PythonUDF#strlen(a#0) Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) CAST(1, DoubleType)) EvaluatePython PythonUDF#strlen(a#0) SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525) == Optimized Logical Plan == Project [pythonUDF#2 AS c0#1] EvaluatePython PythonUDF#strlen(a#0) Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) 1.0) EvaluatePython PythonUDF#strlen(a#0) SparkLogicalPlan (ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525) == Physical Plan == Project [pythonUDF#2 AS c0#1] BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#5] Project [a#0] Filter (CAST(pythonUDF#3, DoubleType) 1.0) BatchPythonEvaluation PythonUDF#strlen(a#0), [a#0,pythonUDF#3] ExistingRdd [a#0], MapPartitionsRDD[7] at mapPartitions at SQLContext.scala:525 Code Generation: false == RDD == 14/10/08 15:03:00 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 9) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF#2 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:46) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:46) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at
[jira] [Commented] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167868#comment-14167868 ] Aaron Staple commented on SPARK-1503: - [~mengxr] Thanks for the heads up! I’ll definitely go through TFOCS and am happy to work carefully and collaboratively on design. Implement Nesterov's accelerated first-order method --- Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Aaron Staple Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3903) Create general data loading method for LabeledPoints
Joseph K. Bradley created SPARK-3903: Summary: Create general data loading method for LabeledPoints Key: SPARK-3903 URL: https://issues.apache.org/jira/browse/SPARK-3903 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Proposal: Provide a more general data loading function for LabeledPoints. * load multiple data files (e.g., train + test), and ensure they have the same number of features (determined based on a scan of the data) * use same function for multiple input formats Proposed function format (in MLUtils), with default parameters: {code} def loadLabeledPointsFiles( sc: SparkContext, paths: Seq[String], numFeatures = -1, vectorFormat = auto, numPartitions = sc.defaultMinPartitions): Seq[RDD[LabeledPoint]] {code} About the parameters: * paths: list of paths to data files or folders with data files * vectorFormat options: dense/sparse/auto * numFeatures, numPartitions: same behavior as loadLibSVMFile -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3903) Create general data loading method for LabeledPoints
[ https://issues.apache.org/jira/browse/SPARK-3903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3903: - Description: Proposal: Provide a more general data loading function for LabeledPoints. * load multiple data files (e.g., train + test), and ensure they have the same number of features (determined based on a scan of the data) * use same function for multiple input formats Proposed function format (in MLUtils), with default parameters: {code} def loadLabeledPointsFiles( sc: SparkContext, paths: Seq[String], numFeatures = -1, vectorFormat = auto, numPartitions = sc.defaultMinPartitions): Seq[RDD[LabeledPoint]] {code} About the parameters: * paths: list of paths to data files or folders with data files * vectorFormat options: dense/sparse/auto * numFeatures, numPartitions: same behavior as loadLibSVMFile Return value: Order of RDDs follows the order of the paths. Note: This is named differently from loadLabeledPoints for 2 reasons: * different argument order (following loadLibSVMFile) * different return type was: Proposal: Provide a more general data loading function for LabeledPoints. * load multiple data files (e.g., train + test), and ensure they have the same number of features (determined based on a scan of the data) * use same function for multiple input formats Proposed function format (in MLUtils), with default parameters: {code} def loadLabeledPointsFiles( sc: SparkContext, paths: Seq[String], numFeatures = -1, vectorFormat = auto, numPartitions = sc.defaultMinPartitions): Seq[RDD[LabeledPoint]] {code} About the parameters: * paths: list of paths to data files or folders with data files * vectorFormat options: dense/sparse/auto * numFeatures, numPartitions: same behavior as loadLibSVMFile Create general data loading method for LabeledPoints Key: SPARK-3903 URL: https://issues.apache.org/jira/browse/SPARK-3903 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Proposal: Provide a more general data loading function for LabeledPoints. * load multiple data files (e.g., train + test), and ensure they have the same number of features (determined based on a scan of the data) * use same function for multiple input formats Proposed function format (in MLUtils), with default parameters: {code} def loadLabeledPointsFiles( sc: SparkContext, paths: Seq[String], numFeatures = -1, vectorFormat = auto, numPartitions = sc.defaultMinPartitions): Seq[RDD[LabeledPoint]] {code} About the parameters: * paths: list of paths to data files or folders with data files * vectorFormat options: dense/sparse/auto * numFeatures, numPartitions: same behavior as loadLibSVMFile Return value: Order of RDDs follows the order of the paths. Note: This is named differently from loadLabeledPoints for 2 reasons: * different argument order (following loadLibSVMFile) * different return type -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3903) Create general data loading method for LabeledPoints
[ https://issues.apache.org/jira/browse/SPARK-3903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3903: - Assignee: Joseph K. Bradley Create general data loading method for LabeledPoints Key: SPARK-3903 URL: https://issues.apache.org/jira/browse/SPARK-3903 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Proposal: Provide a more general data loading function for LabeledPoints. * load multiple data files (e.g., train + test), and ensure they have the same number of features (determined based on a scan of the data) * use same function for multiple input formats Proposed function format (in MLUtils), with default parameters: {code} def loadLabeledPointsFiles( sc: SparkContext, paths: Seq[String], numFeatures = -1, vectorFormat = auto, numPartitions = sc.defaultMinPartitions): Seq[RDD[LabeledPoint]] {code} About the parameters: * paths: list of paths to data files or folders with data files * vectorFormat options: dense/sparse/auto * numFeatures, numPartitions: same behavior as loadLibSVMFile Return value: Order of RDDs follows the order of the paths. Note: This is named differently from loadLabeledPoints for 2 reasons: * different argument order (following loadLibSVMFile) * different return type -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
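A hypothetical call against the proposed signature, to show how the multi-path loading would be used (assuming an existing SparkContext sc). The method does not exist yet and the paths are placeholders.
{code}
// Hypothetical usage of the proposed MLUtils.loadLabeledPointsFiles: the
// feature count is determined from a scan across *both* files, so train and
// test end up with consistent vector sizes.
import org.apache.spark.mllib.util.MLUtils

val Seq(training, test) = MLUtils.loadLabeledPointsFiles(
  sc,
  Seq("hdfs:///data/train.txt", "hdfs:///data/test.txt"))

println(s"train: ${training.count()} points, test: ${test.count()} points")
{code}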
[jira] [Updated] (SPARK-3898) History Web UI display incorrectly.
[ https://issues.apache.org/jira/browse/SPARK-3898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zzc updated SPARK-3898: --- Fix Version/s: (was: 1.1.1) History Web UI display incorrectly. --- Key: SPARK-3898 URL: https://issues.apache.org/jira/browse/SPARK-3898 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Environment: Spark 1.2.0-snapshot On Yarn Reporter: zzc Fix For: 1.2.0 After successfully run an spark application, history web ui display incorrectly: App Name:Not Started Started:1970/01/01 07:59:59 Spark User:Not Started Last Updated:2014/10/10 14:50:39 Exception message: 2014-10-10 14:51:14,284 - ERROR - org.apache.spark.Logging$class.logError(Logging.scala:96) - qtp1594785497-16851 -Exception in parsing Spark event log hdfs://wscluster/sparklogs/24.3g_15_5g_2c-1412923684977/EVENT_LOG_1 org.json4s.package$MappingException: Did not find value which can be converted into int at org.json4s.reflect.package$.fail(package.scala:96) at org.json4s.Extraction$.convert(Extraction.scala:554) at org.json4s.Extraction$.extract(Extraction.scala:331) at org.json4s.Extraction$.extract(Extraction.scala:42) at org.json4s.ExtractableJsonAstNode.extract(ExtractableJsonAstNode.scala:21) at org.apache.spark.util.JsonProtocol$.blockManagerIdFromJson(JsonProtocol.scala:647) at org.apache.spark.util.JsonProtocol$.blockManagerAddedFromJson(JsonProtocol.scala:468) at org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:404) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$loadAppInfo(FsHistoryProvider.scala:181) at org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:99) at org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:55) at org.apache.spark.deploy.history.HistoryServer$$anon$3.load(HistoryServer.scala:53) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) at com.google.common.cache.LocalCache.get(LocalCache.java:4000) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:88) at javax.servlet.http.HttpServlet.service(HttpServlet.java:735) at javax.servlet.http.HttpServlet.service(HttpServlet.java:848) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at
[jira] [Updated] (SPARK-3586) Support nested directories in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangxj updated SPARK-3586: -- Issue Type: Bug (was: Improvement) Support nested directories in Spark Streaming - Key: SPARK-3586 URL: https://issues.apache.org/jira/browse/SPARK-3586 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: wangxj Priority: Minor Labels: patch Fix For: 1.1.0 For text files, there is the method streamingContext.textFileStream(dataDirectory). Spark Streaming will monitor the directory dataDirectory and process any files created in that directory, but files written into nested directories are not supported, e.g. streamingContext.textFileStream(/test). Look at the directory contents: /test/file1 /test/file2 /test/dr/file1 With this method, textFileStream can only read: /test/file1 /test/file2 /test/dr/ but the file /test/dr/file1 is not read. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
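A minimal sketch of the call being discussed, with the reported limitation noted in comments. The directory name comes from the description; the batch interval and the assumption of an existing SparkContext sc are illustrative.
{code}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Only files created directly under /test are picked up today; a file written
// to the nested directory /test/dr is not seen by this stream.
val lines = ssc.textFileStream("/test")
lines.print()

ssc.start()
ssc.awaitTermination()
{code}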
[jira] [Updated] (SPARK-3867) ./python/run-tests failed when run with Python 2.6 and unittest2 is not installed
[ https://issues.apache.org/jira/browse/SPARK-3867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cocoatomo updated SPARK-3867: - Description: ./python/run-tests searches for a Python 2.6 executable on PATH and uses it if available. When using Python 2.6, it tries to import the unittest2 module, which is *not* a standard library in Python 2.6, so it fails with ImportError. commit: 1d72a30874a88bdbab75217f001cf2af409016e7 was: ./python/run-tests searches for a Python 2.6 executable on PATH and uses it if available. When using Python 2.6, it tries to import the unittest2 module, which is *not* a standard library in Python 2.6, so it fails with ImportError. ./python/run-tests failed when run with Python 2.6 and unittest2 is not installed Key: SPARK-3867 URL: https://issues.apache.org/jira/browse/SPARK-3867 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20 Reporter: cocoatomo Labels: pyspark, testing ./python/run-tests searches for a Python 2.6 executable on PATH and uses it if available. When using Python 2.6, it tries to import the unittest2 module, which is *not* a standard library in Python 2.6, so it fails with ImportError. commit: 1d72a30874a88bdbab75217f001cf2af409016e7 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168018#comment-14168018 ] Davies Liu commented on SPARK-922: -- We do not use JSON heavily in PySpark, and users have several choices of JSON library in Python, so this should not be an issue, I think. We definitely need to upgrade to Python 2.7 (as the default); if some users need Python 2.6, it's easy to use it via PYSPARK_PYTHON. Update Spark AMI to Python 2.7 -- Key: SPARK-922 URL: https://issues.apache.org/jira/browse/SPARK-922 Project: Spark Issue Type: Task Components: EC2, PySpark Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.1.0 Reporter: Josh Rosen Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org