[jira] [Created] (SPARK-5262) coalesce should allow NullType and 1 another type in parameters

2015-01-15 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-5262:
--

 Summary: coalesce should allow NullType and 1 another type in 
parameters
 Key: SPARK-5262
 URL: https://issues.apache.org/jira/browse/SPARK-5262
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang


Currently, Coalesce(null, 1, null) throws an exception.
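
As an illustration only, a hypothetical repro against a 1.x SQLContext (assuming COALESCE is recognized by the SQL dialect in use; `sc` is an existing SparkContext, e.g. from spark-shell):

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Mixing a NullType literal with an IntegerType literal currently fails here;
// the issue proposes that this resolve to IntegerType instead.
sqlContext.sql("SELECT COALESCE(NULL, 1, NULL)").collect()
{code}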






[jira] [Commented] (SPARK-5262) coalesce should allow NullType and 1 another type in parameters

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278416#comment-14278416
 ] 

Apache Spark commented on SPARK-5262:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/4057

> coalesce should allow NullType and 1 another type in parameters
> ---
>
> Key: SPARK-5262
> URL: https://issues.apache.org/jira/browse/SPARK-5262
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>
> Currently, Coalesce(null, 1, null) throws an exception.






[jira] [Updated] (SPARK-1084) Fix most build warnings

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1084:
--
Reporter: Sean Owen  (was: Sean Owen)

> Fix most build warnings
> ---
>
> Key: SPARK-1084
> URL: https://issues.apache.org/jira/browse/SPARK-1084
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: mvn, sbt, warning
> Fix For: 1.0.0
>
>
> I hope another boring tidy-up JIRA might be welcome. I'd like to fix most of 
> the warnings that appear during build, so that developers don't become 
> accustomed to them. The accompanying pull request contains a number of 
> commits to quash most warnings observed through the mvn and sbt builds, 
> although not all of them.
> FIXED!
> [WARNING] Parameter tasks is deprecated, use target instead
> Just a matter of updating <tasks> -> <target> in inline Ant scripts.
> WARNING: -p has been deprecated and will be reused for a different (but still 
> very cool) purpose in ScalaTest 2.0. Please change all uses of -p to -R.
> Goes away with updating scalatest plugin -> 1.0-RC2
> [WARNING] Note: 
> /Users/srowen/Documents/incubator-spark/core/src/test/scala/org/apache/spark/JavaAPISuite.java
>  uses unchecked or unsafe operations.
> [WARNING] Note: Recompile with -Xlint:unchecked for details.
> Mostly @SuppressWarnings("unchecked") but needed a few more things to reveal 
> the warning source: true (also needed for ) and version 
> 3.1 of the plugin. In a few cases some declaration changes were appropriate 
> to avoid warnings.
> /Users/srowen/Documents/incubator-spark/core/src/main/scala/org/apache/spark/util/IndestructibleActorSystem.scala:25:
>  warning: Could not find any member to link for "akka.actor.ActorSystem".
> /**
> ^
> Getting several scaladoc errors like this and I'm not clear why it can't find 
> the type -- outside its module? Remove the links as they're evidently not 
> linking anyway?
> /Users/srowen/Documents/incubator-spark/repl/src/main/scala/org/apache/spark/repl/SparkIMain.scala:86:
>  warning: Variable eval undefined in comment for class SparkIMain in class 
> SparkIMain
> $ has to be escaped as \$ in scaladoc, apparently
> [WARNING] 
> 'dependencyManagement.dependencies.dependency.exclusions.exclusion.artifactId'
>  for org.apache.hadoop:hadoop-yarn-client:jar with value '*' does not match a 
> valid id pattern. @ org.apache.spark:spark-parent:1.0.0-incubating-SNAPSHOT, 
> /Users/srowen/Documents/incubator-spark/pom.xml, line 494, column 25
> This one might need review.
> This is valid Maven syntax, but Maven still warns on it. I wanted to see if 
> we can do without it. 
> These are trying to exclude:
> - org.codehaus.jackson
> - org.sonatype.sisu.inject
> - org.xerial.snappy
> org.sonatype.sisu.inject doesn't actually seem to be a dependency anyway. 
> org.xerial.snappy is used by dependencies but the version seems to match 
> anyway (1.0.5).
> org.codehaus.jackson was intended to exclude 1.8.8, since Spark streaming 
> wants 1.9.11 directly. But the exclusion is in the wrong place if so, since 
> Spark depends straight on Avro, which is what brings in 1.8.8, still. 
> (hadoop-client 1.0.4 includes Jackson 1.0.1, so that needs an exclusion, but 
> the other Hadoop modules don't.)
> HBase depends on 1.8.8, but I figured it was intentional to leave that, as it 
> would not collide with Spark Streaming. (?)
> (I understand this varies by Hadoop version but confirmed this is all the 
> same for 1.0.4, 0.23.7, 2.2.0.)
> NOT FIXED.
> [warn] 
> /Users/srowen/Documents/incubator-spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:305:
>  method connect in class IOManager is deprecated: use the new implementation 
> in package akka.io instead
> [warn]   override def preStart = IOManager(context.system).connect(new 
> InetSocketAddress(port))
> Not confident enough to fix this.
> [WARNING] there were 6 feature warning(s); re-run with -feature for details
> Don't know enough Scala to address these, yet.
> [WARNING] We have a duplicate 
> org/yaml/snakeyaml/scanner/ScannerImpl$Chomping.class in 
> /Users/srowen/.m2/repository/org/yaml/snakeyaml/1.6/snakeyaml-1.6.jar
> Probably addressable by being more careful about how binaries are packed, 
> though this appears to be ignorable; two identical copies of the class are 
> colliding.
> [WARNING] Zinc server is not available at port 3030 - reverting to normal 
> incremental compile
> and
> [WARNING] JAR will be empty - no content was marked for inclusion!
> Apparently harmless warnings, but I don't know how to disable them.




[jira] [Updated] (SPARK-1181) 'mvn test' fails out of the box since sbt assembly does not necessarily exist

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1181:
--
Reporter: Sean Owen  (was: Sean Owen)

> 'mvn test' fails out of the box since sbt assembly does not necessarily exist
> -
>
> Key: SPARK-1181
> URL: https://issues.apache.org/jira/browse/SPARK-1181
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>  Labels: assembly, maven, sbt, test
>
> The test suite requires that "sbt assembly" has been run in order for some 
> tests (like DriverSuite) to pass. The tests themselves say as much.
> This means that a "mvn test" from a fresh clone fails.
> There's a pretty simple fix, to have Maven's test-compile phase invoke "sbt 
> assembly". I suppose the only downside is re-invoking "sbt assembly" each 
> time tests are run.
> I'm open to ideas about how to set this up more intelligently but it would be 
> a generally good thing if the Maven build's tests passed out of the box.






[jira] [Updated] (SPARK-1071) Tidy logging strategy and use of log4j

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1071:
--
Reporter: Sean Owen  (was: Sean Owen)

> Tidy logging strategy and use of log4j
> --
>
> Key: SPARK-1071
> URL: https://issues.apache.org/jira/browse/SPARK-1071
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Input/Output
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.0.0
>
>
> Prompted by a recent thread on the mailing list, I tried and failed to see if 
> Spark can be made independent of log4j. There are a few cases where control 
> of the underlying logging is pretty useful, and to do that, you have to bind 
> to a specific logger. 
> Instead I propose some tidying that leaves Spark's use of log4j, but gets rid 
> of warnings and should still enable downstream users to switch. The idea is 
> to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J 
> directly when logging, and where Spark needs to output info (REPL and tests), 
> bind from SLF4J to log4j.
> This leaves the same behavior in Spark. It means that downstream users who 
> want to use something other than log4j should:
> - Exclude dependencies on log4j, slf4j-log4j12 from Spark
> - Include dependency on log4j-over-slf4j
> - Include dependency on another logger X, and another slf4j-X
> - Recreate any log config that Spark does, that is needed, in the other 
> logger's config
> That sounds about right.
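> As an illustration only (not part of this change), a downstream sbt build following those steps might look roughly like this, with logback standing in for "another logger X" and versions chosen purely as examples:
> {code}
> // Hypothetical downstream build.sbt sketch; artifact versions are examples only.
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" excludeAll(
>   ExclusionRule(organization = "log4j"),                             // drop log4j itself
>   ExclusionRule(organization = "org.slf4j", name = "slf4j-log4j12")  // and the log4j binding
> )
> libraryDependencies += "org.slf4j" % "log4j-over-slf4j" % "1.7.5"      // route log4j calls into SLF4J
> libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.1.2"  // the replacement SLF4J binding
> {code}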
> Here are the key changes: 
> - Include the jcl-over-slf4j shim everywhere by depending on it in core.
> - Exclude dependencies on commons-logging from third-party libraries.
> - Include the jul-to-slf4j shim everywhere by depending on it in core.
> - Exclude slf4j-* dependencies from third-party libraries to prevent 
> collision or warnings
> - Added missing slf4j-log4j12 binding to GraphX, Bagel module tests
> And minor/incidental changes:
> - Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a 
> recommended update over 1.7.2
> - (Remove a duplicate HBase dependency declaration in SparkBuild.scala)
> - (Remove a duplicate mockito dependency declaration that was causing 
> warnings and bugging me)
> Pull request coming.






[jira] [Updated] (SPARK-1254) Consolidate, order, and harmonize repository declarations in Maven/SBT builds

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1254:
--
Reporter: Sean Owen  (was: Sean Owen)

> Consolidate, order, and harmonize repository declarations in Maven/SBT builds
> -
>
> Key: SPARK-1254
> URL: https://issues.apache.org/jira/browse/SPARK-1254
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.0.0
>
>
> This suggestion addresses a few minor suboptimalities with how repositories 
> are handled.
> 1) Use HTTPS consistently to access repos, instead of HTTP
> 2) Consolidate repository declarations in the parent POM file, in the case of 
> the Maven build, so that their ordering can be controlled to put the fully 
> optional Cloudera repo at the end, after required repos. (This was prompted 
> by the untimely failure of the Cloudera repo this week, which made the Spark 
> build fail. #2 would have prevented that.)
> 3) Update SBT build to match Maven build in this regard
> 4) Update SBT build to *not* refer to Sonatype snapshot repos. This wasn't in 
> Maven, and a build generally would not refer to external snapshots, but I'm 
> not 100% sure on this one.






[jira] [Updated] (SPARK-1335) Also increase perm gen / code cache for scalatest when invoked via Maven build

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1335:
--
Reporter: Sean Owen  (was: Sean Owen)

> Also increase perm gen / code cache for scalatest when invoked via Maven build
> --
>
> Key: SPARK-1335
> URL: https://issues.apache.org/jira/browse/SPARK-1335
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>Assignee: Sean Owen
> Fix For: 1.0.0
>
>
> I am observing build failures when the Maven build reaches tests in the new 
> SQL components. (I'm on Java 7 / OSX 10.9). The failure is the usual 
> complaint from Scala: that it's out of permgen space, or that the JIT is out 
> of code cache space.
> I see that various build scripts increase these both for SBT. This change 
> simply adds these settings to scalatest's arguments. Works for me and seems a 
> bit more consistent.
> (In the PR I'm going to tack on some other little changes too -- see PR.)






[jira] [Updated] (SPARK-1316) Remove use of Commons IO

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1316:
--
Reporter: Sean Owen  (was: Sean Owen)

> Remove use of Commons IO
> 
>
> Key: SPARK-1316
> URL: https://issues.apache.org/jira/browse/SPARK-1316
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> (This follows from a side point on SPARK-1133, in discussion of the PR: 
> https://github.com/apache/spark/pull/164 )
> Commons IO is barely used in the project, and can easily be replaced with 
> equivalent calls to Guava or the existing Spark Utils.scala class.
> Removing a dependency feels good, and this one in particular can get a little 
> problematic since Hadoop uses it too.
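> For illustration, the sort of one-for-one substitution this involves (a sketch, not the actual call sites; Utils.deleteRecursively is Spark-internal, so this is from inside Spark code):
> {code}
> // Hypothetical example of replacing Commons IO calls.
> import java.io.{ByteArrayInputStream, ByteArrayOutputStream, File}
>
> val dir = new File("/tmp/spark-scratch-example")    // hypothetical path
> dir.mkdirs()
> // Instead of org.apache.commons.io.FileUtils.deleteDirectory(dir):
> org.apache.spark.util.Utils.deleteRecursively(dir)  // Spark's existing utility
>
> // Instead of org.apache.commons.io.IOUtils.copy(in, out):
> val in  = new ByteArrayInputStream("hello".getBytes("UTF-8"))
> val out = new ByteArrayOutputStream()
> com.google.common.io.ByteStreams.copy(in, out)      // Guava equivalent
> {code}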






[jira] [Updated] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2341:
--
Assignee: Sean Owen  (was: Sean Owen)

> loadLibSVMFile doesn't handle regression datasets
> -
>
> Key: SPARK-2341
> URL: https://issues.apache.org/jira/browse/SPARK-2341
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Eustache
>Assignee: Sean Owen
>Priority: Minor
>  Labels: easyfix
> Fix For: 1.1.0
>
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 
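> For illustration, the parsing difference is small; a hypothetical sketch (not the MLlib implementation) that keeps the raw target value instead of mapping it to a class index:
> {code}
> // Hypothetical sketch: parse one LibSVM line for regression.
> // Line format: "<target> <index1>:<value1> <index2>:<value2> ..."
> def parseRegressionLine(line: String): (Double, Seq[(Int, Double)]) = {
>   val tokens = line.trim.split("\\s+")
>   val label = tokens.head.toDouble           // keep 3.7 as 3.7, not as a class name
>   val features = tokens.tail.map { t =>
>     val Array(i, v) = t.split(":")
>     (i.toInt - 1, v.toDouble)                // LibSVM indices are 1-based
>   }
>   (label, features.toSeq)
> }
>
> parseRegressionLine("2.5 1:0.7 4:1.2")       // label 2.5, features (0,0.7) and (3,1.2)
> {code}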






[jira] [Updated] (SPARK-1071) Tidy logging strategy and use of log4j

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1071:
--
Assignee: Sean Owen  (was: Sean Owen)

> Tidy logging strategy and use of log4j
> --
>
> Key: SPARK-1071
> URL: https://issues.apache.org/jira/browse/SPARK-1071
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Input/Output
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.0.0
>
>
> Prompted by a recent thread on the mailing list, I tried and failed to see if 
> Spark can be made independent of log4j. There are a few cases where control 
> of the underlying logging is pretty useful, and to do that, you have to bind 
> to a specific logger. 
> Instead I propose some tidying that leaves Spark's use of log4j, but gets rid 
> of warnings and should still enable downstream users to switch. The idea is 
> to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J 
> directly when logging, and where Spark needs to output info (REPL and tests), 
> bind from SLF4J to log4j.
> This leaves the same behavior in Spark. It means that downstream users who 
> want to use something other than log4j should:
> - Exclude dependencies on log4j, slf4j-log4j12 from Spark
> - Include dependency on log4j-over-slf4j
> - Include dependency on another logger X, and another slf4j-X
> - Recreate any log config that Spark does, that is needed, in the other 
> logger's config
> That sounds about right.
> Here are the key changes: 
> - Include the jcl-over-slf4j shim everywhere by depending on it in core.
> - Exclude dependencies on commons-logging from third-party libraries.
> - Include the jul-to-slf4j shim everywhere by depending on it in core.
> - Exclude slf4j-* dependencies from third-party libraries to prevent 
> collision or warnings
> - Added missing slf4j-log4j12 binding to GraphX, Bagel module tests
> And minor/incidental changes:
> - Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a 
> recommended update over 1.7.2
> - (Remove a duplicate HBase dependency declaration in SparkBuild.scala)
> - (Remove a duplicate mockito dependency declaration that was causing 
> warnings and bugging me)
> Pull request coming.






[jira] [Updated] (SPARK-2798) Correct several small errors in Flume module pom.xml files

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2798:
--
Assignee: Sean Owen  (was: Sean Owen)

> Correct several small errors in Flume module pom.xml files
> --
>
> Key: SPARK-2798
> URL: https://issues.apache.org/jira/browse/SPARK-2798
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> (EDIT) Since the scalatest issue was since resolved, this is now about a few 
> small problems in the Flume Sink pom.xml 
> - scalatest is not declared as a test-scope dependency
> - Its Avro version doesn't match the rest of the build
> - Its Flume version is not synced with the other Flume module
> - The other Flume module declares its dependency on Flume Sink slightly 
> incorrectly, hard-coding the Scala 2.10 version
> - It depends on Scala Lang directly, which it shouldn't






[jira] [Updated] (SPARK-1315) spark on yarn-alpha with mvn on master branch won't build

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1315:
--
Assignee: Sean Owen  (was: Sean Owen)

> spark on yarn-alpha with mvn on master branch won't build
> -
>
> Key: SPARK-1315
> URL: https://issues.apache.org/jira/browse/SPARK-1315
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Thomas Graves
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 1.0.0
>
>
> I try to build off master branch using maven to build yarn-alpha but get the 
> following errors.
> mvn  -Dyarn.version=0.23.10 -Dhadoop.version=0.23.10  -Pyarn-alpha  clean 
> package -DskipTests 
> -
> [ERROR] 
> /home/tgraves/y-spark-git/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:25:
>  object runtime is not a member of package reflect [ERROR] import 
> scala.reflect.runtime.universe.runtimeMirror
> [ERROR]  ^
> [ERROR] 
> /home/tgraves/y-spark-git/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:40:
>  not found: value runtimeMirror
> [ERROR]   private val mirror = runtimeMirror(classLoader)
> [ERROR]^
> [ERROR] 
> /home/tgraves/y-spark-git/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:92:
>  object tools is not a member of package scala
> [ERROR] scala.tools.nsc.io.File(".mima-excludes").
> [ERROR]   ^
> [ERROR] three errors found






[jira] [Updated] (SPARK-2879) Use HTTPS to access Maven Central and other repos

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2879:
--
Assignee: Sean Owen  (was: Sean Owen)

> Use HTTPS to access Maven Central and other repos
> -
>
> Key: SPARK-2879
> URL: https://issues.apache.org/jira/browse/SPARK-2879
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> Maven Central has just now enabled HTTPS access for everyone 
> (http://central.sonatype.org/articles/2014/Aug/03/https-support-launching-now/).
> This is timely, as a reminder of how easily an attacker can slip malicious 
> code into a build that's downloading artifacts over HTTP 
> (http://blog.ontoillogical.com/blog/2014/07/28/how-to-take-over-any-java-developer/).
> In the meantime, it looks like the Spring repo also now supports HTTPS, so it 
> can be used this way too.
> I propose to use HTTPS to access these repos.
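> On the SBT side this is essentially a one-line change per resolver; an illustrative sketch (the URLs here are examples, the real list lives in the build files):
> {code}
> // Sketch only: declare resolvers over HTTPS instead of HTTP.
> resolvers ++= Seq(
>   "Maven Repository" at "https://repo.maven.apache.org/maven2",
>   "Spring Releases"  at "https://repo.spring.io/libs-release"
> )
> {code}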






[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-3803:
--
Assignee: Sean Owen  (was: Sean Owen)

> ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
> 
>
> Key: SPARK-3803
> URL: https://issues.apache.org/jira/browse/SPARK-3803
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Masaru Dobashi
>Assignee: Sean Owen
> Fix For: 1.2.0
>
>
> When I executed the computePrincipalComponents method of RowMatrix, I got a 
> java.lang.ArrayIndexOutOfBoundsException.
> {code}
> 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
> RDDFunctions.scala:111
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
> (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161
> 
> org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)
> 
> org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)
> 
> org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)
> 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
> 
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
> scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
> scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
> 
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
> 
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
> 
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
> 
> org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
> org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> The RowMatrix instance was generated from the result of TF-IDF like the 
> following.
> {code}
> scala> val hashingTF = new HashingTF()
> scala> val tf = hashingTF.transform(texts)
> scala> import org.apache.spark.mllib.feature.IDF
> scala> tf.cache()
> scala> val idf = new IDF().fit(tf)
> scala> val tfidf: RDD[Vector] = idf.transform(tf)
> scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> scala> val mat = new RowMatrix(tfidf)
> scala> val pc = mat.computePrincipalComponents(2)
> {code}
> I think this was because I created the HashingTF instance with the default 
> numFeatures, and an Array is used in the RowMatrix#computeGramianMatrix method, 
> like the following.
> {code}
>   /**
>* Computes the Gramian matrix `A^T A`.
>*/
>   def computeGramianMatrix(): Matrix = {
> val n = numCols().toInt
> val nt: Int = n * (n + 1) / 2
> // Compute the upper triangular part of the gram matrix.
> val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))(
>   seqOp = (U, v) => {
> RowMatrix.dspr(1.0, v, U.data)
> U
>   }, combOp = (U1, U2) => U1 += U2)
> RowMatrix.triuToFull(n, GU.data)
>   }
> {code} 
> When the size of the Vectors generated by TF-IDF is too large, it makes "nt" 
> take an undesirable value (and an undesirable size of 
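> For reference, a sketch of the arithmetic, assuming the default HashingTF numFeatures of 1 << 20 (an assumption, not stated above):
> {code}
> // Int arithmetic overflows long before the intended buffer size is reached.
> val n = 1 << 20                          // 1048576 columns
> val nt: Int = n * (n + 1) / 2            // overflows Int: evaluates to 524288
> val ntCorrect = n.toLong * (n + 1) / 2   // 549756338176, far beyond Int.MaxValue
> // The triangular buffer is allocated with the wrapped (far too small) length,
> // so indexing it by a large column position throws ArrayIndexOutOfBoundsException.
> {code}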

[jira] [Updated] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2749:
--
Assignee: Sean Owen  (was: Sean Owen)

> Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing 
> junit:junit dep
> ---
>
> Key: SPARK-2749
> URL: https://issues.apache.org/jira/browse/SPARK-2749
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> The Maven-based builds in the build matrix have been failing for a few days:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark/
> On inspection, it looks like the Spark SQL Java tests don't compile:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull
> I confirmed it by repeating the command vs master:
> mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package
> The problem is that this module doesn't depend on JUnit. In fact, none of the 
> modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it 
> in, in most places. However this module doesn't depend on 
> com.novocode:junit-interface
> Adding the junit:junit dependency fixes the compile problem. In fact, the 
> other modules with Java tests should probably depend on it explicitly instead 
> of happening to get it via com.novocode:junit-interface, since that is a bit 
> SBT/Scala-specific (and I am not even sure it's needed).






[jira] [Updated] (SPARK-1556) jets3t dep doesn't update properly with newer Hadoop versions

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1556:
--
Assignee: Sean Owen  (was: Sean Owen)

> jets3t dep doesn't update properly with newer Hadoop versions
> -
>
> Key: SPARK-1556
> URL: https://issues.apache.org/jira/browse/SPARK-1556
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1, 0.9.0, 1.0.0
>Reporter: Nan Zhu
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 1.0.0
>
>
> Hadoop 2.2.x and newer introduce JetS3t 0.9.0, which defines 
> S3ServiceException/ServiceException; however, Spark still relies on JetS3t 
> 0.7.x, which has no definition of these classes.
> What I ran into is the following:
> [code]
> 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use 
> mapreduce.job.id
> 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
> mapreduce.task.id
> 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, 
> use mapreduce.task.attempt.id
> 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. 
> Instead, use mapreduce.task.ismap
> 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. 
> Instead, use mapreduce.task.partition
> java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
>   at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280)
>   at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:891)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574)
>   at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
>   at $iwC$$iwC$$iwC.<init>(<console>:20)
>   at $iwC$$iwC.<init>(<console>:22)
>   at $iwC.<init>(<console>:24)
>   at <init>(<console>:26)
>   at .<init>(<console>:30)
>   at .<clinit>(<console>)
>   at .<init>(<console>:7)
>   at .<clinit>(<console>)
>   at $print(<console>)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:793)
>   at 
> org.apache.spark.repl.SparkILoop.interp

[jira] [Updated] (SPARK-1335) Also increase perm gen / code cache for scalatest when invoked via Maven build

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1335:
--
Assignee: Sean Owen  (was: Sean Owen)

> Also increase perm gen / code cache for scalatest when invoked via Maven build
> --
>
> Key: SPARK-1335
> URL: https://issues.apache.org/jira/browse/SPARK-1335
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>Assignee: Sean Owen
> Fix For: 1.0.0
>
>
> I am observing build failures when the Maven build reaches tests in the new 
> SQL components. (I'm on Java 7 / OSX 10.9). The failure is the usual 
> complaint from Scala: that it's out of permgen space, or that the JIT is out 
> of code cache space.
> I see that various build scripts increase these both for SBT. This change 
> simply adds these settings to scalatest's arguments. Works for me and seems a 
> bit more consistent.
> (In the PR I'm going to tack on some other little changes too -- see PR.)






[jira] [Updated] (SPARK-2768) Add product, user recommend method to MatrixFactorizationModel

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2768:
--
Assignee: Sean Owen  (was: Sean Owen)

> Add product, user recommend method to MatrixFactorizationModel
> --
>
> Key: SPARK-2768
> URL: https://issues.apache.org/jira/browse/SPARK-2768
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> Right now, MatrixFactorizationModel can only predict a score for one or more 
> (user,product) tuples. As a comment in the file notes, it would be more 
> useful to expose a recommend method that computes the top N scoring products for 
> a user (or vice versa -- users for a product).
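> A rough sketch of the idea (hypothetical code, not the eventual API), scoring every product for one user and keeping the top N:
> {code}
> // Hypothetical sketch built on the existing factor RDDs.
> import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
>
> def recommendProducts(model: MatrixFactorizationModel, user: Int, num: Int): Array[Rating] = {
>   val userVec = model.userFeatures.filter(_._1 == user).first()._2
>   model.productFeatures
>     .map { case (product, features) =>
>       val score = userVec.zip(features).map { case (u, p) => u * p }.sum
>       Rating(user, product, score)
>     }
>     .top(num)(Ordering.by[Rating, Double](_.rating))
> }
> {code}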






[jira] [Updated] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2748:
--
Assignee: Sean Owen  (was: Sean Owen)

> Loss of precision for small arguments to Math.exp, Math.log
> ---
>
> Key: SPARK-2748
> URL: https://issues.apache.org/jira/browse/SPARK-2748
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, MLlib
>Affects Versions: 1.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> In a few places in MLlib, an expression of the form log(1.0 + p) is 
> evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However 
> the correct answer is very near p. This is why Math.log1p exists.
> Similarly for one instance of exp(m) - 1 in GraphX; there's a special 
> Math.expm1 method.
> While the errors occur only for very small arguments, such arguments are 
> entirely possible given their use in machine learning algorithms.
> Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 
> / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I 
> don't think there's a JIRA on that one, so maybe this can serve as an 
> umbrella for all of these related issues.
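> A quick illustration of the precision loss (plain scala.math; values shown as printed):
> {code}
> val p = 1e-20
> math.log(1.0 + p)   // 0.0     (1.0 + p == 1.0 in double precision)
> math.log1p(p)       // 1.0E-20 (correct to within rounding)
>
> math.exp(p) - 1.0   // 0.0
> math.expm1(p)       // 1.0E-20
> {code}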






[jira] [Updated] (SPARK-1973) Add randomSplit to JavaRDD (with tests, and tidy Java tests)

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1973:
--
Assignee: Sean Owen  (was: Sean Owen)

> Add randomSplit to JavaRDD (with tests, and tidy Java tests)
> 
>
> Key: SPARK-1973
> URL: https://issues.apache.org/jira/browse/SPARK-1973
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> I'd like to use randomSplit through the Java API, and would like to add a 
> convenience wrapper for this method to JavaRDD. This is fairly trivial. (In 
> fact, is the intent that JavaRDD not wrap every RDD method? and that 
> sometimes users should just use JavaRDD.wrapRDD()?)
> Along the way, I added tests for it, and also touched up the Java API test 
> style and behavior. This is maybe the more useful part of this small change.
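> The wrapper itself is only a couple of lines; a sketch of the idea (not necessarily the merged code):
> {code}
> // Sketch of the convenience method inside JavaRDD[T] (illustrative only).
> def randomSplit(weights: Array[Double], seed: Long): Array[JavaRDD[T]] =
>   rdd.randomSplit(weights, seed).map(r => JavaRDD.fromRDD(r))
>
> // Called from Java, roughly:
> //   JavaRDD<String>[] parts = lines.randomSplit(new double[] {0.7, 0.3}, 11L);
> {code}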






[jira] [Updated] (SPARK-2745) Add Java friendly methods to Duration class

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2745:
--
Assignee: Sean Owen  (was: Sean Owen)

> Add Java friendly methods to Duration class
> ---
>
> Key: SPARK-2745
> URL: https://issues.apache.org/jira/browse/SPARK-2745
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.2.0
>
>







[jira] [Updated] (SPARK-1209) SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1209:
--
Assignee: Sean Owen  (was: Sean Owen)

> SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop
> --
>
> Key: SPARK-1209
> URL: https://issues.apache.org/jira/browse/SPARK-1209
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Sandy Ryza
>Assignee: Sean Owen
> Fix For: 1.2.0
>
>
> It's private, so the change won't break compatibility






[jira] [Updated] (SPARK-1316) Remove use of Commons IO

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1316:
--
Assignee: Sean Owen  (was: Sean Owen)

> Remove use of Commons IO
> 
>
> Key: SPARK-1316
> URL: https://issues.apache.org/jira/browse/SPARK-1316
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> (This follows from a side point on SPARK-1133, in discussion of the PR: 
> https://github.com/apache/spark/pull/164 )
> Commons IO is barely used in the project, and can easily be replaced with 
> equivalent calls to Guava or the existing Spark Utils.scala class.
> Removing a dependency feels good, and this one in particular can get a little 
> problematic since Hadoop uses it too.






[jira] [Updated] (SPARK-4170) Closure problems when running Scala app that "extends App"

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-4170:
--
Assignee: Sean Owen  (was: Sean Owen)

> Closure problems when running Scala app that "extends App"
> --
>
> Key: SPARK-4170
> URL: https://issues.apache.org/jira/browse/SPARK-4170
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Michael Albert noted this problem on the mailing list 
> (http://apache-spark-user-list.1001560.n3.nabble.com/BUG-when-running-as-quot-extends-App-quot-closures-don-t-capture-variables-td17675.html):
> {code}
> object DemoBug extends App {
> val conf = new SparkConf()
> val sc = new SparkContext(conf)
> val rdd = sc.parallelize(List("A","B","C","D"))
> val str1 = "A"
> val rslt1 = rdd.filter(x => { x != "A" }).count
> val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count
> 
> println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
> }
> {code}
> This produces the output:
> {code}
> DemoBug: rslt1 = 3 rslt2 = 0
> {code}
> If instead there is a proper "main()", it works as expected.
> I also noticed this week that, in a program which "extends App", some values 
> were inexplicably null in a closure. When changing to use main(), it was fine.
> I assume there is a problem with variables not being added to the closure 
> when main() doesn't appear in the standard way.
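> For reference, the workaround mentioned above (an explicit main() instead of "extends App"), sketched with the same logic:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
>
> object DemoFixed {
>   def main(args: Array[String]): Unit = {
>     val conf = new SparkConf()
>     val sc = new SparkContext(conf)
>     val rdd = sc.parallelize(List("A", "B", "C", "D"))
>     val str1 = "A"
>     val rslt1 = rdd.filter(x => x != "A").count
>     val rslt2 = rdd.filter(x => str1 != null && x != "A").count
>     println("DemoFixed: rslt1 = " + rslt1 + " rslt2 = " + rslt2)  // both 3, as expected
>   }
> }
> {code}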






[jira] [Updated] (SPARK-1727) Correct small compile errors, typos, and markdown issues in (primarly) MLlib docs

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1727:
--
Assignee: Sean Owen  (was: Sean Owen)

> Correct small compile errors, typos, and markdown issues in (primarly) MLlib 
> docs
> -
>
> Key: SPARK-1727
> URL: https://issues.apache.org/jira/browse/SPARK-1727
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 0.9.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.0.0
>
>
> While play-testing the Scala and Java code examples in the MLlib docs, I 
> noticed a number of small compile errors, and some typos. This led to finding 
> and fixing a few similar items in other docs. 
> Then in the course of building the site docs to check the result, I found a 
> few small suggestions for the build instructions. I also found a few more 
> formatting and markdown issues uncovered when I accidentally used maruku 
> instead of kramdown.






[jira] [Updated] (SPARK-1789) Multiple versions of Netty dependencies cause FlumeStreamSuite failure

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1789:
--
Assignee: Sean Owen  (was: Sean Owen)

> Multiple versions of Netty dependencies cause FlumeStreamSuite failure
> --
>
> Key: SPARK-1789
> URL: https://issues.apache.org/jira/browse/SPARK-1789
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>  Labels: flume, netty, test
> Fix For: 1.0.0
>
>
> TL;DR: there is a bit of JAR hell trouble with Netty, which can be mostly 
> resolved, and doing so will resolve a test failure.
> I hit the error described at 
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html
> while running FlumeStreamingSuite, and have been hitting it for a short while 
> (is it just me?)
> velvia notes:
> "I have found a workaround.  If you add akka 2.2.4 to your dependencies, then 
> everything works, probably because akka 2.2.4 brings in newer version of 
> Jetty." 
> There are at least 3 versions of Netty in play in the build:
> - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and 
> that is the immediate problem
> - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
> - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final
> The POMs try to exclude other versions of netty, but are excluding 
> org.jboss.netty:netty, when in fact older versions of io.netty:netty (not 
> netty-all) are also an issue.
> The org.jboss.netty:netty excludes are largely unnecessary. I replaced many 
> of them with io.netty:netty exclusions until everything agreed on 
> io.netty:netty-all:4.0.17.Final.
> But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. 
> Down-grading to 3.6.6.Final across the board made some Spark code not compile.
> If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to 
> work. Part of the reason seems to be that Netty 3.x used the old 
> `org.jboss.netty` packages. This is less than ideal, but is no worse than the 
> current situation. 
> So this PR resolves the issue and improves the JAR hell, even if it leaves 
> the existing theoretical Netty 3-vs-4 conflict:
> - Remove org.jboss.netty excludes where possible, for clarity; they're not 
> needed except with Hadoop artifacts
> - Add io.netty:netty excludes where needed -- except, let akka keep its 
> io.netty:netty
> - Change a bit of test code that actually depended on Netty 3.x, to use 4.x 
> equivalent
> - Update SBT build accordingly
> A better change would be to update Akka far enough such that it agrees on 
> Netty 4.x, but I don't know if that's feasible.
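> For illustration, the shape of the exclusion on the SBT side (a sketch only; the real change touches both the POMs and SparkBuild.scala):
> {code}
> // Illustrative sketch: exclude the old io.netty:netty 3.x artifact that
> // Flume 1.4.0 pulls in, not just org.jboss.netty:netty.
> libraryDependencies += "org.apache.flume" % "flume-ng-sdk" % "1.4.0" excludeAll(
>   ExclusionRule(organization = "io.netty", name = "netty")
> )
> {code}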






[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1802:
--
Assignee: Sean Owen  (was: Sean Owen)

> Audit dependency graph when Spark is built with -Phive
> --
>
> Key: SPARK-1802
> URL: https://issues.apache.org/jira/browse/SPARK-1802
> Project: Spark
>  Issue Type: Bug
>Reporter: Patrick Wendell
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: hive-exec-jar-problems.txt
>
>
> I'd like to have binary release for 1.0 include Hive support. Since this 
> isn't enabled by default in the build I don't think it's as well tested, so 
> we should dig around a bit and decide if we need to e.g. add any excludes.
> {code}
> $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl 
> assembly | grep -v INFO | tr ":" "\n" |  awk ' { FS="/"; print ( $(NF) ); }' 
> | sort > without_hive.txt
> $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl 
> assembly | grep -v INFO | tr ":" "\n" |  awk ' { FS="/"; print ( $(NF) ); }' 
> | sort > with_hive.txt
> $ diff without_hive.txt with_hive.txt
> < antlr-2.7.7.jar
> < antlr-3.4.jar
> < antlr-runtime-3.4.jar
> 10,14d6
> < avro-1.7.4.jar
> < avro-ipc-1.7.4.jar
> < avro-ipc-1.7.4-tests.jar
> < avro-mapred-1.7.4.jar
> < bonecp-0.7.1.RELEASE.jar
> 22d13
> < commons-cli-1.2.jar
> 25d15
> < commons-compress-1.4.1.jar
> 33,34d22
> < commons-logging-1.1.1.jar
> < commons-logging-api-1.0.4.jar
> 38d25
> < commons-pool-1.5.4.jar
> 46,49d32
> < datanucleus-api-jdo-3.2.1.jar
> < datanucleus-core-3.2.2.jar
> < datanucleus-rdbms-3.2.1.jar
> < derby-10.4.2.0.jar
> 53,57d35
> < hive-common-0.12.0.jar
> < hive-exec-0.12.0.jar
> < hive-metastore-0.12.0.jar
> < hive-serde-0.12.0.jar
> < hive-shims-0.12.0.jar
> 60,61d37
> < httpclient-4.1.3.jar
> < httpcore-4.1.3.jar
> 68d43
> < JavaEWAH-0.3.2.jar
> 73d47
> < javolution-5.5.1.jar
> 76d49
> < jdo-api-3.0.1.jar
> 78d50
> < jetty-6.1.26.jar
> 87d58
> < jetty-util-6.1.26.jar
> 93d63
> < json-20090211.jar
> 98d67
> < jta-1.1.jar
> 103,104d71
> < libfb303-0.9.0.jar
> < libthrift-0.9.0.jar
> 112d78
> < mockito-all-1.8.5.jar
> 136d101
> < servlet-api-2.5-20081211.jar
> 139d103
> < snappy-0.2.jar
> 144d107
> < spark-hive_2.10-1.0.0.jar
> 151d113
> < ST4-4.0.4.jar
> 153d114
> < stringtemplate-3.2.1.jar
> 156d116
> < velocity-1.7.jar
> 158d117
> < xz-1.0.jar
> {code}
> Some initial investigation suggests we may need to take some precautions 
> surrounding (a) jetty and (b) servlet-api.






[jira] [Updated] (SPARK-1248) Spark build error with Apache Hadoop(Cloudera CDH4)

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1248:
--
Assignee: Sean Owen  (was: Sean Owen)

> Spark build error with Apache Hadoop(Cloudera CDH4)
> ---
>
> Key: SPARK-1248
> URL: https://issues.apache.org/jira/browse/SPARK-1248
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Guoqiang Li
>Assignee: Sean Owen
> Fix For: 1.0.0
>
>
> {code}
> SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true sbt/sbt assembly -d > error.log
> {code}






[jira] [Updated] (SPARK-1120) Send all dependency logging through slf4j

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1120:
--
Assignee: Sean Owen  (was: Sean Owen)

> Send all dependency logging through slf4j
> -
>
> Key: SPARK-1120
> URL: https://issues.apache.org/jira/browse/SPARK-1120
> Project: Spark
>  Issue Type: Improvement
>Reporter: Patrick Cogan
>Assignee: Sean Owen
> Fix For: 1.0.0
>
>
> There are a few dependencies that pull in other logging frameworks which 
> don't get routed correctly. We should include the relevant slf4j adapters and 
> exclude those logging libraries.






[jira] [Updated] (SPARK-2363) Clean MLlib's sample data files

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2363:
--
Assignee: Sean Owen  (was: Sean Owen)

> Clean MLlib's sample data files
> ---
>
> Key: SPARK-2363
> URL: https://issues.apache.org/jira/browse/SPARK-2363
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> MLlib has sample data under several folders:
> 1) data/mllib
> 2) data/
> 3) mllib/data/*
> Per previous discussion with [~matei], we want to put them under `data/mllib` 
> and clean outdated files.






[jira] [Updated] (SPARK-1254) Consolidate, order, and harmonize repository declarations in Maven/SBT builds

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1254:
--
Assignee: Sean Owen  (was: Sean Owen)

> Consolidate, order, and harmonize repository declarations in Maven/SBT builds
> -
>
> Key: SPARK-1254
> URL: https://issues.apache.org/jira/browse/SPARK-1254
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.0.0
>
>
> This suggestion addresses a few minor suboptimalities with how repositories 
> are handled.
> 1) Use HTTPS consistently to access repos, instead of HTTP
> 2) Consolidate repository declarations in the parent POM file, in the case of 
> the Maven build, so that their ordering can be controlled to put the fully 
> optional Cloudera repo at the end, after required repos. (This was prompted 
> by the untimely failure of the Cloudera repo this week, which made the Spark 
> build fail. #2 would have prevented that.)
> 3) Update SBT build to match Maven build in this regard
> 4) Update SBT build to *not* refer to Sonatype snapshot repos. This wasn't in 
> Maven, and a build generally would not refer to external snapshots, but I'm 
> not 100% sure on this one.






[jira] [Updated] (SPARK-1996) Remove use of special Maven repo for Akka

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1996:
--
Assignee: Sean Owen  (was: Sean Owen)

> Remove use of special Maven repo for Akka
> -
>
> Key: SPARK-1996
> URL: https://issues.apache.org/jira/browse/SPARK-1996
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Reporter: Matei Zaharia
>Assignee: Sean Owen
> Fix For: 1.1.0
>
>
> According to http://doc.akka.io/docs/akka/2.3.3/intro/getting-started.html 
> Akka is now published to Maven Central, so our documentation and POM files 
> don't need to use the old Akka repo. It will be one less step for users to 
> worry about.






[jira] [Updated] (SPARK-1827) LICENSE and NOTICE files need a refresh to contain transitive dependency info

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1827:
--
Assignee: Sean Owen  (was: Sean Owen)

> LICENSE and NOTICE files need a refresh to contain transitive dependency info
> -
>
> Key: SPARK-1827
> URL: https://issues.apache.org/jira/browse/SPARK-1827
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 1.0.0
>
>
> (Pardon marking it a blocker, but think it needs doing before 1.0 per chat 
> with [~pwendell])
> The LICENSE and NOTICE files need to cover all transitive dependencies, since 
> these are all distributed in the assembly jar. (c.f. 
> http://www.apache.org/dev/licensing-howto.html )
> I don't believe the current files cover everything. It's possible to 
> mostly-automatically generate these. I will generate this and propose a patch 
> to both today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-15 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278453#comment-14278453
 ] 

Guoqiang Li commented on SPARK-1405:


We can use the demo scripts in word2vec to get the same corpus. 
{code}
normalize_text() {
  awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
  -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
  -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
  -e 's/«/ /g' | tr 0-9 " "
}
wget http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
gzip -d news.2013.en.shuffled.gz
normalize_text < news.2013.en.shuffled > data.txt
{code}
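
For illustration only, a minimal Spark sketch of how the resulting data.txt 
could be turned into the bag-of-words form a Gibbs-sampling LDA consumes (the 
identifiers below are hypothetical, not the API proposed in the PR):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object CorpusPrepSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lda-corpus-prep"))
    // One document per line of the normalized text
    val docs = sc.textFile("data.txt").map(_.split("\\s+").filter(_.nonEmpty))
    // Assign an integer id to each distinct token (fine for a demo; a real job
    // would avoid collecting a huge vocabulary to the driver)
    val vocab = docs.flatMap(identity).distinct().zipWithIndex().collectAsMap()
    // Represent each document as (wordId, count) pairs -- the usual LDA input
    val corpus = docs.map { tokens =>
      tokens.flatMap(vocab.get).groupBy(identity).map { case (id, xs) => (id, xs.length) }
    }
    corpus.take(1).foreach(doc => println(doc.mkString(" ")))
    sc.stop()
  }
}
{code}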

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
> and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1798) Tests should clean up temp files

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1798:
--
Assignee: Sean Owen  (was: Sean Owen)

> Tests should clean up temp files
> 
>
> Key: SPARK-1798
> URL: https://issues.apache.org/jira/browse/SPARK-1798
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 0.9.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.0.0
>
>
> Three issues related to temp files that tests generate -- these should be 
> touched up for hygiene but are not urgent.
> Modules have a log4j.properties which directs the unit-test.log output file 
> to a directory like [module]/target/unit-test.log. But this ends up creating 
> [module]/[module]/target/unit-test.log instead of the former.
> The work/ directory is not deleted by "mvn clean", in the parent and in 
> modules. Neither is the checkpoint/ directory created under the various 
> external modules.
> Many tests create a temp directory, which is not usually deleted. This can be 
> largely resolved by calling deleteOnExit() at creation and trying to call 
> Utils.deleteRecursively consistently to clean up, sometimes in an "@After" 
> method.
> (If anyone seconds the motion, I can create a more significant change that 
> introduces a new test trait along the lines of LocalSparkContext, which 
> provides management of temp directories for subclasses to take advantage of.)
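
For illustration, a sketch of the kind of per-suite temp-directory hygiene 
being suggested (the suite name is hypothetical; inside Spark itself 
Utils.deleteRecursively would do the recursive delete):

{code}
import java.io.File
import java.nio.file.Files
import org.scalatest.{BeforeAndAfter, FunSuite}

class TempDirHygieneSuite extends FunSuite with BeforeAndAfter {
  private var tempDir: File = _

  before {
    tempDir = Files.createTempDirectory("spark-test").toFile
    tempDir.deleteOnExit()  // backstop in case the after-block never runs
  }

  after {
    def delete(f: File): Unit = {
      Option(f.listFiles()).foreach(_.foreach(delete))
      f.delete()
    }
    delete(tempDir)  // clean up everything the test wrote
  }

  test("writes only under the managed temp dir") {
    val out = new File(tempDir, "unit-test.log")
    Files.write(out.toPath, "hello".getBytes("UTF-8"))
    assert(out.exists())
  }
}
{code}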



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3356) Document when RDD elements' ordering within partitions is nondeterministic

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-3356:
--
Assignee: Sean Owen  (was: Sean Owen)

> Document when RDD elements' ordering within partitions is nondeterministic
> --
>
> Key: SPARK-3356
> URL: https://issues.apache.org/jira/browse/SPARK-3356
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Sean Owen
> Fix For: 1.2.0
>
>
> As reported in SPARK-3098 for example, for users of zipWithIndex, 
> zipWithUniqueId, etc. (and maybe even things like mapPartitions) it's 
> confusing that the order of elements in each partition after a shuffle 
> operation is nondeterministic (unless the operation was sortByKey). We should 
> explain this in the docs for the zip and partition-wise operations.
> Another subtle issue is that the order of values for each key in groupBy / 
> join / etc can be nondeterministic -- we need to explain that too.
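
For illustration, the kind of behavior the docs could call out (a runnable 
sketch, not proposed wording):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object OrderingCaveatSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ordering-caveat").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

    // After a shuffle, the order of elements within each partition is not
    // guaranteed, so indices assigned by zipWithIndex can differ between runs.
    val shuffled = pairs.reduceByKey(_ + _)
    println(shuffled.zipWithIndex().collect().take(3).mkString(", "))

    // Only an explicit sort gives a deterministic order to index against.
    println(shuffled.sortByKey().zipWithIndex().collect().take(3).mkString(", "))
    sc.stop()
  }
}
{code}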



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2955) Test code fails to compile with "mvn compile" without "install"

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2955:
--
Assignee: Sean Owen  (was: Sean Owen)

> Test code fails to compile with "mvn compile" without "install" 
> 
>
> Key: SPARK-2955
> URL: https://issues.apache.org/jira/browse/SPARK-2955
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.2
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: build, compile, scalatest, test, test-compile
> Fix For: 1.2.0
>
>
> (This is the corrected follow-up to 
> https://issues.apache.org/jira/browse/SPARK-2903 )
> Right now, "mvn compile test-compile" fails to compile Spark. (Don't worry; 
> "mvn package" works, so this is not major.) The issue stems from test code in 
> some modules depending on test code in other modules. That is perfectly fine 
> and supported by Maven.
> It takes extra work to get this to work with scalatest, and this has been 
> attempted: 
> https://github.com/apache/spark/blob/master/sql/catalyst/pom.xml#L86
> This formulation is not quite enough, since the SQL Core module's tests fail 
> to compile for lack of finding test classes in SQL Catalyst, and likewise for 
> most Streaming integration modules depending on core Streaming test code. 
> Example:
> {code}
> [error] 
> /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:23:
>  not found: type PlanTest
> [error] class QueryTest extends PlanTest {
> [error] ^
> [error] 
> /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:28:
>  package org.apache.spark.sql.test is not a value
> [error]   test("SPARK-1669: cacheTable should be idempotent") {
> [error]   ^
> ...
> {code}
> The issue I believe is that generation of a test-jar is bound here to the 
> compile phase, but the test classes are not being compiled in this phase. It 
> should bind to the test-compile phase.
> It works when executing "mvn package" or "mvn install" since test-jar 
> artifacts are actually generated and made available through normal Maven 
> mechanisms as each module is built. They are then found normally, regardless 
> of scalatest configuration.
> It would be nice for a simple "mvn compile test-compile" to work since the 
> test code is perfectly compilable given the Maven declarations.
> On the plus side, this change is low-risk as it only affects tests.
> [~yhuai] made the original scalatest change and has glanced at this and 
> thinks it makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2034) KafkaInputDStream doesn't close resources and may prevent JVM shutdown

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2034:
--
Assignee: Sean Owen  (was: Sean Owen)

> KafkaInputDStream doesn't close resources and may prevent JVM shutdown
> --
>
> Key: SPARK-2034
> URL: https://issues.apache.org/jira/browse/SPARK-2034
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
> Fix For: 1.0.1, 1.1.0
>
>
> Tobias noted today on the mailing list:
> {quote}
> I am trying to use Spark Streaming with Kafka, which works like a
> charm -- except for shutdown. When I run my program with "sbt
> run-main", sbt will never exit, because there are two non-daemon
> threads left that don't die.
> I created a minimal example at
> .
> It starts a StreamingContext and does nothing more than connecting to
> a Kafka server and printing what it receives. Using the `future { ...
> }` construct, I shut down the StreamingContext after some seconds and
> then print the difference between the threads at start time and at end
> time. The output can be found at
> .
> There are a number of threads remaining that will prevent sbt from
> exiting.
> When I replace `KafkaUtils.createStream(...)` with a call that does
> exactly the same, except that it calls `consumerConnector.shutdown()`
> in `KafkaReceiver.onStop()` (which it should, IMO), the output is as
> shown at 
> .
> Does anyone have *any* idea what is going on here and why the program
> doesn't shut down properly? The behavior is the same with both kafka
> 0.8.0 and 0.8.1.1, by the way.
> {quote}
> Something similar was noted last year:
> http://mail-archives.apache.org/mod_mbox/spark-dev/201309.mbox/%3c1380220041.2428.yahoomail...@web160804.mail.bf1.yahoo.com%3E
>  
> KafkaInputDStream doesn't close ConsumerConnector in onStop(), and does not 
> close the Executor it creates. The latter leaves non-daemon threads and can 
> prevent the JVM from shutting down even if streaming is closed properly.
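
For illustration, the shape of the cleanup being described (the parameter names 
below stand in for KafkaInputDStream's internals; this is a sketch, not the 
actual patch):

{code}
import java.util.concurrent.{ExecutorService, TimeUnit}
import kafka.consumer.ConsumerConnector

object ReceiverCleanupSketch {
  // Called from onStop(): release both resources so no non-daemon threads remain.
  def stopReceiver(consumerConnector: ConsumerConnector,
                   executorPool: ExecutorService): Unit = {
    if (consumerConnector != null) {
      consumerConnector.shutdown()      // closes the Kafka consumer threads
    }
    if (executorPool != null) {
      executorPool.shutdown()           // lets the worker threads finish and exit
      executorPool.awaitTermination(10, TimeUnit.SECONDS)
    }
  }
}
{code}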



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1084) Fix most build warnings

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1084:
--
Assignee: Sean Owen  (was: Sean Owen)

> Fix most build warnings
> ---
>
> Key: SPARK-1084
> URL: https://issues.apache.org/jira/browse/SPARK-1084
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 0.9.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: mvn, sbt, warning
> Fix For: 1.0.0
>
>
> I hope another boring tidy-up JIRA might be welcome. I'd like to fix most of 
> the warnings that appear during build, so that developers don't become 
> accustomed to them. The accompanying pull request contains a number of 
> commits to quash most warnings observed through the mvn and sbt builds, 
> although not all of them.
> FIXED!
> [WARNING] Parameter tasks is deprecated, use target instead
> Just a matter of updating tasks -> target in inline Ant scripts.
> WARNING: -p has been deprecated and will be reused for a different (but still 
> very cool) purpose in ScalaTest 2.0. Please change all uses of -p to -R.
> Goes away with updating scalatest plugin -> 1.0-RC2
> [WARNING] Note: 
> /Users/srowen/Documents/incubator-spark/core/src/test/scala/org/apache/spark/JavaAPISuite.java
>  uses unchecked or unsafe operations.
> [WARNING] Note: Recompile with -Xlint:unchecked for details.
> Mostly @SuppressWarnings("unchecked") but needed a few more things to reveal 
> the warning source: true (also needed for ) and version 
> 3.1 of the plugin. In a few cases some declaration changes were appropriate 
> to avoid warnings.
> /Users/srowen/Documents/incubator-spark/core/src/main/scala/org/apache/spark/util/IndestructibleActorSystem.scala:25:
>  warning: Could not find any member to link for "akka.actor.ActorSystem".
> /**
> ^
> Getting several scaladoc errors like this and I'm not clear why it can't find 
> the type -- outside its module? Remove the links as they're evidently not 
> linking anyway?
> /Users/srowen/Documents/incubator-spark/repl/src/main/scala/org/apache/spark/repl/SparkIMain.scala:86:
>  warning: Variable eval undefined in comment for class SparkIMain in class 
> SparkIMain
> $ has to be escaped as \$ in scaladoc, apparently
> [WARNING] 
> 'dependencyManagement.dependencies.dependency.exclusions.exclusion.artifactId'
>  for org.apache.hadoop:hadoop-yarn-client:jar with value '*' does not match a 
> valid id pattern. @ org.apache.spark:spark-parent:1.0.0-incubating-SNAPSHOT, 
> /Users/srowen/Documents/incubator-spark/pom.xml, line 494, column 25
> This one might need review.
> This is valid Maven syntax, but, Maven still warns on it. I wanted to see if 
> we can do without it. 
> These are trying to exclude:
> - org.codehaus.jackson
> - org.sonatype.sisu.inject
> - org.xerial.snappy
> org.sonatype.sisu.inject doesn't actually seem to be a dependency anyway. 
> org.xerial.snappy is used by dependencies but the version seems to match 
> anyway (1.0.5).
> org.codehaus.jackson was intended to exclude 1.8.8, since Spark streaming 
> wants 1.9.11 directly. But the exclusion is in the wrong place if so, since 
> Spark depends straight on Avro, which is what brings in 1.8.8, still. 
> (hadoop-client 1.0.4 includes Jackson 1.0.1, so that needs an exclusion, but 
> the other Hadoop modules don't.)
> HBase depends on 1.8.8 but figured it was intentional to leave that as it 
> would not collide with Spark streaming. (?)
> (I understand this varies by Hadoop version but confirmed this is all the 
> same for 1.0.4, 0.23.7, 2.2.0.)
> NOT FIXED.
> [warn] 
> /Users/srowen/Documents/incubator-spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:305:
>  method connect in class IOManager is deprecated: use the new implementation 
> in package akka.io instead
> [warn]   override def preStart = IOManager(context.system).connect(new 
> InetSocketAddress(port))
> Not confident enough to fix this.
> [WARNING] there were 6 feature warning(s); re-run with -feature for details
> Don't know enough Scala to address these, yet.
> [WARNING] We have a duplicate 
> org/yaml/snakeyaml/scanner/ScannerImpl$Chomping.class in 
> /Users/srowen/.m2/repository/org/yaml/snakeyaml/1.6/snakeyaml-1.6.jar
> Probably addressable by being more careful about how binaries are packed 
> though this appears to be ignorable; two identical copies of the class are 
> colliding.
> [WARNING] Zinc server is not available at port 3030 - reverting to normal 
> incremental compile
> and
> [WARNING] JAR will be empty - no content was marked for inclusion!
> Apparently harmless warnings, but I don't know how to disable them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-1663) Spark Streaming docs code has several small errors

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1663:
--
Assignee: Sean Owen  (was: Sean Owen)

> Spark Streaming docs code has several small errors
> --
>
> Key: SPARK-1663
> URL: https://issues.apache.org/jira/browse/SPARK-1663
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 0.9.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: streaming
> Fix For: 1.0.0
>
>
> The changes are easiest to elaborate in the PR, which I will open shortly.
> Those changes raised a few little questions about the API too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2602) sbt/sbt test steals window focus on OS X

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2602:
--
Assignee: Sean Owen  (was: Sean Owen)

> sbt/sbt test steals window focus on OS X
> 
>
> Key: SPARK-2602
> URL: https://issues.apache.org/jira/browse/SPARK-2602
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.1.0
>
>
> On OS X, I run {{sbt/sbt test}} from Terminal and then go off and do 
> something else with my computer. It appears that there are several things in 
> the test suite that launch Java programs that, for some reason, steal window 
> focus. 
> It can get very annoying, especially if you happen to be typing something in 
> a different window, to be suddenly teleported to a random Java application 
> and have your finely crafted keystrokes be sent where they weren't intended.
> It would be nice if {{sbt/sbt test}} didn't do that.
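
One workaround for this class of problem, assuming the focus stealing comes 
from test JVMs initializing AWT, is to run them headless; a sketch of what that 
could look like in the SBT build (not a confirmed fix):

{code}
// build.sbt sketch
fork in Test := true                              // the flag applies to forked test JVMs
javaOptions in Test += "-Djava.awt.headless=true" // never create a GUI context that grabs focus
{code}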



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5263) `create table` DDL need to check if table exists first

2015-01-15 Thread shengli (JIRA)
shengli created SPARK-5263:
--

 Summary: `create table` DDL  need to check if table exists first
 Key: SPARK-5263
 URL: https://issues.apache.org/jira/browse/SPARK-5263
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
Priority: Minor
 Fix For: 1.3.0


The `create table` DDL needs to check whether the table exists first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5263) `create table` DDL need to check if table exists first

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278476#comment-14278476
 ] 

Apache Spark commented on SPARK-5263:
-

User 'OopsOutOfMemory' has created a pull request for this issue:
https://github.com/apache/spark/pull/4058

> `create table` DDL  need to check if table exists first
> ---
>
> Key: SPARK-5263
> URL: https://issues.apache.org/jira/browse/SPARK-5263
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: shengli
>Priority: Minor
> Fix For: 1.3.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> The `create table` DDL needs to check whether the table exists first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5264) support `drop table` DDL command

2015-01-15 Thread shengli (JIRA)
shengli created SPARK-5264:
--

 Summary: support `drop table` DDL command 
 Key: SPARK-5264
 URL: https://issues.apache.org/jira/browse/SPARK-5264
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
Priority: Minor
 Fix For: 1.3.0


support `drop table` DDL command 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster

2015-01-15 Thread Takumi Yoshida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278502#comment-14278502
 ] 

Takumi Yoshida commented on SPARK-5243:
---

Hi!

I found that Spark hangs in the following situation; I guess there may be some 
other conditions involved.

> 1. the cluster has only one worker.
yes, running standalone.
 
> 2. driver memory + executor memory > worker memory
I used the following settings, but it still hangs.

driver memory = 1g
executor memory = 1g
worker memory = 3g

> 3. deploy-mode = cluster
no, deploy-mode was "client" as default.

I used the following code.
 https://gist.github.com/yoshi0309/33bd912d91c0bb5cdf30

command.
 ./bin/spark-submit ./ldgourmetALS.py s3n://abc-takumiyoshida/datasets/ 
--driver-memory 1g

machine.
 Amazon EC2 / m3.medium (3ECU and 3.75GB RAM)






> Spark will hang if (driver memory + executor memory) exceeds limit on a 
> 1-worker cluster
> 
>
> Key: SPARK-5243
> URL: https://issues.apache.org/jira/browse/SPARK-5243
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.2.0
> Environment: centos, others should be similar
>Reporter: yuhao yang
>Priority: Minor
>
> Spark will hang if calling spark-submit under the conditions:
> 1. the cluster has only one worker.
> 2. driver memory + executor memory > worker memory
> 3. deploy-mode = cluster
> This usually happens during development for beginners.
> There should be some exit mechanism or at least a warning message in the 
> output of the spark-submit.
> I am preparing a PR for this case, and I would like to know your opinions on 
> whether a fix is needed and what the better fix options are.
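
For discussion, a sketch of the kind of warning being proposed (names are 
illustrative, not the actual deploy code):

{code}
object SubmitSanityCheckSketch {
  def warnIfUnschedulable(driverMemMb: Int, executorMemMb: Int, workerMemMb: Int): Unit = {
    if (driverMemMb + executorMemMb > workerMemMb) {
      Console.err.println(
        s"WARNING: driver ($driverMemMb MB) + executor ($executorMemMb MB) memory exceeds " +
        s"the single worker's $workerMemMb MB; in cluster mode the application cannot be scheduled.")
    }
  }
}
{code}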



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve

2015-01-15 Thread Vladimir Grigor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Grigor updated SPARK-5246:
---
Description: 
##How to reproduce: 
1)  http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html 
should be sufficient to setup VPC for this bug. After you followed that guide, 
start new instance in VPC, ssh to it (through the NAT server)

2) user starts a cluster in VPC:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
--spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
--subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
Setting up security groups...

(omitted for brevity)
10.1.1.62
10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
no org.apache.spark.deploy.master.Master to stop
starting org.apache.spark.deploy.master.Master, logging to 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
failed to launch org.apache.spark.deploy.master.Master:
at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
... 12 more
full log in 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker:
10.1.1.62:  at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
10.1.1.62:  ... 12 more
10.1.1.62: full log in 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
[timing] spark-standalone setup:  00h 00m 28s
 
(omitted for brevity)
{code}

/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
{code}
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp 
:::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 --webui-port 
8080


15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, 
HUP, INT]
Exception in thread "main" java.net.UnknownHostException: ip-10-1-1-151: 
ip-10-1-1-151: Name or service not known
at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620)
at 
org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612)
at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612)
at 
org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613)
at org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613)
at 
org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
at 
org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.util.Utils$.localHostName(Utils.scala:665)
at 
org.apache.spark.deploy.master.MasterArguments.(MasterArguments.scala:27)
at org.apache.spark.deploy.master.Master$.main(Master.scala:819)
at org.apache.spark.deploy.master.Master.main(Master.scala)
Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not 
known
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
at 
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
... 12 more
{code}

The problem is that an instance launched in a VPC may not be able to resolve its 
own local hostname. Please see https://forums.aws.amazon.com/thread.jspa?threadID=92092.

I am going to submit a fix for this problem since I need this functionality 
asap.


## How to reproduce

  was:
How to reproduce: 
1) user starts a cluster in VPC:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
--spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
--subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
Setting up security groups...

(omitted for brevity)
10.1.1.62
10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
no org.apache.spark.deploy.master.Master to stop
starting org.apache.spark.deploy.master.Master, logging to 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
failed to launch org.apache.spark.deploy.master.Master:
at java.net.

[jira] [Updated] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve

2015-01-15 Thread Vladimir Grigor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Grigor updated SPARK-5246:
---
Description: 
How to reproduce: 

1)  http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html 
should be sufficient to setup VPC for this bug. After you followed that guide, 
start new instance in VPC, ssh to it (through the NAT server)

2) user starts a cluster in VPC:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
--spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
--subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
Setting up security groups...

(omitted for brevity)
10.1.1.62
10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
no org.apache.spark.deploy.master.Master to stop
starting org.apache.spark.deploy.master.Master, logging to 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
failed to launch org.apache.spark.deploy.master.Master:
at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
... 12 more
full log in 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker:
10.1.1.62:  at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
10.1.1.62:  ... 12 more
10.1.1.62: full log in 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
[timing] spark-standalone setup:  00h 00m 28s
 
(omitted for brevity)
{code}

/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
{code}
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp 
:::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 --webui-port 
8080


15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, 
HUP, INT]
Exception in thread "main" java.net.UnknownHostException: ip-10-1-1-151: 
ip-10-1-1-151: Name or service not known
at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620)
at 
org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612)
at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612)
at 
org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613)
at org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613)
at 
org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
at 
org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.util.Utils$.localHostName(Utils.scala:665)
at 
org.apache.spark.deploy.master.MasterArguments.(MasterArguments.scala:27)
at org.apache.spark.deploy.master.Master$.main(Master.scala:819)
at org.apache.spark.deploy.master.Master.main(Master.scala)
Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not 
known
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
at 
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
... 12 more
{code}

The problem is that an instance launched in a VPC may not be able to resolve its 
own local hostname. Please see https://forums.aws.amazon.com/thread.jspa?threadID=92092.

I am going to submit a fix for this problem since I need this functionality 
asap.


  was:
##How to reproduce: 
1)  http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html 
should be sufficient to setup VPC for this bug. After you followed that guide, 
start new instance in VPC, ssh to it (through the NAT server)

2) user starts a cluster in VPC:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
--spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
--subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
Setting up security groups...

(omitted for brevity)
10.1.1.62
10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
no org.apache.spark.deploy.master.Master to stop
starting org.apache.s

[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278614#comment-14278614
 ] 

Apache Spark commented on SPARK-5012:
-

User 'FlytxtRnD' has created a pull request for this issue:
https://github.com/apache/spark/pull/4059

> Python API for Gaussian Mixture Model
> -
>
> Key: SPARK-5012
> URL: https://issues.apache.org/jira/browse/SPARK-5012
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Meethu Mathew
>
> Add Python API for the Scala implementation of GMM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5264) support `drop table` DDL command

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278639#comment-14278639
 ] 

Apache Spark commented on SPARK-5264:
-

User 'OopsOutOfMemory' has created a pull request for this issue:
https://github.com/apache/spark/pull/4060

> support `drop table` DDL command 
> -
>
> Key: SPARK-5264
> URL: https://issues.apache.org/jira/browse/SPARK-5264
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: shengli
>Priority: Minor
> Fix For: 1.3.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> support `drop table` DDL command 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master

2015-01-15 Thread Roque Vassal'lo (JIRA)
Roque Vassal'lo created SPARK-5265:
--

 Summary: Submitting applications on Standalone cluster controlled 
by Zookeeper forces to know active master
 Key: SPARK-5265
 URL: https://issues.apache.org/jira/browse/SPARK-5265
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Roque Vassal'lo


Hi, this is my first JIRA here, so I hope it is clear enough.

I'm using Spark 1.2.0 and trying to submit an application on a Spark Standalone 
cluster in cluster deploy mode with supervise.

Standalone cluster is running in high availability mode, using Zookeeper to 
provide leader election between three available Masters (named master1, master2 
and master3).

As described in Spark's documentation, to register a Worker with the Standalone 
cluster, I provide the complete cluster info as the spark URL.
I mean, spark://master1:7077,master2:7077,master3:7077
and that URL is parsed and three attempts are launched, first one to 
master1:7077, second one to master2:7077 and third one to master3:7077.
This works great!

But if I try to do the same while submitting applications, it fails.
I mean, if I provide the complete cluster info as the --master option to the 
spark-submit script, it throws an exception because it tries to connect as if 
it were a single node.
Example:
spark-submit --class org.apache.spark.examples.SparkPi --master 
spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster 
--supervise examples.jar 100

This is the output I got:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest
15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest
15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mytest); users 
with modify permissions: Set(mytest)
15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started
15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on 
port 53930.
15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: 
spark://master1:7077,master2:7077,master3:7077
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
at akka.actor.ActorCell.create(ActorCell.scala:596)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Invalid master URL: 
spark://master1:7077,master2:7077,master3:7077
at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
at akka.actor.ActorCell.create(ActorCell.scala:580)
... 9 more


Shouldn't it be parsed the same way as on Worker registration?
That would not force the client to know which is the currently active Master of 
the Standalone cluster.
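
For illustration, the kind of parsing that would let spark-submit accept the 
same multi-master URL as Worker registration (a sketch, not the actual 
Master.toAkkaUrl change; the Akka URL format is assumed from Spark 1.2):

{code}
object MasterUrlSketch {
  // Split spark://host1:port1,host2:port2,... into one Akka URL per master.
  def toAkkaUrls(sparkUrl: String): Seq[String] = {
    require(sparkUrl.startsWith("spark://"), s"Invalid master URL: $sparkUrl")
    sparkUrl.stripPrefix("spark://").split(",").toSeq.map { hostPort =>
      s"akka.tcp://sparkMaster@$hostPort/user/Master"
    }
  }
}

// toAkkaUrls("spark://master1:7077,master2:7077,master3:7077") would yield
// three candidate URLs to try in turn, just as Worker registration does.
{code}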




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5266) numExecutorsFailed should exclude number of killExecutor in yarn mode

2015-01-15 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-5266:
---

 Summary: numExecutorsFailed should exclude number of killExecutor 
in yarn mode
 Key: SPARK-5266
 URL: https://issues.apache.org/jira/browse/SPARK-5266
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Lianhui Wang


When the driver requests killExecutor, the AM will kill the container and 
numExecutorsFailed will increment. When numExecutorsFailed > 
maxNumExecutorFailures in the AM, the AM will exit with the 
EXIT_MAX_EXECUTOR_FAILURES reason. So numExecutorsFailed should exclude 
executors killed at the driver's request.
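
For illustration, the shape of the bookkeeping being proposed (names 
approximate the YARN AM internals; this is a sketch, not the actual patch):

{code}
import scala.collection.mutable

class ExecutorFailureTrackerSketch(maxNumExecutorFailures: Int) {
  private val pendingKills = mutable.Set[String]()  // container ids killed on request
  private var numExecutorsFailed = 0

  def onKillRequest(containerId: String): Unit = pendingKills += containerId

  def onContainerCompleted(containerId: String, exitedCleanly: Boolean): Unit = {
    if (pendingKills.remove(containerId)) {
      // Killed at the driver's request: not counted as a failure.
    } else if (!exitedCleanly) {
      numExecutorsFailed += 1
    }
  }

  def shouldAbort: Boolean = numExecutorsFailed > maxNumExecutorFailures
}
{code}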



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5266) numExecutorsFailed should exclude number of killExecutor in yarn mode

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278684#comment-14278684
 ] 

Apache Spark commented on SPARK-5266:
-

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/4061

> numExecutorsFailed should exclude number of killExecutor in yarn mode
> -
>
> Key: SPARK-5266
> URL: https://issues.apache.org/jira/browse/SPARK-5266
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Lianhui Wang
>
> When the driver requests killExecutor, the AM will kill the container and 
> numExecutorsFailed will increment. When numExecutorsFailed > 
> maxNumExecutorFailures in the AM, the AM will exit with the 
> EXIT_MAX_EXECUTOR_FAILURES reason. So numExecutorsFailed should exclude 
> executors killed at the driver's request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5266) numExecutorsFailed should exclude number of killExecutor in yarn mode

2015-01-15 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang closed SPARK-5266.
---
Resolution: Fixed

> numExecutorsFailed should exclude number of killExecutor in yarn mode
> -
>
> Key: SPARK-5266
> URL: https://issues.apache.org/jira/browse/SPARK-5266
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Lianhui Wang
>
> When the driver requests killExecutor, the AM will kill the container and 
> numExecutorsFailed will increment. When numExecutorsFailed > 
> maxNumExecutorFailures in the AM, the AM will exit with the 
> EXIT_MAX_EXECUTOR_FAILURES reason. So numExecutorsFailed should exclude 
> executors killed at the driver's request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4943) Parsing error for query with table name having dot

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278710#comment-14278710
 ] 

Apache Spark commented on SPARK-4943:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/4062

> Parsing error for query with table name having dot
> --
>
> Key: SPARK-4943
> URL: https://issues.apache.org/jira/browse/SPARK-4943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Alex Liu
> Fix For: 1.3.0, 1.2.1
>
>
> When integrating Spark 1.2.0 with Cassandra SQL, the following query is 
> broken. It was working in Spark 1.1.0. Basically we use a table name 
> containing a dot to include the database name. 
> {code}
> [info]   java.lang.RuntimeException: [1.29] failure: ``UNION'' expected but 
> `.' found
> [info] 
> [info] SELECT test1.a FROM sql_test.test1 AS test1 UNION DISTINCT SELECT 
> test2.a FROM sql_test.test2 AS test2
> [info] ^
> [info]   at scala.sys.package$.error(package.scala:27)
> [info]   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
> [info]   at 
> org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
> [info]   at 
> org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79)
> [info]   at 
> org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:174)
> [info]   at 
> org.apache.spark.sql.catalyst.SparkSQLParser$$anonfun$org$apache$spark$sql$catalyst$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:173)
> [info]   at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> [info]   at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> [info]   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> [info]   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> [info]   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> [info]   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> [info]   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> [info]   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
> [info]   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> [info]   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> [info]   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> [info]   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> [info]   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> [info]   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> [info]   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
> [info]   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
> [info]   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:31)
> [info]   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
> [info]   at 
> org.apache.spark.sql.SQLContext$$anonfun$parseSql$1.apply(SQLContext.scala:83)
> [info]   at scala.Option.getOrElse(Option.scala:120)
> [info]   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:83)
> [info]   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext.cassandraSql(CassandraSQLContext.scala:53)
> [info]   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext.sql(CassandraSQLContext.scala:56)
> [info]   at 
> com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply$mcV$sp(CassandraSQLSpec.scala:169)
> [info]   at 
> com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
> [info]   at 
> com.datastax.spark.connector.sql.CassandraSQLSpec$$anonfun$20.apply(CassandraSQLSpec.scala:168)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647)
> [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
> [info]   at org.scalatest.FlatSpec.withFixture(FlatSpec.sc

[jira] [Created] (SPARK-5267) Add a streaming module to ingest Apache Camel Messages from a configured endpoints

2015-01-15 Thread Steve Brewin (JIRA)
Steve Brewin created SPARK-5267:
---

 Summary: Add a streaming module to ingest Apache Camel Messages 
from a configured endpoints
 Key: SPARK-5267
 URL: https://issues.apache.org/jira/browse/SPARK-5267
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Steve Brewin


The number of input stream protocols supported by Spark Streaming is quite 
limited, which constrains the number of systems with which it can be integrated.

This proposal solves the problem by adding an optional module that integrates 
Apache Camel, which supports many more input protocols. Our tried and tested 
implementation of this proposal is "spark-streaming-camel". 

An Apache Camel service is run on a separate Thread, consuming each Camel 
Message 
(http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html)
 and storing it into Spark's memory. The provider of the Message is specified 
by any consuming component URI documented at 
http://camel.apache.org/components.html, making all of these protocols 
available to Spark Streaming.
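
To make the idea concrete, a minimal sketch of such a receiver (class name and 
details are illustrative; the real spark-streaming-camel module is more 
complete):

{code}
import org.apache.camel.impl.DefaultCamelContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Consumes from any Camel endpoint URI listed at
// http://camel.apache.org/components.html and stores the bodies in Spark.
class CamelReceiverSketch(endpointUri: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  @volatile private var running = false

  def onStart(): Unit = {
    running = true
    new Thread("Camel Receiver") {
      override def run(): Unit = {
        val camelContext = new DefaultCamelContext()
        val consumer = camelContext.createConsumerTemplate()
        camelContext.start()
        while (running && !isStopped()) {
          // Poll with a timeout so the loop stays responsive to shutdown
          val exchange = consumer.receive(endpointUri, 1000)
          if (exchange != null) {
            store(exchange.getIn.getBody(classOf[String]))
          }
        }
        consumer.stop()
        camelContext.stop()
      }
    }.start()
  }

  def onStop(): Unit = { running = false }
}
{code}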

Thoughts?






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve

2015-01-15 Thread Vladimir Grigor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278780#comment-14278780
 ] 

Vladimir Grigor commented on SPARK-5246:


https://github.com/mesos/spark-ec2/pull/91

> spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does 
> not resolve
> --
>
> Key: SPARK-5246
> URL: https://issues.apache.org/jira/browse/SPARK-5246
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Vladimir Grigor
>
> How to reproduce: 
> 1)  http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html 
> should be sufficient to setup VPC for this bug. After you followed that 
> guide, start new instance in VPC, ssh to it (through the NAT server)
> 2) user starts a cluster in VPC:
> {code}
> ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
> --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
> --subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
> Setting up security groups...
> 
> (omitted for brevity)
> 10.1.1.62
> 10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
> no org.apache.spark.deploy.master.Master to stop
> starting org.apache.spark.deploy.master.Master, logging to 
> /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
> failed to launch org.apache.spark.deploy.master.Master:
>   at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
>   ... 12 more
> full log in 
> /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
> 10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to 
> /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
> 10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker:
> 10.1.1.62:at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
> 10.1.1.62:... 12 more
> 10.1.1.62: full log in 
> /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
> [timing] spark-standalone setup:  00h 00m 28s
>  
> (omitted for brevity)
> {code}
> /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
> {code}
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp 
> :::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar
>  -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
> org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 
> --webui-port 8080
> 
> 15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, 
> HUP, INT]
> Exception in thread "main" java.net.UnknownHostException: ip-10-1-1-151: 
> ip-10-1-1-151: Name or service not known
> at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
> at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620)
> at 
> org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612)
> at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612)
> at 
> org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613)
> at 
> org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613)
> at 
> org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
> at 
> org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.util.Utils$.localHostName(Utils.scala:665)
> at 
> org.apache.spark.deploy.master.MasterArguments.(MasterArguments.scala:27)
> at org.apache.spark.deploy.master.Master$.main(Master.scala:819)
> at org.apache.spark.deploy.master.Master.main(Master.scala)
> Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not 
> known
> at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
> at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
> at 
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
> at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
> ... 12 more
> {code}
> The problem is that an instance launched in a VPC may not be able to resolve 
> its own local hostname. Please see 
> https://forums.aws.amazon.com/thread.jspa?threadID=92092.
> I am going to submit a fix for this problem since I need this functionality 
> asap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (SPARK-5268) ExecutorBackend exits for irrelevant DisassociatedEvent

2015-01-15 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-5268:
--

 Summary: ExecutorBackend exits for irrelevant DisassociatedEvent
 Key: SPARK-5268
 URL: https://issues.apache.org/jira/browse/SPARK-5268
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Nan Zhu


In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the 
executor backend actor and exit the program upon receiving such an event...

Let's consider the following case:

The user may develop an Akka-based program which starts an actor with Spark's 
actor system and communicates with an external actor system (e.g. an Akka-based 
receiver in Spark Streaming which communicates with an external system). If the 
external actor system fails or deliberately disassociates from the actor within 
Spark's system, we may receive a DisassociatedEvent and the executor is 
restarted.

This is not the expected behavior.
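
For illustration, the shape of the guard being suggested (a sketch; the actor 
and field names only approximate CoarseGrainedExecutorBackend):

{code}
import akka.actor.{Actor, Address}
import akka.remote.DisassociatedEvent

// Exit only when the disassociation concerns the driver, instead of reacting
// to every disassociation seen by Spark's actor system.
class DisassociationFilterSketch(driverAddress: Address) extends Actor {
  def receive: Receive = {
    case event: DisassociatedEvent =>
      if (event.remoteAddress == driverAddress) {
        context.system.shutdown()  // lost the driver: exiting makes sense here
      } else {
        // Some other remote system went away (e.g. a user's external actors):
        // log and carry on rather than killing the executor.
      }
  }
}
{code}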



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5268) CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent

2015-01-15 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-5268:
---
Summary: CoarseGrainedExecutorBackend exits for irrelevant 
DisassociatedEvent  (was: ExecutorBackend exits for irrelevant 
DisassociatedEvent)

> CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent
> 
>
> Key: SPARK-5268
> URL: https://issues.apache.org/jira/browse/SPARK-5268
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Nan Zhu
>
> In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the 
> executor backend actor and exit the program upon receiving such an event...
> Let's consider the following case:
> The user may develop an Akka-based program which starts an actor with 
> Spark's actor system and communicates with an external actor system (e.g. an 
> Akka-based receiver in Spark Streaming which communicates with an external 
> system). If the external actor system fails or deliberately disassociates 
> from the actor within Spark's system, we may receive a DisassociatedEvent and 
> the executor is restarted.
> This is not the expected behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5268) ExecutorBackend exits for irrelevant DisassociatedEvent

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278786#comment-14278786
 ] 

Apache Spark commented on SPARK-5268:
-

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/4063

> ExecutorBackend exits for irrelevant DisassociatedEvent
> ---
>
> Key: SPARK-5268
> URL: https://issues.apache.org/jira/browse/SPARK-5268
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Nan Zhu
>
> In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the 
> executor backend actor and exit the program upon receiving such an event...
> Let's consider the following case:
> The user may develop an Akka-based program which starts an actor with 
> Spark's actor system and communicates with an external actor system (e.g. an 
> Akka-based receiver in Spark Streaming which communicates with an external 
> system). If the external actor system fails or deliberately disassociates 
> from the actor within Spark's system, we may receive a DisassociatedEvent and 
> the executor is restarted.
> This is not the expected behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5269) BlockManager.dataDeserialize always creates a new serializer instance

2015-01-15 Thread Ivan Vergiliev (JIRA)
Ivan Vergiliev created SPARK-5269:
-

 Summary: BlockManager.dataDeserialize always creates a new 
serializer instance
 Key: SPARK-5269
 URL: https://issues.apache.org/jira/browse/SPARK-5269
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Ivan Vergiliev


BlockManager.dataDeserialize always creates a new instance of the serializer, 
which is pretty slow in some cases. I'm using Kryo serialization and have a 
custom registrator, and its register method is showing up as taking about 15% 
of the execution time in my profiles. This started happening after I increased 
the number of keys in a job with a shuffle phase by a factor of 40.

One solution I can think of is to create a ThreadLocal SerializerInstance for 
the defaultSerializer, and only create a new one if a custom serializer is 
passed in. AFAICT a custom serializer is passed only from DiskStore.getValues, 
and that, on the other hand, depends on the serializer passed to 
ExternalSorter. I don't know how often this is used, but I think this can still 
be a good solution for the standard use case.
Oh, and also - ExternalSorter already has a SerializerInstance, so if the 
getValues method is called from a single thread, maybe we can pass that 
directly?

I'd be happy to try a patch but would probably need a confirmation from someone 
that this approach would indeed work (or an idea for another).
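
For illustration, a minimal sketch of the thread-local idea above, written 
against Spark's public Serializer/SerializerInstance API (the pool class and its 
name are made up for this example, not existing Spark code):

{code}
import org.apache.spark.serializer.{Serializer, SerializerInstance}

// Cache one SerializerInstance per thread for the default serializer and keep
// the old behavior (a fresh instance) for any custom serializer passed in.
class SerializerInstancePool(defaultSerializer: Serializer) {
  private val cached = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance = defaultSerializer.newInstance()
  }

  def instanceFor(serializer: Serializer): SerializerInstance =
    if (serializer eq defaultSerializer) cached.get() // reused across blocks on this thread
    else serializer.newInstance()                     // e.g. the one from DiskStore.getValues
}
{code}

BlockManager.dataDeserialize could then ask such a pool for an instance instead 
of calling newInstance() for every block, if this approach turns out to be safe.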



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-15 Thread Hamel Ajay Kothari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278819#comment-14278819
 ] 

Hamel Ajay Kothari commented on SPARK-5097:
---

Am I correct in interpreting that this would allow us to trivially select 
columns at runtime since we'd just use {{SchemaRDD(stringColumnName)}}? In the 
world of catalyst selecting columns known only at runtime was a real pain 
because the only defined way to do it in the docs was to use quasiquotes or use 
{{SchemaRDD.baseLogicalPlan.resolve()}}. The first couldn't be defined at 
runtime (as far as I know) and the second required you to depend on expressions.

Also, is there any way to control the name of the resulting columns from 
groupby+aggregate (or similar methods that add columns) in this plan?

> Adding data frame APIs to SchemaRDD
> ---
>
> Key: SPARK-5097
> URL: https://issues.apache.org/jira/browse/SPARK-5097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
> Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf
>
>
> SchemaRDD, through its DSL, already provides common data frame 
> functionalities. However, the DSL was originally created for constructing 
> test cases without much end-user usability and API stability consideration. 
> This design doc proposes a set of API changes for Scala and Python to make 
> the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5267) Add a streaming module to ingest Apache Camel Messages from a configured endpoints

2015-01-15 Thread Steve Brewin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Brewin updated SPARK-5267:

Description: 
The number of input stream protocols supported by Spark Streaming is quite 
limited, which constrains the number of systems with which it can be integrated.

This proposal solves the problem by adding an optional module that integrates 
Apache Camel, which supports many additional input protocols. Our tried and 
tested implementation of this proposal is "spark-streaming-camel". 

An Apache Camel service is run on a separate Thread, consuming each 
http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html
 and storing it into Spark's memory. The provider of the Message is specified 
by any consuming component URI documented at 
http://camel.apache.org/components.html, making all of these protocols 
available to Spark Streaming.

Thoughts?




  was:
The number of input stream protocols supported by Spark Streaming is quite 
limited, which constrains the number of systems with which it can be integrated.

This proposal solves the problem by adding an optional module that integrates 
Apache Camel, which support many more input protocols. Our tried and tested 
implementation of this proposal is "spark-streaming-camel". 

An Apache Camel service is run on a separate Thread, consuming each 
http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html
 and storing it into Spark's memory. The provider of the Message is specified 
by any consuming component URI documented at 
http://camel.apache.org/components.html, making all of these protocols 
available to Spark Streaming.

Thoughts?





> Add a streaming module to ingest Apache Camel Messages from a configured 
> endpoints
> --
>
> Key: SPARK-5267
> URL: https://issues.apache.org/jira/browse/SPARK-5267
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Steve Brewin
>  Labels: features
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> The number of input stream protocols supported by Spark Streaming is quite 
> limited, which constrains the number of systems with which it can be 
> integrated.
> This proposal solves the problem by adding an optional module that integrates 
> Apache Camel, which supports many additional input protocols. Our tried and 
> tested implementation of this proposal is "spark-streaming-camel". 
> An Apache Camel service is run on a separate Thread, consuming each 
> http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html
>  and storing it into Spark's memory. The provider of the Message is specified 
> by any consuming component URI documented at 
> http://camel.apache.org/components.html, making all of these protocols 
> available to Spark Streaming.
> Thoughts?
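
For illustration only, a rough sketch of the kind of receiver described above, 
assuming Spark Streaming's public Receiver API and Camel's ConsumerTemplate; 
this is not the actual "spark-streaming-camel" implementation:

{code}
import org.apache.camel.impl.DefaultCamelContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Polls a Camel endpoint on a background thread and hands each message body
// to Spark Streaming via store().
class CamelReceiver(endpointUri: String)
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  @transient private var camel: DefaultCamelContext = _

  override def onStart(): Unit = {
    camel = new DefaultCamelContext()
    camel.start()
    val consumer = camel.createConsumerTemplate()
    new Thread("camel-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          val body = consumer.receiveBody(endpointUri, 1000L) // wait up to 1 second
          if (body != null) store(body.toString)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    if (camel != null) camel.stop()
  }
}
{code}

Such a receiver could then be plugged in with ssc.receiverStream(new 
CamelReceiver(uri)) for any consuming component URI from the Camel components 
page.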



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model

2015-01-15 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278947#comment-14278947
 ] 

Travis Galoppo commented on SPARK-5012:
---

This will probably be affected by SPARK-5019


> Python API for Gaussian Mixture Model
> -
>
> Key: SPARK-5012
> URL: https://issues.apache.org/jira/browse/SPARK-5012
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Meethu Mathew
>
> Add Python API for the Scala implementation of GMM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5270) Elegantly check if RDD is empty

2015-01-15 Thread Al M (JIRA)
Al M created SPARK-5270:
---

 Summary: Elegantly check if RDD is empty
 Key: SPARK-5270
 URL: https://issues.apache.org/jira/browse/SPARK-5270
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
 Environment: Centos 6
Reporter: Al M
Priority: Trivial


Right now there is no clean way to check if an RDD is empty.  As discussed 
here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679

This is especially a problem when using streams.  Sometimes my batches are huge 
in one stream, sometimes I get nothing for hours.  Still I have to run count() 
to check whether there is anything in the RDD.

I can also run first() and catch the exception; this is neither a clean nor 
fast solution.

I'd like a method rdd.isEmpty that returns a boolean.
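
In the meantime, one possible workaround (a sketch, not an existing API) is to 
take a single element instead of counting everything:

{code}
import org.apache.spark.rdd.RDD

// take(1) only pulls one element, so it avoids a full count(), although it
// still launches a (small) job.
def isEmpty[T](rdd: RDD[T]): Boolean = rdd.take(1).isEmpty
{code}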



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5270) Elegantly check if RDD is empty

2015-01-15 Thread Al M (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-5270:

Description: 
Right now there is no clean way to check if an RDD is empty.  As discussed 
here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679

I'd like a method rdd.isEmpty that returns a boolean.

This would be especially useful when using streams.  Sometimes my batches are 
huge in one stream, sometimes I get nothing for hours.  Still I have to run 
count() to check if there is anything in the RDD.  I can process my empty RDD 
like the others but it would be more efficient to just skip the empty ones.

I can also run first() and catch the exception; this is neither a clean nor 
fast solution.



  was:
Right now there is no clean way to check if an RDD is empty.  As discussed 
here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679

This is especially a problem when using streams.  Sometimes my batches are huge 
in one stream, sometimes i get nothing for hours.  Still I have to run count() 
to check if there is anything in the RDD.

I can also run first() and catch the exception; this is neither a clean nor 
fast solution.

I'd like a method rdd.isEmpty that returns a boolean.


> Elegantly check if RDD is empty
> ---
>
> Key: SPARK-5270
> URL: https://issues.apache.org/jira/browse/SPARK-5270
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.2.0
> Environment: Centos 6
>Reporter: Al M
>Priority: Trivial
>
> Right now there is no clean way to check if an RDD is empty.  As discussed 
> here: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
> I'd like a method rdd.isEmpty that returns a boolean.
> This would be especially useful when using streams.  Sometimes my batches are 
> huge in one stream, sometimes I get nothing for hours.  Still I have to run 
> count() to check if there is anything in the RDD.  I can process my empty RDD 
> like the others but it would be more efficient to just skip the empty ones.
> I can also run first() and catch the exception; this is neither a clean nor 
> fast solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty

2015-01-15 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278983#comment-14278983
 ] 

Al M commented on SPARK-5270:
-

I just noticed that rdd.partitions.size is set to 0 for empty RDDs and > 0 for 
RDDs with data; this is a far more elegant check than the others.

> Elegantly check if RDD is empty
> ---
>
> Key: SPARK-5270
> URL: https://issues.apache.org/jira/browse/SPARK-5270
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.2.0
> Environment: Centos 6
>Reporter: Al M
>Priority: Trivial
>
> Right now there is no clean way to check if an RDD is empty.  As discussed 
> here: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
> I'd like a method rdd.isEmpty that returns a boolean.
> This would be especially useful when using streams.  Sometimes my batches are 
> huge in one stream, sometimes I get nothing for hours.  Still I have to run 
> count() to check if there is anything in the RDD.  I can process my empty RDD 
> like the others but it would be more efficient to just skip the empty ones.
> I can also run first() and catch the exception; this is neither a clean nor 
> fast solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty

2015-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278993#comment-14278993
 ] 

Sean Owen commented on SPARK-5270:
--

I think it's conceivable to have an RDD with no elements but nonzero partitions 
though. Witness:

{code}
val empty = sc.parallelize(Array[Int]())
empty.count
...
0
empty.partitions.size
...
8
{code}

> Elegantly check if RDD is empty
> ---
>
> Key: SPARK-5270
> URL: https://issues.apache.org/jira/browse/SPARK-5270
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.2.0
> Environment: Centos 6
>Reporter: Al M
>Priority: Trivial
>
> Right now there is no clean way to check if an RDD is empty.  As discussed 
> here: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
> I'd like a method rdd.isEmpty that returns a boolean.
> This would be especially useful when using streams.  Sometimes my batches are 
> huge in one stream, sometimes I get nothing for hours.  Still I have to run 
> count() to check if there is anything in the RDD.  I can process my empty RDD 
> like the others but it would be more efficient to just skip the empty ones.
> I can also run first() and catch the exception; this is neither a clean nor 
> fast solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5185) pyspark --jars does not add classes to driver class path

2015-01-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279040#comment-14279040
 ] 

Marcelo Vanzin commented on SPARK-5185:
---

BTW I talked to Uri offline about this. The cause is that {{sc._jvm.blah}} 
seems to use the system class loader to load "blah", and {{--jars}} adds things 
to the application class loader instantiated by SparkSubmit. e.g., this works:

{code}
sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("com.cloudera.science.throwaway.ThrowAway").newInstance()
{code}

That being said, I'm not sure what the expectation is here. {{_jvm}}, starting 
with an underscore, gives me the impression that it's not really supposed to be 
a public API.

> pyspark --jars does not add classes to driver class path
> 
>
> Key: SPARK-5185
> URL: https://issues.apache.org/jira/browse/SPARK-5185
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Uri Laserson
>
> I have some random class I want access to from an Spark shell, say 
> {{com.cloudera.science.throwaway.ThrowAway}}.  You can find the specific 
> example I used here:
> https://gist.github.com/laserson/e9e3bd265e1c7a896652
> I packaged it as {{throwaway.jar}}.
> If I then run {{bin/spark-shell}} like so:
> {code}
> bin/spark-shell --master local[1] --jars throwaway.jar
> {code}
> I can execute
> {code}
> val a = new com.cloudera.science.throwaway.ThrowAway()
> {code}
> Successfully.
> I now run PySpark like so:
> {code}
> PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars 
> throwaway.jar
> {code}
> which gives me an error when I try to instantiate the class through Py4J:
> {code}
> In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway()
> ---
> Py4JError Traceback (most recent call last)
>  in ()
> > 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway()
> /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
>  in __getattr__(self, name)
> 724 def __getattr__(self, name):
> 725 if name == '__call__':
> --> 726 raise Py4JError('Trying to call a package.')
> 727 new_fqn = self._fqn + '.' + name
> 728 command = REFLECTION_COMMAND_NAME +\
> Py4JError: Trying to call a package.
> {code}
> However, if I explicitly add the {{--driver-class-path}} to add the same jar
> {code}
> PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars 
> throwaway.jar --driver-class-path throwaway.jar
> {code}
> it works
> {code}
> In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway()
> Out[1]: JavaObject id=o18
> {code}
> However, the docs state that {{--jars}} should also set the driver class path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279078#comment-14279078
 ] 

Reynold Xin commented on SPARK-5097:


[~hkothari] that is correct. It will be trivially doable to select columns at 
runtime.

For the 2nd one, not yet. That's a very good point. You can always do an extra 
projection. We will try to add it, if not in the 1st iteration, then in the 2nd 
iteration.

> Adding data frame APIs to SchemaRDD
> ---
>
> Key: SPARK-5097
> URL: https://issues.apache.org/jira/browse/SPARK-5097
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
> Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf
>
>
> SchemaRDD, through its DSL, already provides common data frame 
> functionalities. However, the DSL was originally created for constructing 
> test cases without much end-user usability and API stability consideration. 
> This design doc proposes a set of API changes for Scala and Python to make 
> the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5271) PySpark History Web UI issues

2015-01-15 Thread Andrey Zimovnov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Zimovnov updated SPARK-5271:
---
Component/s: Web UI

> PySpark History Web UI issues
> -
>
> Key: SPARK-5271
> URL: https://issues.apache.org/jira/browse/SPARK-5271
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
> Environment: PySpark 1.2.0 in yarn-client mode
>Reporter: Andrey Zimovnov
>
> After a successful run of a PySpark app via spark-submit in yarn-client mode 
> on a Hadoop 2.4 cluster, the History UI shows the same problem as described in 
> SPARK-3898.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5271) PySpark History Web UI issues

2015-01-15 Thread Andrey Zimovnov (JIRA)
Andrey Zimovnov created SPARK-5271:
--

 Summary: PySpark History Web UI issues
 Key: SPARK-5271
 URL: https://issues.apache.org/jira/browse/SPARK-5271
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: PySpark 1.2.0 in yarn-client mode
Reporter: Andrey Zimovnov


After a successful run of a PySpark app via spark-submit in yarn-client mode on 
a Hadoop 2.4 cluster, the History UI shows the same problem as described in 
SPARK-3898.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5268) CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent

2015-01-15 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-5268:
---
Priority: Blocker  (was: Major)

> CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent
> 
>
> Key: SPARK-5268
> URL: https://issues.apache.org/jira/browse/SPARK-5268
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Nan Zhu
>Priority: Blocker
>
> In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the 
> executor backend actor and exit the program upon receiving such an event...
> Consider the following case: the user develops an Akka-based program which 
> starts an actor within Spark's actor system and communicates with an external 
> actor system (e.g. an Akka-based receiver in Spark Streaming which communicates 
> with an external system). If the external actor system fails or deliberately 
> disassociates from the actor within Spark's system, we may receive a 
> DisassociatedEvent and the executor is restarted.
> This is not the expected behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-15 Thread Muhammad-Ali A'rabi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279170#comment-14279170
 ] 

Muhammad-Ali A'rabi commented on SPARK-5226:


This is DBSCAN algorithm:

{noformat}
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
  mark P as visited
  NeighborPts = regionQuery(P, eps)
  if sizeof(NeighborPts) < MinPts
 mark P as NOISE
  else
 C = next cluster
 expandCluster(P, NeighborPts, C, eps, MinPts)
  
expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
  if P' is not visited
 mark P' as visited
 NeighborPts' = regionQuery(P', eps)
 if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with NeighborPts'
  if P' is not yet member of any cluster
 add P' to cluster C
  
regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
{noformat}

As you can see, there are just two parameters. There are two ways to implement 
this. The first is faster (O(n log n)) but requires more memory (O(n^2)); the 
other is slower (O(n^2)) but requires less memory (O(n)). I prefer the first, 
since we are not short on memory.
There are two phases:
* Preprocessing. A distance matrix over all points is built, i.e. the distance 
between every pair of points is computed. This is highly parallelizable.
* Main process. The algorithm runs as described in the pseudo-code, with the 
two foreach loops parallelized. Region queries are very fast (O(1)) thanks to 
the preprocessing.
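
A rough Scala sketch of the preprocessing phase described above (Point, distance 
and distanceMatrix are illustrative names, not MLlib API):

{code}
import org.apache.spark.rdd.RDD

case class Point(id: Long, features: Array[Double])

// Euclidean distance between two points.
def distance(a: Point, b: Point): Double =
  math.sqrt(a.features.zip(b.features).map { case (x, y) => (x - y) * (x - y) }.sum)

// One entry per ordered pair of point ids: O(n^2) entries, matching the
// memory/speed trade-off described above.
def distanceMatrix(points: RDD[Point]): RDD[((Long, Long), Double)] =
  points.cartesian(points).map { case (a, b) => ((a.id, b.id), distance(a, b)) }
{code}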

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. The first candidate, I think, is DBSCAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-15 Thread Muhammad-Ali A'rabi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279170#comment-14279170
 ] 

Muhammad-Ali A'rabi edited comment on SPARK-5226 at 1/15/15 7:33 PM:
-

This is DBSCAN algorithm:

{noformat}
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
  mark P as visited
  NeighborPts = regionQuery(P, eps)
  if sizeof(NeighborPts) < MinPts
 mark P as NOISE
  else
 C = next cluster
 expandCluster(P, NeighborPts, C, eps, MinPts)
  
expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
  if P' is not visited
 mark P' as visited
 NeighborPts' = regionQuery(P', eps)
 if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with NeighborPts'
  if P' is not yet member of any cluster
 add P' to cluster C
  
regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
{noformat}

As you can see, there are just two parameters. There are two ways to implement 
this. The first is faster (O(n log n)) but requires more memory (O(n^2)); the 
other is slower (O(n^2)) but requires less memory (O(n)). I prefer the first, 
since we are not short on memory.
There are two phases:
* Preprocessing. A distance matrix over all points is built, i.e. the distance 
between every pair of points is computed. This is highly parallelizable.
* Main process. The algorithm runs as described in the pseudo-code, with the 
two foreach loops parallelized. Region queries are very fast (O(1)) thanks to 
the preprocessing.


was (Author: angellandros):
This is DBSCAN algorithm:

{noformat}
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
  mark P as visited
  NeighborPts = regionQuery(P, eps)
  if sizeof(NeighborPts) < MinPts
 mark P as NOISE
  else
 C = next cluster
 expandCluster(P, NeighborPts, C, eps, MinPts)
  
expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
  if P' is not visited
 mark P' as visited
 NeighborPts' = regionQuery(P', eps)
 if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with NeighborPts'
  if P' is not yet member of any cluster
 add P' to cluster C
  
regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
{noformat}

As you can see, there are just two parameters. There is two ways of 
implementation. First one is faster (O(n log n), and requires more memory 
(O(n^2)). The other way is slower (O(n^2)) and requires less memory (O(n)). But 
I prefer the first one, as we are not short one memory.
There are two phases of running:
* Preprocessing. In this phase a distance matrix for all point is created and 
distances between every two points is calculated. Very parallel.
* Main Process. In this phase the algorithm will run, as described in 
pseudo-code, and two foreach's are parallelized. Region queries are done very 
fast (O(1)), because of preprocessing.

> Add DBSCAN Clustering Algorithm to MLlib
> 
>
> Key: SPARK-5226
> URL: https://issues.apache.org/jira/browse/SPARK-5226
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Muhammad-Ali A'rabi
>Priority: Minor
>  Labels: DBSCAN
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. The first candidate, I think, is DBSCAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5224) parallelize list/ndarray is really slow

2015-01-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5224.
---
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

Issue resolved by pull request 4024
[https://github.com/apache/spark/pull/4024]

> parallelize list/ndarray is really slow
> ---
>
> Key: SPARK-5224
> URL: https://issues.apache.org/jira/browse/SPARK-5224
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>Priority: Blocker
> Fix For: 1.3.0, 1.2.1
>
>
> After the default batchSize was changed to 0 (batching based on object size), 
> parallelize() still uses BatchedSerializer with batchSize=1.
> Also, BatchedSerializer did not work well with list and numpy.ndarray



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5224) parallelize list/ndarray is really slow

2015-01-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5224:
--
Assignee: Davies Liu

> parallelize list/ndarray is really slow
> ---
>
> Key: SPARK-5224
> URL: https://issues.apache.org/jira/browse/SPARK-5224
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.3.0, 1.2.1
>
>
> After the default batchSize was changed to 0 (batching based on object size), 
> parallelize() still uses BatchedSerializer with batchSize=1.
> Also, BatchedSerializer did not work well with list and numpy.ndarray



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279216#comment-14279216
 ] 

Apache Spark commented on SPARK-5111:
-

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/4064

> HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
> ---
>
> Key: SPARK-5111
> URL: https://issues.apache.org/jira/browse/SPARK-5111
> Project: Spark
>  Issue Type: Bug
>Reporter: Zhan Zhang
>
> Due to "java.lang.NoSuchFieldError: SASL_PROPS" error. Need to backport some 
> hive-0.14 fix into spark, since there is no effort to upgrade hive to 0.14 
> support in spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

2015-01-15 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5272:


 Summary: Refactor NaiveBayes to support discrete and continuous 
labels,features
 Key: SPARK-5272
 URL: https://issues.apache.org/jira/browse/SPARK-5272
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley


This JIRA is to discuss refactoring NaiveBayes in order to support both 
discrete and continuous labels and features.

Currently, NaiveBayes supports only discrete labels and features.

Proposal: Generalize it to support continuous values as well.

Some items to discuss are:
* How commonly are continuous labels/features used in practice?  (Is this 
necessary?)
* What should the API look like?
** E.g., should NB have multiple classes for each type of label/feature, or 
should it take a general Factor type parameter?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5273) Improve documentation examples for LinearRegression

2015-01-15 Thread Dev Lakhani (JIRA)
Dev Lakhani created SPARK-5273:
--

 Summary: Improve documentation examples for LinearRegression 
 Key: SPARK-5273
 URL: https://issues.apache.org/jira/browse/SPARK-5273
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Dev Lakhani
Priority: Minor


In the document:
https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html

Under "Linear least squares, Lasso, and ridge regression", the suggested call to 
LinearRegressionWithSGD.train():

// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

is not ideal even for simple examples such as y = x. It should be replaced with 
more realistic parameters that include a step size:

val lr = new LinearRegressionWithSGD()
lr.optimizer.setStepSize(0.0001)
lr.optimizer.setNumIterations(100)

or

LinearRegressionWithSGD.train(input, 100, 0.0001)

to produce a reasonable MSE. It took me a while on the dev forum to learn that 
the step size should be really small; this might save someone the same effort 
when learning MLlib.
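
For reference, a self-contained version of the suggested fix (a sketch assuming 
a SparkContext named sc, as in the shell; the step size is only a starting point 
to tune):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Toy y = x data set.
val parsedData = sc.parallelize((1 to 100).map { i =>
  LabeledPoint(i.toDouble, Vectors.dense(i.toDouble))
})

// Small step size as suggested above.
val lr = new LinearRegressionWithSGD()
lr.optimizer.setStepSize(0.0001).setNumIterations(100)
val model = lr.run(parsedData)
{code}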





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279235#comment-14279235
 ] 

Joseph K. Bradley commented on SPARK-5272:
--

My initial thoughts:

(1) Are continuous labels/features important to support?

In terms of when NB *should* be used, I believe they are important.  People use 
Logistic Regression with continuous labels and features, and Naive Bayes is 
really the same type of model (just trained differently).
* E.g.: Ng & Jordan. "On Discriminative vs. Generative classifiers: A 
comparison of logistic regression and naive Bayes."  NIPS 2002.
** Theoretically, the 2 types of models have the same purpose, but they should 
be used in different regimes.

(2) What should the API look like?

I believe there should be a NaiveBayesClassifier and NaiveBayesRegressor which 
use the same underlying implementation.  That implementation should include a 
Factor concept encoding the type of distribution.

This should be simple to do for Naive Bayes, and it will give some guidance if 
we move to support more general probabilistic graphical models in MLlib.
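
Purely as an illustration of the Factor idea (none of these types exist in 
MLlib), the per-feature distributions could look something like:

{code}
sealed trait Factor {
  def logLikelihood(x: Double): Double
}

// Discrete feature: log P(x) looked up from estimated class-conditional probabilities.
case class MultinomialFactor(logProb: Map[Double, Double]) extends Factor {
  def logLikelihood(x: Double): Double = logProb.getOrElse(x, Double.NegativeInfinity)
}

// Continuous feature: Gaussian log-density with estimated mean and variance.
case class GaussianFactor(mean: Double, variance: Double) extends Factor {
  def logLikelihood(x: Double): Double =
    -0.5 * (math.log(2 * math.Pi * variance) + (x - mean) * (x - mean) / variance)
}
{code}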

> Refactor NaiveBayes to support discrete and continuous labels,features
> --
>
> Key: SPARK-5272
> URL: https://issues.apache.org/jira/browse/SPARK-5272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> This JIRA is to discuss refactoring NaiveBayes in order to support both 
> discrete and continuous labels and features.
> Currently, NaiveBayes supports only discrete labels and features.
> Proposal: Generalize it to support continuous values as well.
> Some items to discuss are:
> * How commonly are continuous labels/features used in practice?  (Is this 
> necessary?)
> * What should the API look like?
> ** E.g., should NB have multiple classes for each type of label/feature, or 
> should it take a general Factor type parameter?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279235#comment-14279235
 ] 

Joseph K. Bradley edited comment on SPARK-5272 at 1/15/15 8:13 PM:
---

My initial thoughts:

(1) Are continuous labels/features important to support?

In terms of when NB *should* be used, I believe they are important.  People use 
Logistic Regression with continuous labels and features, and Naive Bayes is 
really the same type of model (just trained differently).
* E.g.: Ng & Jordan. "On Discriminative vs. Generative classifiers: A 
comparison of logistic regression and naive Bayes."  NIPS 2002.
** Theoretically, the 2 types of models have the same purpose, but they should 
be used in different regimes.

In terms of when NB is actually used by Spark users, I'm not sure.  Hopefully 
some research and discussion here will make that clearer.

(2) What should the API look like?

I believe there should be a NaiveBayesClassifier and NaiveBayesRegressor which 
use the same underlying implementation.  That implementation should include a 
Factor concept encoding the type of distribution.

This should be simple to do for Naive Bayes, and it will give some guidance if 
we move to support more general probabilistic graphical models in MLlib.


was (Author: josephkb):
My initial thoughts:

(1) Are continuous labels/features important to support?

In terms of when NB *should* be used, I believe they are important.  People use 
Logistic Regression with continuous labels and features, and Naive Bayes is 
really the same type of model (just trained differently).
* E.g.: Ng & Jordan. "On Discriminative vs. Generative classifiers: A 
comparison of logistic regression and naive Bayes."  NIPS 2002.
** Theoretically, the 2 types of models have the same purpose, but they should 
be used in different regimes.

(2) What should the API look like?

I believe there should be a NaiveBayesClassifier and NaiveBayesRegressor which 
use the same underlying implementation.  That implementation should include a 
Factor concept encoding the type of distribution.

This should be simple to do for Naive Bayes, and it will give some guidance if 
we move to support more general probabilistic graphical models in MLlib.

> Refactor NaiveBayes to support discrete and continuous labels,features
> --
>
> Key: SPARK-5272
> URL: https://issues.apache.org/jira/browse/SPARK-5272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> This JIRA is to discuss refactoring NaiveBayes in order to support both 
> discrete and continuous labels and features.
> Currently, NaiveBayes supports only discrete labels and features.
> Proposal: Generalize it to support continuous values as well.
> Some items to discuss are:
> * How commonly are continuous labels/features used in practice?  (Is this 
> necessary?)
> * What should the API look like?
> ** E.g., should NB have multiple classes for each type of label/feature, or 
> should it take a general Factor type parameter?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279241#comment-14279241
 ] 

Joseph K. Bradley commented on SPARK-4894:
--

[~rnowling]  I too don't want to hold up the Bernoulli NB too much.  I just 
made & linked a JIRA per your suggestion 
[https://issues.apache.org/jira/browse/SPARK-5272].  I'll add my thoughts there 
(and feel free to copy yours there too).

I'm not sure if we can reuse much from decision trees since they are not 
probabilistic models and have a different concept of "loss" or "error."

For now, generalizing the existing Naive Bayes class to handle the Bernoulli 
case sounds good.  Thanks for taking the time to discuss this!

> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279250#comment-14279250
 ] 

RJ Nowling commented on SPARK-4894:
---

Thanks, [~josephkb]!  I'd be happy to help with the NB refactoring too :) 

> Add Bernoulli-variant of Naive Bayes
> 
>
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: RJ Nowling
>Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
> version of Naive Bayes is more useful for situations where the features are 
> binary values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279251#comment-14279251
 ] 

Joseph K. Bradley commented on SPARK-5012:
--

[~MeethuMathew], [~tgaloppo] makes a good point.  It might actually be best to 
make a Python API for MultivariateGaussian first, and then to do this JIRA.  
(Since we don't want to require scipy currently, we can't use the existing 
scipy.stats.multivariate_normal class.)

> Python API for Gaussian Mixture Model
> -
>
> Key: SPARK-5012
> URL: https://issues.apache.org/jira/browse/SPARK-5012
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Meethu Mathew
>
> Add Python API for the Scala implementation of GMM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

2015-01-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279258#comment-14279258
 ] 

RJ Nowling commented on SPARK-5272:
---

Hi [~josephkb], 

I can see benefits to your suggestions of feature types (e.g., categorical, 
discrete counts, continuous, binary, etc.).  If we created corresponding 
FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it 
would promote composition which would be easier to test, debug, and maintain 
versus multiple NB subclasses like sklearn.  Additionally, if the user can 
define a type for each feature, then users can mix and match likelihood types 
as well.  Most NB implementations treat all features the same -- what if we had 
a model that allowed heterogeneous features?  If it works well in NB, it could 
be extended to other parts of MLlib.  (There is likely some overlap with 
decision trees since they support multiple feature types, so we might want to 
see if there is anything there we can reuse.)  At the API level, we could 
provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like 
the current API so that simplicity isn't compromised and provide a more 
advanced API for power users.

Does this sound like I'm understanding you correctly?

Re: Decision trees.  Decision tree models generally support different types of 
features (categorical, binary, discrete, continuous).  Does Spark's decision 
tree implementation support those different types?  How are they handled?  Do 
they abstract the feature type?  I feel there could be common ground here.


> Refactor NaiveBayes to support discrete and continuous labels,features
> --
>
> Key: SPARK-5272
> URL: https://issues.apache.org/jira/browse/SPARK-5272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> This JIRA is to discuss refactoring NaiveBayes in order to support both 
> discrete and continuous labels and features.
> Currently, NaiveBayes supports only discrete labels and features.
> Proposal: Generalize it to support continuous values as well.
> Some items to discuss are:
> * How commonly are continuous labels/features used in practice?  (Is this 
> necessary?)
> * What should the API look like?
> ** E.g., should NB have multiple classes for each type of label/feature, or 
> should it take a general Factor type parameter?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279269#comment-14279269
 ] 

Joseph K. Bradley commented on SPARK-5272:
--

I like the idea of supporting multiple feature types; I think it should be 
doable, though we'll have to figure out a simple way to specify which features 
are what type.  Decision trees support 2 types: categorical (which includes 
binary and unordered discrete values) and continuous (which includes ordered 
discrete values).  In DecisionTree, you specify categoricalFeaturesInfo which 
says which features are categorical + their arity, but I hope this can become 
part of the SchemaRDD metadata before long.

I think we can take ideas from the DecisionTree API, just not much from the 
underlying implementation.
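
For reference, a minimal sketch of how DecisionTree takes that information today 
(the values are arbitrary; feature 0 is categorical with arity 3, the rest are 
continuous):

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.rdd.RDD

def trainExample(data: RDD[LabeledPoint]) = {
  val categoricalFeaturesInfo = Map(0 -> 3)
  // numClasses = 2, impurity = "gini", maxDepth = 5, maxBins = 32
  DecisionTree.trainClassifier(data, 2, categoricalFeaturesInfo, "gini", 5, 32)
}
{code}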

> Refactor NaiveBayes to support discrete and continuous labels,features
> --
>
> Key: SPARK-5272
> URL: https://issues.apache.org/jira/browse/SPARK-5272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> This JIRA is to discuss refactoring NaiveBayes in order to support both 
> discrete and continuous labels and features.
> Currently, NaiveBayes supports only discrete labels and features.
> Proposal: Generalize it to support continuous values as well.
> Some items to discuss are:
> * How commonly are continuous labels/features used in practice?  (Is this 
> necessary?)
> * What should the API look like?
> ** E.g., should NB have multiple classes for each type of label/feature, or 
> should it take a general Factor type parameter?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279274#comment-14279274
 ] 

Joseph K. Bradley commented on SPARK-1405:
--

I'll try out the statmt dataset if that will be easier for everyone to access.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279274#comment-14279274
 ] 

Joseph K. Bradley edited comment on SPARK-1405 at 1/15/15 9:29 PM:
---

I'll try out the statmt dataset if that will be easier for everyone to access.

UPDATE: Note: The statmt dataset is an odd one since each "document" is a 
single sentence.  I'll still try it since I could imagine a lot of users 
wanting to run LDA on tweets or other short documents, but I might continue 
with my previous tests first.


was (Author: josephkb):
I'll try out the statmt dataset if that will be easier for everyone to access.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5274) Stabilize UDFRegistration API

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279352#comment-14279352
 ] 

Apache Spark commented on SPARK-5274:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4056

> Stabilize UDFRegistration API
> -
>
> Key: SPARK-5274
> URL: https://issues.apache.org/jira/browse/SPARK-5274
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> 1. Removed UDFRegistration as a mixin in SQLContext and made it a field 
> ("udf"). This removes 45 methods from SQLContext.
> 2. For Java UDFs, renamed dataType to returnType.
> 3. For Scala UDFs, added type tags.
> 4. Added all Java UDF registration methods to Scala's UDFRegistration.
> 5. Better documentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5274) Stabilize UDFRegistration API

2015-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5274:
--

 Summary: Stabilize UDFRegistration API
 Key: SPARK-5274
 URL: https://issues.apache.org/jira/browse/SPARK-5274
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


1. Removed UDFRegistration as a mixin in SQLContext and made it a field 
("udf"). This removes 45 methods from SQLContext.
2. For Java UDFs, renamed dataType to returnType.
3. For Scala UDFs, added type tags.
4. Added all Java UDF registration methods to Scala's UDFRegistration.
5. Better documentation




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-01-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279406#comment-14279406
 ] 

Josh Rosen commented on SPARK-4879:
---

I'm not sure that SparkHadoopWriter's use of FileOutputCommitter properly obeys 
the OutputCommitter contracts in Hadoop.  According to the [OutputCommitter 
Javadoc|https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/OutputCommitter.html]

{quote}
The methods in this class can be called from several different processes and 
from several different contexts. It is important to know which process and 
which context each is called from. Each method should be marked accordingly in 
its documentation. It is also important to note that not all methods are 
guaranteed to be called once and only once. If a method is not guaranteed to 
have this property the output committer needs to handle this appropriately. 
Also note it will only be in rare situations where they may be called multiple 
times for the same task.
{quote}

Based on the documentation, `needsTaskCommit` "is called from each individual 
task's process that will output to HDFS, and it is called just for that task", 
so it seems like it should be safe to call it from SparkHadoopWriter.

However, maybe we're misusing the `commitTask` method:

{quote}
If needsTaskCommit(TaskAttemptContext) returns true and this task is the task 
that the AM determines finished first, this method is called to commit an 
individual task's output. This is to mark that tasks output as complete, as 
commitJob(JobContext) will also be called later on if the entire job finished 
successfully. This is called from a task's process. This may be called multiple 
times for the same task, but different task attempts. It should be very rare 
for this to be called multiple times and requires odd networking failures to 
make this happen. In the future the Hadoop framework may eliminate this race. 
{quote}

I think that we're missing the "this task is the task that the AM determines 
finished first" part of the equation here.  If `needsTaskCommit` is false, then 
we definitely shouldn't commit (e.g. if it's an original task that lost to a 
speculated copy), but if it's true then I don't think it's safe to commit; we 
need some central authority to pick a winner.

Let's see how Hadoop does things, working backwards from actual calls of 
`commitTask` to see whether they're guarded by some coordination through the 
AM.  It looks like `OutputCommitter` is part of the `mapred` API, so I'll only 
look at classes in that package:

In `Task.java`, `committer.commitTask` is only performed after checking 
`canCommit` through `TaskUmbilicalProtocol`: 
https://github.com/apache/hadoop/blob/a655973e781caf662b360c96e0fa3f5a873cf676/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L1185.
  According to the Javadocs for TaskAttemptListenerImpl.canCommit (the actual 
concrete implementation of this method):

{code}
  /**
   * Child checking whether it can commit.
   * 
   * 
   * Commit is a two-phased protocol. First the attempt informs the
   * ApplicationMaster that it is
   * {@link #commitPending(TaskAttemptID, TaskStatus)}. Then it repeatedly polls
   * the ApplicationMaster whether it {@link #canCommit(TaskAttemptID)} This is
   * a legacy from the centralized commit protocol handling by the JobTracker.
   */
  @Override
  public boolean canCommit(TaskAttemptID taskAttemptID) throws IOException {
{code}

This ends up delegating to `Task.canCommit()`:

{code}
  /**
   * Can the output of the taskAttempt be committed. Note that once the task
   * gives a go for a commit, further canCommit requests from any other attempts
   * should return false.
   * 
   * @param taskAttemptID
   * @return whether the attempt's output can be committed or not.
   */
  boolean canCommit(TaskAttemptId taskAttemptID);
{code}

There's a bunch of tricky logic that involves communication with the AM (see 
AttemptCommitPendingTransition and the other transitions in TaskImpl), but it 
looks like the gist is that the "winner" is picked by the AM through some 
central coordination process. 

So, it looks like the right fix is to implement these same state transitions 
ourselves.  It would be nice if there were a clean way to do this that could be 
easily backported to maintenance branches.  
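
To make the idea concrete, here is a minimal sketch of the kind of driver-side 
arbitration this implies (the class name and the RPC plumbing are hypothetical, 
not existing Spark APIs): only the first attempt to ask for a given partition is 
allowed to commit, mirroring the AM's canCommit handling above.

{code}
// Hypothetical sketch, not Spark code: a driver-side authority that mirrors the
// AM's canCommit arbitration.  Each task would ask this (e.g. over an RPC) right
// before calling committer.commitTask().
import scala.collection.mutable

class CommitArbiter {
  // partition id -> attempt number that has been authorized to commit
  private val authorized = mutable.Map[Int, Long]()

  def canCommit(partition: Int, attempt: Long): Boolean = synchronized {
    authorized.get(partition) match {
      case Some(winner) => winner == attempt   // only the chosen attempt may (re)try its commit
      case None =>
        authorized(partition) = attempt        // first attempt to ask wins
        true
    }
  }
}
{code}

A real fix would also need to handle the winner failing before it commits (e.g. 
a timeout that re-opens the partition), which is what the TaskImpl state 
transitions deal with.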

> Missing output partitions after job completes with speculative execution
> 
>
> Key: SPARK-4879
> URL: https://issues.apache.org/jira/browse/SPARK-4879
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical

[jira] [Commented] (SPARK-5144) spark-yarn module should be published

2015-01-15 Thread Matthew Sanders (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279457#comment-14279457
 ] 

Matthew Sanders commented on SPARK-5144:


+1 -- I am in a similar situation and would love to see this addressed somehow. 

> spark-yarn module should be published
> -
>
> Key: SPARK-5144
> URL: https://issues.apache.org/jira/browse/SPARK-5144
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Aniket Bhatnagar
>
> We disabled publishing of certain modules in SPARK-3452. One such module is 
> spark-yarn. This breaks applications that submit Spark jobs programmatically 
> with master set to yarn-client, because SparkContext depends on classes from 
> the yarn-client module to submit the YARN application. 
> Here is the stack trace that you get if you submit the Spark job without the 
> yarn-client dependency:
> 2015-01-07 14:39:22,799 [pool-10-thread-13] [info] o.a.s.s.MemoryStore - 
> MemoryStore started with capacity 731.7 MB
> Exception in thread "pool-10-thread-13" java.lang.ExceptionInInitializerError
> at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784)
> at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:105)
> at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:180)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292)
> at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:232)
> at com.myimpl.Server:23)
> at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
> at scala.util.Try$.apply(Try.scala:191)
> at scala.util.Success.map(Try.scala:236)
> at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
> at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
> at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
> at scala.util.Try$.apply(Try.scala:191)
> at scala.util.Success.map(Try.scala:236)
> at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
> at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unable to load YARN support
> at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199)
> at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:194)
> at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
> ... 27 more
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:190)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:195)
> ... 29 more
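
For reference, a build.sbt sketch of the dependency such programmatic 
yarn-client submitters would declare if the module were published (coordinates 
assumed from the pre-1.2 spark-yarn artifacts, not something that currently 
resolves for 1.2.0):

{code}
// build.sbt sketch (assumed coordinates): what programmatic yarn-client users
// would add once spark-yarn is published again.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0",
  "org.apache.spark" %% "spark-yarn" % "1.2.0"  // not published for 1.2.0 -- which is the point of this issue
)
{code}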



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests

2015-01-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279524#comment-14279524
 ] 

Imran Rashid commented on SPARK-4746:
-

This doesn't work as well as I thought -- all of the junit tests get skipped.  
The problem is a mismatch between the way test args are handled by the junit 
test runner and the scalatest runner.

I think our options are:

1) abandon a tag-based approach: just use directories / file names to separate 
out unit tests & integration tests

2) change all of our junit tests to scalatest.  (it's perfectly fine to test 
java code w/ scalatest.)

3) See if we can get scalatest to also run our junit tests

4) change the sbt task to first run scalatest, with all junit tests turned off, 
and then just run the junit tests, so that we can pass in different args to 
each one.

5) just live w/ the fact that the junit tests never match the tags so they are 
effectively considered integration tests.

Note that junit has a notion similar to tags, called categories: 
https://github.com/junit-team/junit/wiki/Categories
The main problem is still the difference in the runner args for the two test 
frameworks; a small tagging sketch follows below.
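
For illustration, a minimal scalatest tagging sketch (the tag name is made up; 
the junit-side equivalent would be an @Category annotation, which is exactly 
where the runner-arg mismatch shows up):

{code}
// Sketch only: tag a slow test so it can be excluded with scalatest's "-l" runner arg.
import org.scalatest.{FunSuite, Tag}

object IntegrationTest extends Tag("org.apache.spark.tags.IntegrationTest")

class ExampleSuite extends FunSuite {
  test("fast unit check") {
    assert(1 + 1 === 2)
  }

  // Excluded when the runner is invoked with: -l org.apache.spark.tags.IntegrationTest
  test("slow end-to-end check", IntegrationTest) {
    // ... spin up a local cluster, run a job, etc.
  }
}
{code}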

> integration tests should be separated from faster unit tests
> 
>
> Key: SPARK-4746
> URL: https://issues.apache.org/jira/browse/SPARK-4746
> Project: Spark
>  Issue Type: Bug
>Reporter: Imran Rashid
>Priority: Trivial
>
> Currently there isn't a good way for a developer to skip the longer 
> integration tests.  This can slow down local development.  See 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html
> One option is to use scalatest's notion of test tags to tag all integration 
> tests, so they could easily be skipped



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5275) pyspark.streaming is not included in assembly jar

2015-01-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-5275:
-

 Summary: pyspark.streaming is not included in assembly jar
 Key: SPARK-5275
 URL: https://issues.apache.org/jira/browse/SPARK-5275
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0, 1.3.0
Reporter: Davies Liu
Priority: Blocker


The pyspark.streaming module is not included in the assembly jar of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5276) pyspark.streaming is not included in assembly jar

2015-01-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-5276:
-

 Summary: pyspark.streaming is not included in assembly jar
 Key: SPARK-5276
 URL: https://issues.apache.org/jira/browse/SPARK-5276
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0, 1.3.0
Reporter: Davies Liu
Priority: Blocker


The pyspark.streaming module is not included in the assembly jar of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5274) Stabilize UDFRegistration API

2015-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5274.

   Resolution: Fixed
Fix Version/s: 1.3.0

> Stabilize UDFRegistration API
> -
>
> Key: SPARK-5274
> URL: https://issues.apache.org/jira/browse/SPARK-5274
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.3.0
>
>
> 1. Removed UDFRegistration as a mixin in SQLContext and made it a field 
> ("udf"). This removes 45 methods from SQLContext.
> 2. For Java UDFs, renamed dataType to returnType.
> 3. For Scala UDFs, added type tags.
> 4. Added all Java UDF registration methods to Scala's UDFRegistration.
> 5. Better documentation
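
A rough sketch of what item 1 means for user code, assuming the 1.3.0 
registration signatures match the description above (the UDF and app names are 
made up):

{code}
// Sketch: UDF registration through the new `udf` field rather than a mixin on SQLContext.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object UdfRegistrationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("udf-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Scala UDF: argument and return types are captured via type tags.
    sqlContext.udf.register("strLen", (s: String) => s.length)

    sc.stop()
  }
}
{code}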



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs

2015-01-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279602#comment-14279602
 ] 

Imran Rashid commented on SPARK-3622:
-

In some ways this kinda reminds me of the problem w/ accumulators and lazy 
transformations.  Accumulators are basically a second output, but Spark itself 
provides no way to track when that output is ready.  It's up to the developer to 
figure it out.

If you do a transformation on {{rddA}}, you've got to know to "wait" until 
you've also got a transformation on {{rddB}} ready as well.  Probably the 
simplest case for this is filtering records by some condition, but keeping both 
the good and the bad records, a la the Scala collections' {{partition}} method.  
I think this has come up on the user mailing list a few times.

What about having some new type {{MultiRDD}}, which only runs when you've 
queued up an action on *all* of its RDDs?  E.g. something like:

{code}
val input: RDD[String] = ...
val goodAndBad: MultiRdd[String, String] = input.partition{ str => MyRecordParser.isOk(str) }
val bad: RDD[String] = goodAndBad.get(1)
bad.saveAsTextFile(...) // doesn't do anything yet
val parsed: RDD[MyCaseClass] = goodAndBad.get(0).map{ str => MyRecordParser.parse(str) }
val tmp: RDD[MyCaseClass] = parsed.map{f1}.filter{f2}.mapPartitions{f3} // still doesn't do anything ...
val result = tmp.reduce{reduceFunc} // now everything gets run
{code}
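
For contrast, a rough sketch of how this is typically worked around today -- two 
filter passes over the same cached input (reusing the hypothetical names from 
the example above, with {{sc}} assumed to be an existing SparkContext and the 
paths being placeholders):

{code}
// Today's workaround sketch: two filter passes over a cached RDD.  A MultiRDD-style
// API could produce both outputs in a single scan instead.
import org.apache.spark.rdd.RDD

val input: RDD[String] = sc.textFile("hdfs://.../records").cache()

val good: RDD[String] = input.filter(str => MyRecordParser.isOk(str))
val bad: RDD[String]  = input.filter(str => !MyRecordParser.isOk(str))

bad.saveAsTextFile("hdfs://.../bad")   // first action: scans the input and populates the cache
val result = good.map(MyRecordParser.parse).reduce(reduceFunc)   // second action: served from the cache
{code}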

> Provide a custom transformation that can output multiple RDDs
> -
>
> Key: SPARK-3622
> URL: https://issues.apache.org/jira/browse/SPARK-3622
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>
> All existing transformations return at most one RDD, even those that take 
> user-supplied functions such as mapPartitions(). However, sometimes a 
> user-provided function may need to output multiple RDDs; for instance, a 
> filter function that divides the input RDD into several RDDs. While it's 
> possible to get multiple RDDs by transforming the same RDD multiple times, it 
> may be more efficient to do this concurrently in one shot, especially if the 
> user's existing function is already generating different data sets. This is 
> the case in Hive on Spark, where Hive's map function and reduce function can 
> output different data sets to be consumed by subsequent stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators

2015-01-15 Thread Max Seiden (JIRA)
Max Seiden created SPARK-5277:
-

 Summary: SparkSqlSerializer does not register user specified 
KryoRegistrators 
 Key: SPARK-5277
 URL: https://issues.apache.org/jira/browse/SPARK-5277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Max Seiden


Although the SparkSqlSerializer class extends the KryoSerializer in core, its 
overridden newKryo() does not call super.newKryo(). This results in 
inconsistent serializer behaviors depending on whether a KryoSerializer 
instance or a SparkSqlSerializer instance is used. This may also be related to 
the TODO in KryoResourcePool, which uses KryoSerializer instead of 
SparkSqlSerializer due to yet-to-be-investigated test failures.

An example of the divergence in behavior: The Exchange operator creates a new 
SparkSqlSerializer instance (with an empty conf; another issue) when it is 
constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the 
resource pool (see above). The result is that the serialized in-memory columns 
are created using the user-provided serializers / registrators, while 
serialization during exchange does not use them.
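
A sketch of the kind of fix this points at (not the actual patch; the SQL-side 
registrations are elided): have newKryo() start from super.newKryo() so the 
user's KryoRegistrators are applied before SQL adds its own registrations.

{code}
// Sketch only -- not the actual patch.
package org.apache.spark.sql.execution

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

private[sql] class SparkSqlSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    // super.newKryo() honors spark.kryo.registrator, so user classes get registered first.
    val kryo = super.newKryo()
    // ... then register the SQL-internal classes (rows, decimals, etc.) on top ...
    kryo
  }
}
{code}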



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators

2015-01-15 Thread Max Seiden (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Seiden updated SPARK-5277:
--
Remaining Estimate: (was: 24h)
 Original Estimate: (was: 24h)

> SparkSqlSerializer does not register user specified KryoRegistrators 
> -
>
> Key: SPARK-5277
> URL: https://issues.apache.org/jira/browse/SPARK-5277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Max Seiden
>
> Although the SparkSqlSerializer class extends the KryoSerializer in core, 
> its overridden newKryo() does not call super.newKryo(). This results in 
> inconsistent serializer behaviors depending on whether a KryoSerializer 
> instance or a SparkSqlSerializer instance is used. This may also be related 
> to the TODO in KryoResourcePool, which uses KryoSerializer instead of 
> SparkSqlSerializer due to yet-to-be-investigated test failures.
> An example of the divergence in behavior: The Exchange operator creates a new 
> SparkSqlSerializer instance (with an empty conf; another issue) when it is 
> constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the 
> resource pool (see above). The result is that the serialized in-memory 
> columns are created using the user-provided serializers / registrators, while 
> serialization during exchange does not use them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279648#comment-14279648
 ] 

Apache Spark commented on SPARK-5193:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4065

> Make Spark SQL API usable in Java and remove the Java-specific API
> --
>
> Key: SPARK-5193
> URL: https://issues.apache.org/jira/browse/SPARK-5193
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The Java version of the SchemaRDD API causes a high maintenance burden for Spark 
> SQL itself and downstream libraries (e.g. the MLlib pipeline API needs to support 
> both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it 
> usable for Java, and then we can remove the Java-specific version. 
> Things to remove include (Java version of):
> - data type
> - Row
> - SQLContext
> - HiveContext
> Things to consider:
> - Scala and Java have different collection libraries.
> - Scala and Java (8) have different closure interfaces.
> - Scala and Java can have duplicate definitions of common classes, such as 
> BigDecimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


