[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0

2014-04-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/571#issuecomment-41766159
  
@techaddict It looks good to me.


---


[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/571#issuecomment-41765699
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14582/


---


[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/571#issuecomment-41765698
  
Build finished. All automated tests passed.


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/30


---


[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0

2014-04-29 Thread techaddict
Github user techaddict commented on the pull request:

https://github.com/apache/spark/pull/571#issuecomment-41765491
  
@mengxr does 47ef86c392badc58052a0414115e49c2970b31eb look good?


---


[GitHub] spark pull request: [SQL] Whitelist Hive Tests

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/596#issuecomment-41765479
  
Merged build started. 


---


[GitHub] spark pull request: [SQL] Whitelist Hive Tests

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/596#issuecomment-41765473
  
 Merged build triggered. 


---


[GitHub] spark pull request: [SQL] Whitelist Hive Tests

2014-04-29 Thread marmbrus
GitHub user marmbrus opened a pull request:

https://github.com/apache/spark/pull/596

[SQL] Whitelist Hive Tests

[WIP] Includes #595.  Just putting it here for Jenkins.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/marmbrus/spark moreTests

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/596.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #596


commit efa64024e2d8b3fad23f9773e2cbff88d9884048
Author: Michael Armbrust 
Date:   2014-04-30T01:34:08Z

Fix Hive serde_regex test.

commit a4dc6123e0d2bf030815c9a0528ac294141b8a24
Author: Michael Armbrust 
Date:   2014-04-30T01:34:56Z

Add files created by hive to gitignore.

commit 8ba8bc185327436a5ec214f0efd128c97ba25940
Author: Michael Armbrust 
Date:   2014-04-30T05:56:28Z

update whitelist

commit 1943ade74a760267e52d3990cb8f95d482af4c46
Author: Michael Armbrust 
Date:   2014-04-30T06:37:48Z

More hive gitignore

commit 8649e4467e459bc174e062ea54cbb37a5b840e4f
Author: Michael Armbrust 
Date:   2014-04-30T06:38:32Z

Add hive golden answers.




---


[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0

2014-04-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/571#issuecomment-41765289
  
I would say use `org.apache.spark.examples` for now, which requires 
little code change. Then we should try to avoid using `private[spark]` in examples. 
Maybe there is an automatic way to detect it.
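
For illustration, a minimal sketch of the visibility issue, with all names invented: `private[spark]` members are visible to everything under the `org.apache.spark` package, so an example living in `org.apache.spark.examples` can call internals without noticing.

```scala
package org.apache.spark {
  // An internal helper, visible anywhere under org.apache.spark.
  private[spark] object InternalUtil {
    def parseArgs(args: Array[String]): Map[String, String] =
      args.map(_.split("=")).collect { case Array(k, v) => k -> v }.toMap
  }
}

package org.apache.spark.examples {
  // Compiles only because this package nests under org.apache.spark.
  // Moving the example to, say, org.example would turn the call below
  // into a compile error and expose the hidden dependency.
  object MyExample {
    def main(args: Array[String]): Unit =
      println(org.apache.spark.InternalUtil.parseArgs(args))
  }
}
```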


---


[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/591#discussion_r12130313
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -64,7 +70,8 @@ private[spark] class EventLoggingListener(
   def start() {
     logInfo("Logging events to %s".format(logDir))
     if (shouldCompress) {
-      val codec = conf.get("spark.io.compression.codec", CompressionCodec.DEFAULT_COMPRESSION_CODEC)
--- End diff --

Ah I see, I misread this as only a line-break change.


---


[GitHub] spark pull request: Update GradientDescentSuite.scala

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/588#issuecomment-41764996
  
Merged build finished. 


---


[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/591#discussion_r12130230
  
--- Diff: core/src/test/scala/org/apache/spark/util/FileLoggerSuite.scala ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import java.io.IOException
+
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.hadoop.fs.Path
+import org.scalatest.{BeforeAndAfter, FunSuite}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.io.CompressionCodec
+
+/**
+ * Test writing files through the FileLogger.
+ */
+class FileLoggerSuite extends FunSuite with BeforeAndAfter {
+  private val fileSystem = Utils.getHadoopFileSystem("/")
+  private val allCompressionCodecs = Seq[String](
+    "org.apache.spark.io.LZFCompressionCodec",
+    "org.apache.spark.io.SnappyCompressionCodec"
+  )
+  private val logDir = "/tmp/test-file-logger"
+  private val logDirPath = new Path(logDir)
+
+  after {
+    Try { fileSystem.delete(logDirPath, true) }
+  }
+
+  test("Simple logging") {
+    testSingleFile()
+  }
+
+  test("Simple logging with compression") {
+    allCompressionCodecs.foreach { codec =>
+      testSingleFile(Some(codec))
+    }
+  }
+
+  test("Logging multiple files") {
+    testMultipleFiles()
+  }
+
+  test("Logging multiple files with compression") {
+    allCompressionCodecs.foreach { codec =>
+      testMultipleFiles(Some(codec))
+    }
+  }
+
+  test("Logging when directory already exists") {
+    // Create the logging directory multiple times
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+
+    // If overwrite is not enabled, an exception should be thrown
+    intercept[IOException] { new FileLogger(logDir, new SparkConf, overwrite = false) }
+  }
+
+  /* ----------------- *
+   * Actual test logic *
+   * ----------------- */
+
+  /**
+   * Test logging to a single file.
+   */
+  private def testSingleFile(codecName: Option[String] = None) {
+    val conf = getLoggingConf(codecName)
+    val codec = codecName.map { c => CompressionCodec.createCodec(conf) }
+    val logger =
+      if (codecName.isDefined) {
+        new FileLogger(logDir, conf, compress = true)
+      } else {
+        new FileLogger(logDir, conf)
+      }
+    assert(fileSystem.exists(logDirPath))
+    assert(fileSystem.getFileStatus(logDirPath).isDir)
+    assert(fileSystem.listStatus(logDirPath).size === 0)
+
+    logger.newFile()
+    val files = fileSystem.listStatus(logDirPath)
+    assert(files.size === 1)
+    val firstFile = files.head
+    val firstFilePath = firstFile.getPath
+
+    logger.log("hello")
+    logger.flush()
+    assert(readFileContent(firstFilePath, codec) === "hello")
+
+    logger.log(" world")
+    logger.close()
+    assert(readFileContent(firstFilePath, codec) === "hello world")
+  }
+
+  /**
+   * Test logging to multiple files.
+   */
+  private def testMultipleFiles(codecName: Option[String] = None) {
+    val conf = getLoggingConf(codecName)
+    val codec = codecName.map { c => CompressionCodec.createCodec(conf) }
+    val logger =
+      if (codecName.isDefined) {
+        new FileLogger(logDir, conf, compress = true)
+      } else {
+        new FileLogger(logDir, conf)
+      }
+
+    logger.newFile("Jean_Valjean")
+    logger.logLine("Who am I?")
+    logger.logLine("Destiny?")
+    logger.newFile("John_Valjohn")
--- End diff --

Oops, my bad. It's hard to get this one right.


---

[GitHub] spark pull request: Update GradientDescentSuite.scala

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/588#issuecomment-41764997
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14581/


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41764738
  
Also, we should definitely document how to set up PySpark on YARN, so the 
user doesn't have to jump through hoops to get a simple job running. The 
biggest thing is probably to emphasize that it only works if we build with Maven. 
Maybe we should also have a section that explains what to do when you run into 
the unhelpful `java.io.EOFException`. Or better still, throw a nicer 
exception message that prints out the `PYTHONPATH` and complains that it can't 
find pyspark.
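
A minimal sketch of what such a friendlier failure could look like on the JVM side; the helper name, its placement, and the message text are all assumptions, not the actual Spark code:

```scala
// Hypothetical helper: fail fast with an actionable message instead of an
// opaque java.io.EOFException when the Python worker cannot import pyspark.
def requirePySparkOnPath(): Unit = {
  val pythonPath = Option(System.getenv("PYTHONPATH")).getOrElse("")
  val hasPySpark = pythonPath.split(java.io.File.pathSeparator).exists { dir =>
    new java.io.File(dir, "pyspark").isDirectory ||
      dir.endsWith(".zip") // e.g. a bundled py4j/pyspark archive
  }
  if (!hasPySpark) {
    throw new org.apache.spark.SparkException(
      s"Could not find the pyspark module on the worker. PYTHONPATH was: " +
      s"'$pythonPath'. Make sure Spark was built with Maven and that the " +
      "Python files are shipped to the YARN containers.")
  }
}
```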


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41764569
  
Thanks - merged!


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/30#discussion_r12129988
  
--- Diff: sbin/spark-config.sh ---
@@ -34,3 +34,6 @@ this="$config_bin/$script"
 export SPARK_PREFIX=`dirname "$this"`/..
 export SPARK_HOME=${SPARK_PREFIX}
 export SPARK_CONF_DIR="$SPARK_HOME/conf"
+# Add the PySpark classes to the PYTHONPATH:
+export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
--- End diff --

Looks like we never addressed this. Should we move this into 
`spark-submit`, now that we have that?


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41764337
  
Just confirmed that this works on a CDH cluster. This should be ready for 
merge.


---


[GitHub] spark pull request: SPARK-1556: bump jets3t version to 0.9.0

2014-04-29 Thread darose
Github user darose commented on the pull request:

https://github.com/apache/spark/pull/468#issuecomment-41764125
  
So @srowen, I think @mateiz is right: the CDH5 spark-core package (on 
Ubuntu it's version 0.9.0+cdh5.0.0+31-1.cdh5.0.0.p0.31~precise-cdh5.0.0) won't 
function correctly due to this issue, and so it would need to be rebuilt against 
jets3t 0.9.0.


---


[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/591#discussion_r12129800
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -64,7 +70,8 @@ private[spark] class EventLoggingListener(
   def start() {
     logInfo("Logging events to %s".format(logDir))
     if (shouldCompress) {
-      val codec = conf.get("spark.io.compression.codec", CompressionCodec.DEFAULT_COMPRESSION_CODEC)
--- End diff --

I renamed it to sparkConf, because there's also a hadoopConf now


---


[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/591#issuecomment-41763679
  
Unfortunately I only had time to do a cursory glance. I mostly just sanity-checked 
the changes to the non-test code to make sure there were no mistakes. 
Looks good to me pending some small comments about the tests.


---


[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/591#discussion_r12129684
  
--- Diff: core/src/test/scala/org/apache/spark/util/FileLoggerSuite.scala ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import java.io.IOException
+
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.hadoop.fs.Path
+import org.scalatest.{BeforeAndAfter, FunSuite}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.io.CompressionCodec
+
+/**
+ * Test writing files through the FileLogger.
+ */
+class FileLoggerSuite extends FunSuite with BeforeAndAfter {
+  private val fileSystem = Utils.getHadoopFileSystem("/")
+  private val allCompressionCodecs = Seq[String](
+    "org.apache.spark.io.LZFCompressionCodec",
+    "org.apache.spark.io.SnappyCompressionCodec"
+  )
+  private val logDir = "/tmp/test-file-logger"
+  private val logDirPath = new Path(logDir)
+
+  after {
+    Try { fileSystem.delete(logDirPath, true) }
+  }
+
+  test("Simple logging") {
+    testSingleFile()
+  }
+
+  test("Simple logging with compression") {
+    allCompressionCodecs.foreach { codec =>
+      testSingleFile(Some(codec))
+    }
+  }
+
+  test("Logging multiple files") {
+    testMultipleFiles()
+  }
+
+  test("Logging multiple files with compression") {
+    allCompressionCodecs.foreach { codec =>
+      testMultipleFiles(Some(codec))
+    }
+  }
+
+  test("Logging when directory already exists") {
+    // Create the logging directory multiple times
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+
+    // If overwrite is not enabled, an exception should be thrown
+    intercept[IOException] { new FileLogger(logDir, new SparkConf, overwrite = false) }
+  }
+
+  /* ----------------- *
+   * Actual test logic *
+   * ----------------- */
+
+  /**
+   * Test logging to a single file.
+   */
+  private def testSingleFile(codecName: Option[String] = None) {
+    val conf = getLoggingConf(codecName)
+    val codec = codecName.map { c => CompressionCodec.createCodec(conf) }
+    val logger =
+      if (codecName.isDefined) {
+        new FileLogger(logDir, conf, compress = true)
+      } else {
+        new FileLogger(logDir, conf)
+      }
+    assert(fileSystem.exists(logDirPath))
+    assert(fileSystem.getFileStatus(logDirPath).isDir)
+    assert(fileSystem.listStatus(logDirPath).size === 0)
+
+    logger.newFile()
+    val files = fileSystem.listStatus(logDirPath)
+    assert(files.size === 1)
+    val firstFile = files.head
+    val firstFilePath = firstFile.getPath
+
+    logger.log("hello")
+    logger.flush()
+    assert(readFileContent(firstFilePath, codec) === "hello")
+
+    logger.log(" world")
+    logger.close()
+    assert(readFileContent(firstFilePath, codec) === "hello world")
+  }
+
+  /**
+   * Test logging to multiple files.
+   */
+  private def testMultipleFiles(codecName: Option[String] = None) {
+    val conf = getLoggingConf(codecName)
+    val codec = codecName.map { c => CompressionCodec.createCodec(conf) }
+    val logger =
+      if (codecName.isDefined) {
+        new FileLogger(logDir, conf, compress = true)
+      } else {
+        new FileLogger(logDir, conf)
+      }
+
+    logger.newFile("Jean_Valjean")
+    logger.logLine("Who am I?")
+    logger.logLine("Destiny?")
+    logger.newFile("John_Valjohn")
--- End diff --

You spelled this `John_Valjohn` but the correct spelling is `Hugh Jackman`.


---

[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/591#discussion_r12129638
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala ---
@@ -0,0 +1,413 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.json4s.jackson.JsonMethods._
+import org.scalatest.{BeforeAndAfter, FunSuite}
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.io.CompressionCodec
+import org.apache.spark.util.{JsonProtocol, Utils}
+
+/**
+ * Test whether EventLoggingListener logs events properly.
+ *
+ * This tests whether EventLoggingListener actually creates special files while logging events,
+ * whether the parsing of these special files is correct, and whether the logged events can be
+ * read and deserialized into actual SparkListenerEvents.
+ */
+class EventLoggingListenerSuite extends FunSuite with BeforeAndAfter {
+  private val fileSystem = Utils.getHadoopFileSystem("/")
+  private val allCompressionCodecs = Seq[String](
+    "org.apache.spark.io.LZFCompressionCodec",
+    "org.apache.spark.io.SnappyCompressionCodec"
+  )
+
+  after {
+    Try { fileSystem.delete(new Path("/tmp/spark-events"), true) }
--- End diff --

It would be better if you manually specified the log directory for all of 
these tests; see the related comment below. Ideally these tests should run even on an 
OS that doesn't have a `/tmp` directory.


---


[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/591#discussion_r12129632
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala ---
@@ -0,0 +1,413 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.json4s.jackson.JsonMethods._
+import org.scalatest.{BeforeAndAfter, FunSuite}
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.io.CompressionCodec
+import org.apache.spark.util.{JsonProtocol, Utils}
+
+/**
+ * Test whether EventLoggingListener logs events properly.
+ *
+ * This tests whether EventLoggingListener actually creates special files while logging events,
+ * whether the parsing of these special files is correct, and whether the logged events can be
+ * read and deserialized into actual SparkListenerEvents.
+ */
+class EventLoggingListenerSuite extends FunSuite with BeforeAndAfter {
+  private val fileSystem = Utils.getHadoopFileSystem("/")
+  private val allCompressionCodecs = Seq[String](
+    "org.apache.spark.io.LZFCompressionCodec",
+    "org.apache.spark.io.SnappyCompressionCodec"
+  )
+
+  after {
+    Try { fileSystem.delete(new Path("/tmp/spark-events"), true) }
+    Try { fileSystem.delete(new Path("/tmp/spark-foo"), true) }
--- End diff --

You should use a utility for creating a temporary directory in these tests 
rather than hard-coding `/tmp`. Some platforms (e.g. Windows) won't have this 
directory. Check out the other test suites; we do this a lot in Spark. A minimal sketch follows.
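
For instance, a hedged sketch using the JDK's temp-directory support; the suite name and test body are invented, and Spark's own `Utils` helpers could be used instead:

```scala
import java.nio.file.Files

import scala.util.Try

import org.apache.hadoop.fs.Path
import org.scalatest.{BeforeAndAfter, FunSuite}

import org.apache.spark.util.Utils

class TempDirSuite extends FunSuite with BeforeAndAfter {
  private val fileSystem = Utils.getHadoopFileSystem("/")

  // A unique, platform-appropriate directory; no hard-coded /tmp.
  private var testDir: java.io.File = _

  before {
    testDir = Files.createTempDirectory("spark-events-test").toFile
  }

  after {
    Try { fileSystem.delete(new Path(testDir.getAbsolutePath), true) }
  }

  test("log directory exists") {
    assert(testDir.isDirectory)
  }
}
```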


---


[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/591#discussion_r12129472
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -64,7 +70,8 @@ private[spark] class EventLoggingListener(
   def start() {
     logInfo("Logging events to %s".format(logDir))
    if (shouldCompress) {
-      val codec = conf.get("spark.io.compression.codec", CompressionCodec.DEFAULT_COMPRESSION_CODEC)
--- End diff --

I don't think this was over the line limit before (?). Wouldn't it have 
failed our style checks?


---


[GitHub] spark pull request: Args for worker rather than master

2014-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/587


---


[GitHub] spark pull request: Handle the vals that never used

2014-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/565


---


[GitHub] spark pull request: [SPARK-1646] Micro-optimisation of ALS

2014-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/568


---


[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/571#issuecomment-41762289
  
 Build triggered. 


---


[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/571#issuecomment-41762297
  
Build started. 


---


[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0

2014-04-29 Thread techaddict
Github user techaddict commented on the pull request:

https://github.com/apache/spark/pull/571#issuecomment-41761434
  
@mengxr how do you suggest we resolve the `private[spark]` problem?


---


[GitHub] spark pull request: Handle the vals that never used

2014-04-29 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/565#issuecomment-41761240
  
Thanks. I've merged this.


---


[GitHub] spark pull request: Args for worker rather than master

2014-04-29 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/587#issuecomment-41761176
  
Thanks. I've merged this.


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41761129
  
@andrewor14 if you could do a final test on this just to double check, I 
think it's good to go.


---


[GitHub] spark pull request: [SPARK-1646] Micro-optimisation of ALS

2014-04-29 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/568#issuecomment-41761125
  
Merged. Thanks!


---


[GitHub] spark pull request: Update GradientDescentSuite.scala

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/588#issuecomment-41761073
  
Merged build started. 


---


[GitHub] spark pull request: Update GradientDescentSuite.scala

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/588#issuecomment-41761067
  
 Merged build triggered. 


---


[GitHub] spark pull request: Update GradientDescentSuite.scala

2014-04-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/588#issuecomment-41761064
  
Jenkins, test this please.


---


[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0

2014-04-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/571#issuecomment-41761038
  
@techaddict I don't see anything that needs changing except the examples using 
`private[spark]` or `private[streaming]` methods. For consistency, 
`org.apache.spark.examples` may be better because many examples are already under 
this package name.


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41759555
  
Merged build finished. All automated tests passed.


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41759556
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14580/


---


[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0

2014-04-29 Thread techaddict
Github user techaddict commented on the pull request:

https://github.com/apache/spark/pull/571#issuecomment-41758235
  
@mengxr so what changes should I make other than the streaming one suggested by 
@tdas?


---


[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/591#issuecomment-41758189
  
Merged build finished. All automated tests passed.


---


[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/591#issuecomment-41758190
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14579/


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41758004
  
Merged build started. 


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41758002
  
 Merged build triggered. 


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41757854
  
The updated PR moves unzipping py4j to an earlier phase so that it gets 
included the first time around. I tested it out and saw it appear in the jar on the 
first build.


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41757036
  
@sryza I also asked @ahirreddy to look into whether we can just publish a 
jar to maven central that contains the py4j python side. Then we can just 
depend on that jar and be done with it and not count on `unzip` being 
installed. He seemed to think it was possible...
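
If that worked out, consumption could be as simple as one extra build dependency; the coordinates below are hypothetical, since no such Python-side artifact existed at the time (only the Java-side py4j jar did):

```scala
// build.sbt sketch; "py4j-python" is a hypothetical artifact that would
// bundle py4j's Python files so we no longer depend on `unzip`.
libraryDependencies += "net.sf.py4j" % "py4j-python" % "0.8.1"
```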


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41756809
  
Yeah, that's the issue.  Looking into the best way to fix it.


---


[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41756737
  
@andrewor14 So you are saying that if I just download this from a blank 
build it won't work... but if I happen to build twice it will work.

I wonder if the issue might be that the maven-exec-plugin isn't guaranteed 
to execute before the packaging of the jar itself.


---


[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/591#issuecomment-41756602
  
 Merged build triggered. 


---


[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/591#issuecomment-41756609
  
Merged build started. 


---


[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread jegonzal
Github user jegonzal commented on the pull request:

https://github.com/apache/spark/pull/476#issuecomment-41755910
  
I would be happy to talk more about this after the OSDI deadline. As far 
as storing the model (or more precisely the counts and samples) as an RDD, I 
think this really is necessary. The model in this case should be on the order 
of the size of the data.

Essentially what you want is the ability to join the term-topic counts with 
the document-topic counts for each token in a given document. Given these two 
count tables (along with the background distribution of topics in the entire 
corpus) you can compute the new topic assignment.

Here is an implementation of the collapsed Gibbs sampler for LDA using 
GraphX: https://github.com/amplab/graphx/pull/113
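
A rough sketch of that join structure in plain RDD terms; all names and types are invented for illustration, and the write-back of updated counts after each draw is elided:

```scala
import org.apache.spark.SparkContext._ // implicit pair-RDD operations (Spark 1.x)
import org.apache.spark.rdd.RDD

object CollapsedGibbsSketch {
  type Counts = Array[Long]

  // One resampling pass: join each (docId, termId) token with the two count
  // tables, then draw a new topic from the collapsed conditional.
  def resample(
      tokens: RDD[(Long, Int)],             // (docId, termId)
      termTopicCounts: RDD[(Int, Counts)],  // termId -> per-topic counts
      docTopicCounts: RDD[(Long, Counts)],  // docId  -> per-topic counts
      globalTopicCounts: Counts,            // corpus-wide topic counts
      vocabSize: Int): RDD[((Long, Int), Int)] = {
    tokens
      .map { case (docId, termId) => (termId, docId) }
      .join(termTopicCounts)                // attach term-topic counts
      .map { case (termId, (docId, termCounts)) => (docId, (termId, termCounts)) }
      .join(docTopicCounts)                 // attach doc-topic counts
      .map { case (docId, ((termId, termCounts), docCounts)) =>
        ((docId, termId), drawTopic(termCounts, docCounts, globalTopicCounts, vocabSize))
      }
  }

  // Simplified collapsed conditional with symmetric priors; a real sampler
  // first subtracts the token's current assignment from all three tables.
  def drawTopic(term: Counts, doc: Counts, global: Counts,
      vocabSize: Int, alpha: Double = 1.0, beta: Double = 1.0): Int = {
    val weights = term.indices.map { k =>
      (doc(k) + alpha) * (term(k) + beta) / (global(k) + vocabSize * beta)
    }
    var acc = 0.0
    val r = scala.util.Random.nextDouble() * weights.sum
    val idx = weights.indexWhere { w => acc += w; acc >= r }
    if (idx >= 0) idx else weights.length - 1
  }
}
```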




---


[GitHub] spark pull request: [SQL] SPARK-1661 - Fix regex_serde test

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/595#issuecomment-41755913
  
Merged build finished. All automated tests passed.


---


[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12127214
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
--- End diff --

You mean "termId: count" ? Yes it is a common way to do that. But I just 
consider the trade-off between statistical efficiency and hardware efficiency. 
If we combine the same term together in one document, it seems that the 
randomness is worse. Anyway, I'll try to modified it using SparseVector.
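
As a hedged sketch of that representation change, using MLlib's `Vectors.sparse`; the helper name and the bag-of-words input shape are invented:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// One document as term frequencies: each distinct termId appears once,
// with its count, instead of being repeated count times.
def toSparseDoc(vocabSize: Int, tokens: Seq[Int]): Vector = {
  val counts = tokens.groupBy(identity).mapValues(_.size.toDouble)
  val (indices, values) = counts.toSeq.sortBy(_._1).unzip
  Vectors.sparse(vocabSize, indices.toArray, values.toArray)
}

// e.g. toSparseDoc(5, Seq(0, 2, 2, 4)) == (5, [0, 2, 4], [1.0, 2.0, 1.0])
```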


---


[GitHub] spark pull request: [SQL] SPARK-1661 - Fix regex_serde test

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/595#issuecomment-41755914
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14578/


---


[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12127051
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
+
+case class LDAParams(
+    docCounts: Vector,
+    topicCounts: Vector,
+    docTopicCounts: Array[Vector],
+    topicTermCounts: Array[Vector])
--- End diff --

That makes sense. I think `docTopicCounts` could be sliced easily with 
respect to document partitions, but `topicTermCounts` is hard to slice. 
I'll find a way to handle it.


---


[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread yinxusen
Github user yinxusen commented on the pull request:

https://github.com/apache/spark/pull/476#issuecomment-41755358
  
Yep, I know @jegonzal from his paper *Parallel Gibbs Sampling*, but I am only 
familiar with the implementation on GraphLab and could not find the implementation in 
GraphX. It would be great to have the chance to talk with Joseph offline.

Besides, I will add a use case for the Reuters dataset and try to fix the 
issues raised above.


---


[GitHub] spark pull request: Update GradientDescentSuite.scala

2014-04-29 Thread baishuo
Github user baishuo commented on the pull request:

https://github.com/apache/spark/pull/588#issuecomment-41755330
  
I have made the modifications; thanks to @techaddict and @mengxr.


---


[GitHub] spark pull request: [WIP][Spark-SQL] Optimize the Constant Folding...

2014-04-29 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/482#issuecomment-41754201
  
Understood. Thank you @marmbrus; I will add more unit tests based on the 
current implementation.


---


[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on the pull request:

https://github.com/apache/spark/pull/476#issuecomment-41753636
  
Also, speaking of @jegonzal, maybe this is a natural first point of 
integration between MLlib and GraphX. I know GraphX has an implementation of 
LDA built in, and maybe this is a chance for us to leverage that work.


---


[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on the pull request:

https://github.com/apache/spark/pull/476#issuecomment-41753574
  
Before I get too deep into this review, I want to step back and think 
about whether we expect the model in this case to be on the order of the size 
of the data. I think it is, and if so, we may want to consider representing 
the model as `RDD[DocumentTopicFeatures]` and `RDD[TopicWordFeatures]`, similar to 
what we do with ALS. This may change the algorithm substantially.

Separately, maybe it makes sense to have a concrete use case to work with 
(the Reuters dataset or something) so that we can evaluate how much memory actually 
gets used given a reasonably sized corpus.

Perhaps @mengxr or @jegonzal has a strong opinion on this.
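
A minimal sketch of what that distributed representation might look like; the case-class names come from the comment above, while their fields and the model wrapper are assumptions:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical distributed model, mirroring ALS's split of user/product
// factors: per-document and per-word statistics each live in their own RDD,
// so the model scales with the corpus instead of fitting on the driver.
case class DocumentTopicFeatures(docId: Long, topicCounts: Array[Double])
case class TopicWordFeatures(wordId: Int, topicCounts: Array[Double])

case class DistributedLDAModel(
    docFeatures: RDD[DocumentTopicFeatures],
    wordFeatures: RDD[TopicWordFeatures]) {

  // Example query: the per-document topic mixture, normalized to sum to 1.
  def topicDistributions: RDD[(Long, Array[Double])] =
    docFeatures.map { d =>
      val total = d.topicCounts.sum
      (d.docId, d.topicCounts.map(_ / math.max(total, 1e-12)))
    }
}
```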


---


[GitHub] spark pull request: [WIP][Spark-SQL] Optimize the Constant Folding...

2014-04-29 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/482#issuecomment-41753498
  
I think the problem with the previous implementation is that the change is much
larger and therefore harder to understand. (It took me a while to look at it
and find bugs, but there were quite a few and I'm sure I didn't find them all.)

A lot of concepts _could_ be added to the expression hierarchy, but the
goal of Catalyst is to instead allow self-contained rules to perform
optimizations.  By keeping these optimization rules self-contained it is much
easier to look in one place and reason about their correctness.  Making the
rule simple while pushing all of the complex logic to many different places in
the code is not the goal here.

The only reason nullability is modeled in the expression hierarchy at all
is because it is a fairly fundamental concept that is part of the schema of
the data.

Regarding HiveGenericUdf, I'm not sure how the three-state solution allows
us to do any extra folding.  The UDF is still a black box and we have no idea
how it's going to respond to null inputs.

Regarding your last point about code generation, we don't need three
states.  We already know whether we need to track a nullable bit when doing code
generation with the current two-state implementation.  If it is possible to
statically determine a value to be null, that will have already been done by the
optimization rules and we will just have a null literal.  All we need is a
special rule that generates the AST for `Literal(null, _)`.  So, actually, I
think the code generation logic is simpler with this implementation.
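As a self-contained toy (not Catalyst's actual codegen API), the null-literal special case might look like this, with everything else falling through to the ordinary two-state path:

```scala
// Toy expression tree; the names are illustrative, not Catalyst's.
sealed trait Expr
case class Lit(value: Any) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

def genCode(e: Expr): String = e match {
  // Special rule: a statically-null value has already been folded to a null
  // literal by the optimizer, so codegen only needs to set the null bit.
  case Lit(null) => "isNull = true;"
  case Lit(v)    => s"isNull = false; value = $v;"
  case Add(l, r) => s"{ ${genCode(l)} ${genCode(r)} /* add with runtime null check */ }"
}
```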




[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/582#issuecomment-41753419
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14577/




[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41753415
  
Update: with the latest commits from master, I am able to run PySpark on
YARN successfully on both CDH and HDP clusters.

There is still an issue with the Maven build, however. The first build
produces a jar that does not include the `py4j/*.py` files, while the second
build produces one that does include all the needed files. This is because we
try to include these Python files before unzipping `py4j*zip`.




[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/582#issuecomment-41753418
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12126271
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -20,15 +20,17 @@ package org.apache.spark.mllib.util
 import scala.reflect.ClassTag
 
 import breeze.linalg.{Vector => BV, SparseVector => BSV, squaredDistance 
=> breezeSquaredDistance}
+import breeze.util.Index
+import chalk.text.tokenize.JavaWordTokenizer
--- End diff --

I think chalk for tokenization is probably a fine choice for starters, but
can you say what it buys us over the regular Scala StringTokenizer?
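For reference, the plain-JDK alternative being weighed here is roughly:

```scala
import java.util.StringTokenizer

// Whitespace-only tokenization with no linguistic smarts; a tokenizer library
// would additionally handle punctuation, case folding, etc.
val st = new StringTokenizer("Gibbs sampling, for LDA.")
while (st.hasMoreTokens) println(st.nextToken())
```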




[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12126247
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.expectation
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV, sum}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.clustering.{Document, LDAParams}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * Gibbs sampling from a given dataset and org.apache.spark.mllib.model.
+ * @param data Dataset, such as corpus.
+ * @param numOuterIterations Number of outer iteration.
+ * @param numInnerIterations Number of inner iteration, used in each 
partition.
+ * @param docTopicSmoothing Document-topic smoothing.
+ * @param topicTermSmoothing Topic-term smoothing.
+ */
+class GibbsSampling(
+data: RDD[Document],
+numOuterIterations: Int,
+numInnerIterations: Int,
+docTopicSmoothing: Double,
+topicTermSmoothing: Double)
+  extends Logging with Serializable {
+
+  import GibbsSampling._
+
+  /**
+   * Main function of running a Gibbs sampling method. It contains two 
phases of total Gibbs
+   * sampling: first is initialization, second is real sampling.
+   */
+  def runGibbsSampling(
+  initParams: LDAParams,
+  data: RDD[Document] = data,
+  numOuterIterations: Int = numOuterIterations,
+  numInnerIterations: Int = numInnerIterations,
+  docTopicSmoothing: Double = docTopicSmoothing,
+  topicTermSmoothing: Double = topicTermSmoothing): LDAParams = {
+
+val numTerms = initParams.topicTermCounts.head.size
+val numDocs = initParams.docCounts.size
+val numTopics = initParams.topicCounts.size
+
+// Construct topic assignment RDD
+logInfo("Start initialization")
+
+val cpInterval = 
System.getProperty("spark.gibbsSampling.checkPointInterval", "10").toInt
+val sc = data.context
+val (initialParams, initialChosenTopics) = 
sampleTermAssignment(initParams, data)
+
+// Gibbs sampling
+val (params, _, _) = Iterator.iterate((sc.accumulable(initialParams), 
initialChosenTopics, 0)) {
+  case (lastParams, lastChosenTopics, i) =>
+logInfo("Start Gibbs sampling")
+
+val rand = new Random(42 + i * i)
+val params = sc.accumulable(LDAParams(numDocs, numTopics, 
numTerms))
+val chosenTopics = data.zip(lastChosenTopics).map {
+  case (Document(docId, content), topics) =>
+content.zip(topics).map { case (term, topic) =>
+  lastParams += (docId, term, topic, -1)
+
+  val chosenTopic = lastParams.localValue.dropOneDistSampler(
+docTopicSmoothing, topicTermSmoothing, term, docId, rand)
+
+  lastParams += (docId, term, chosenTopic, 1)
+  params += (docId, term, chosenTopic, 1)
+
+  chosenTopic
+}
+}.cache()
+
+if ((i + 1) % cpInterval == 0) {
+  chosenTopics.checkpoint()
+}
+
+// Trigger a job to collect accumulable LDA parameters.
+chosenTopics.count()
+lastChosenTopics.unpersist()
+
+(params, chosenTopics, i + 1)
+}.drop(1 + numOuterIterations).next()
+
+params.value
+  }
+
+  /**
+   * Model matrix Phi and Theta are inferred via LDAParams.
+   */
+  def solvePhiAndTheta(
+  params: LDAParams,
+  docTopicSmoothing: Double = docTopicSmoothing,
+  topicTermSmoothing: Double = topicTermSmoothing): (Array[Vector], 
Array[Vector]) = {
--- End diff --

Again, Phi and Theta might be too big.



[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12126191
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.expectation
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV, sum}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.clustering.{Document, LDAParams}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * Gibbs sampling from a given dataset and org.apache.spark.mllib.model.
+ * @param data Dataset, such as corpus.
+ * @param numOuterIterations Number of outer iteration.
+ * @param numInnerIterations Number of inner iteration, used in each 
partition.
+ * @param docTopicSmoothing Document-topic smoothing.
+ * @param topicTermSmoothing Topic-term smoothing.
+ */
+class GibbsSampling(
+data: RDD[Document],
+numOuterIterations: Int,
+numInnerIterations: Int,
+docTopicSmoothing: Double,
+topicTermSmoothing: Double)
+  extends Logging with Serializable {
+
+  import GibbsSampling._
+
+  /**
+   * Main function of running a Gibbs sampling method. It contains two 
phases of total Gibbs
+   * sampling: first is initialization, second is real sampling.
+   */
+  def runGibbsSampling(
+  initParams: LDAParams,
+  data: RDD[Document] = data,
+  numOuterIterations: Int = numOuterIterations,
+  numInnerIterations: Int = numInnerIterations,
+  docTopicSmoothing: Double = docTopicSmoothing,
+  topicTermSmoothing: Double = topicTermSmoothing): LDAParams = {
+
+val numTerms = initParams.topicTermCounts.head.size
+val numDocs = initParams.docCounts.size
+val numTopics = initParams.topicCounts.size
+
+// Construct topic assignment RDD
+logInfo("Start initialization")
+
+val cpInterval = 
System.getProperty("spark.gibbsSampling.checkPointInterval", "10").toInt
+val sc = data.context
+val (initialParams, initialChosenTopics) = 
sampleTermAssignment(initParams, data)
+
+// Gibbs sampling
+val (params, _, _) = Iterator.iterate((sc.accumulable(initialParams), 
initialChosenTopics, 0)) {
--- End diff --

Why an accumulator and not an .aggregate()?
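For illustration, the `aggregate()` shape being suggested could look like the following; `TopicAssignment` and the `assignments` RDD are hypothetical, standing in for the per-token results of one sampling pass, and `LDAParams` is the class from the diff above:

```scala
// Hypothetical per-token record produced by one sampling pass.
case class TopicAssignment(docId: Int, term: Int, topic: Int)

// Merge per-partition counts without a driver-side accumulator: seqOp folds
// each assignment into a partition-local LDAParams, combOp merges partials.
val params = assignments.aggregate(LDAParams(numDocs, numTopics, numTerms))(
  seqOp = (acc, a) => acc.update(a.docId, a.term, a.topic, 1),
  combOp = (left, right) => left.merge(right))
```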




[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12126176
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.expectation
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV, sum}
+
+import org.apache.spark.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.clustering.{Document, LDAParams}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+
+/**
+ * Gibbs sampling from a given dataset and org.apache.spark.mllib.model.
+ * @param data Dataset, such as corpus.
+ * @param numOuterIterations Number of outer iteration.
+ * @param numInnerIterations Number of inner iteration, used in each 
partition.
+ * @param docTopicSmoothing Document-topic smoothing.
+ * @param topicTermSmoothing Topic-term smoothing.
+ */
+class GibbsSampling(
--- End diff --

Gibbs sampling is a very useful general-purpose tool to have. Its
interface should be something more generic than RDD[Document], and the
parameters should be amenable to domains other than text.
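One possible shape for a more generic interface, purely as a sketch (the type parameters are hypothetical):

```scala
import org.apache.spark.rdd.RDD

// A sampler parameterized over the observation and model types, so that text
// is just one instantiation (Datum = Document, Model = LDAParams).
trait GibbsSampler[Datum, Model] {
  def initialize(data: RDD[Datum]): Model
  def sampleOnce(data: RDD[Datum], model: Model, seed: Long): Model
}
```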




[GitHub] spark pull request: [WIP][Spark-SQL] Optimize the Constant Folding...

2014-04-29 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/482#issuecomment-41752725
  
@marmbrus I am wondering whether the previous implementation (changing the
nullability states from 2 to 3) is perhaps more convenient in some ways:
* The constant folding rule stays quite simple: `case e if e.nullable ==
alwaysNull => Literal(null, e.dataType)`.
* We may not be able to enumerate the extended expressions (HiveGenericUdf, for
example) in Catalyst.
* Nullability is a very helpful hint in the expression codegen:

Nullability | eval code | expression null indicator variable | expression variable
--- | --- | --- | ---
Possibly be null | Yes | Yes | Yes
Always be null | No | Yes | No
Never be null | Yes | No | Yes
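A minimal sketch of the three-state scheme being described (hypothetical; Catalyst's actual `nullable` is a two-state Boolean):

```scala
// Three-valued nullability as a sealed ADT, matching the table above.
sealed trait Nullability
case object NeverNull    extends Nullability // eval code, no null indicator
case object PossiblyNull extends Nullability // eval code plus null indicator
case object AlwaysNull   extends Nullability // null indicator only, no eval code
```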




[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12126094
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
+
+case class LDAParams (
+docCounts: Vector,
+topicCounts: Vector,
+docTopicCounts: Array[Vector],
+topicTermCounts: Array[Vector])
+  extends Serializable {
+
+  def update(docId: Int, term: Int, topic: Int, inc: Int) = {
+docCounts.toBreeze(docId) += inc
+topicCounts.toBreeze(topic) += inc
+docTopicCounts(docId).toBreeze(topic) += inc
+topicTermCounts(topic).toBreeze(term) += inc
+this
+  }
+
+  def merge(other: LDAParams) = {
+docCounts.toBreeze += other.docCounts.toBreeze
+topicCounts.toBreeze += other.topicCounts.toBreeze
+
+var i = 0
+while (i < docTopicCounts.length) {
+  docTopicCounts(i).toBreeze += other.docTopicCounts(i).toBreeze
--- End diff --

More Breeze conversion.




[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12126106
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
+
+case class LDAParams (
+docCounts: Vector,
+topicCounts: Vector,
+docTopicCounts: Array[Vector],
+topicTermCounts: Array[Vector])
+  extends Serializable {
+
+  def update(docId: Int, term: Int, topic: Int, inc: Int) = {
+docCounts.toBreeze(docId) += inc
+topicCounts.toBreeze(topic) += inc
+docTopicCounts(docId).toBreeze(topic) += inc
+topicTermCounts(topic).toBreeze(term) += inc
+this
+  }
+
+  def merge(other: LDAParams) = {
+docCounts.toBreeze += other.docCounts.toBreeze
+topicCounts.toBreeze += other.topicCounts.toBreeze
+
+var i = 0
+while (i < docTopicCounts.length) {
+  docTopicCounts(i).toBreeze += other.docTopicCounts(i).toBreeze
+  i += 1
+}
+
+i = 0
+while (i < topicTermCounts.length) {
+  topicTermCounts(i).toBreeze += other.topicTermCounts(i).toBreeze
+  i += 1
+}
+this
+  }
+
+  /**
+   * This function computes the new distribution after dropping one term from the
+   * current document, which is an essential part of Gibbs sampling for LDA; you can
+   * refer to the paper: Parameter estimation for text analysis
--- End diff --

Link to paper, please.




[GitHub] spark pull request: [SQL] SPARK-1661 - Fix regex_serde test

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/595#issuecomment-41752463
  
Merged build started. 




[GitHub] spark pull request: [SQL] SPARK-1661 - Fix regex_serde test

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/595#issuecomment-41752460
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1674] fix interrupted system call error...

2014-04-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/594




[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12126093
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
+
+case class LDAParams (
+docCounts: Vector,
+topicCounts: Vector,
+docTopicCounts: Array[Vector],
+topicTermCounts: Array[Vector])
+  extends Serializable {
+
+  def update(docId: Int, term: Int, topic: Int, inc: Int) = {
+docCounts.toBreeze(docId) += inc
--- End diff --

Doing the Breeze conversion on every update seems inefficient. These
variables should be private and created as Breeze vectors at initialization;
only the user-facing APIs need to be Vector, Array[Vector], etc.
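A minimal sketch of that suggestion, with illustrative names: keep mutable Breeze vectors internally and convert to the public `Vector` type only at the API boundary.

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

class LDACounts(numDocs: Int, numTopics: Int) {
  // Internal state stays in Breeze form and is mutated in place.
  private val docCounts = BDV.zeros[Double](numDocs)
  private val topicCounts = BDV.zeros[Double](numTopics)

  def update(docId: Int, topic: Int, inc: Double): Unit = {
    docCounts(docId) += inc
    topicCounts(topic) += inc
  }

  // Conversion happens once, at the user-facing boundary.
  def docCountsVector: Vector = Vectors.dense(docCounts.toArray)
}
```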




[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12126088
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
+
+case class LDAParams (
+docCounts: Vector,
+topicCounts: Vector,
+docTopicCounts: Array[Vector],
+topicTermCounts: Array[Vector])
+  extends Serializable {
+
+  def update(docId: Int, term: Int, topic: Int, inc: Int) = {
+docCounts.toBreeze(docId) += inc
+topicCounts.toBreeze(topic) += inc
+docTopicCounts(docId).toBreeze(topic) += inc
--- End diff --

Again, I think in this case the *model* might be really big - e.g. a
billion documents across hundreds of topics, or, on the term side, a vocabulary
of millions of words across hundreds of topics.




[GitHub] spark pull request: [SQL] SPARK-1661 - Fix regex_serde test

2014-04-29 Thread marmbrus
GitHub user marmbrus opened a pull request:

https://github.com/apache/spark/pull/595

[SQL] SPARK-1661 - Fix regex_serde test

The JIRA in question is actually reporting a bug with Shark, but I wanted 
to make sure Spark SQL did not have similar problems.  This fixes a bug in our 
parsing code that was preventing the test from executing, but it looks like the 
RegexSerDe is working in Spark SQL.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/marmbrus/spark fixRegexSerdeTest

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/595.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #595


commit efa64024e2d8b3fad23f9773e2cbff88d9884048
Author: Michael Armbrust 
Date:   2014-04-30T01:34:08Z

Fix Hive serde_regex test.

commit a4dc6123e0d2bf030815c9a0528ac294141b8a24
Author: Michael Armbrust 
Date:   2014-04-30T01:34:56Z

Add files created by hive to gitignore.






[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12126007
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
+
+case class LDAParams (
--- End diff --

Also - I don't think it should be a case class.




[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12126002
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
+
+case class LDAParams (
--- End diff --

This should be called just "LDA" since it's the class that fits the model.




[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12125991
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
--- End diff --

Hmm... if documents are just an ID and a list of token IDs, maybe something 
like a SparseVector is a better representation?
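To sketch that idea (the vocabulary size is an assumed input), a document becomes a sparse bag-of-words vector:

```scala
import org.apache.spark.mllib.linalg.Vectors

val vocabSize = 10                                   // assumed known up front
val tokenIds = Seq(3, 5, 3)                          // a Document's content
val termCounts = tokenIds.groupBy(identity).mapValues(_.size.toDouble).toSeq
val asVector = Vectors.sparse(vocabSize, termCounts) // indices are term ids
```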




[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...

2014-04-29 Thread etrain
Github user etrain commented on a diff in the pull request:

https://github.com/apache/spark/pull/476#discussion_r12125916
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala 
---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
+
+case class LDAParams (
+docCounts: Vector,
+topicCounts: Vector,
+docTopicCounts: Array[Vector],
+topicTermCounts: Array[Vector])
--- End diff --

I expect that this will be *really* big - maybe the last two variables 
should be RDDs - similar to what we do with ALS.




[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/582#issuecomment-41751530
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/582#issuecomment-41751535
  
Merged build started. 




[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...

2014-04-29 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/582#issuecomment-41751464
  
Makes sense from the inverse-of-the-Hessian point of view. Just remove it!




[GitHub] spark pull request: [SPARK-1672][WIP] Separate partitioning in ALS

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/593#issuecomment-41751187
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14576/




[GitHub] spark pull request: [SPARK-1672][WIP] Separate partitioning in ALS

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/593#issuecomment-41751186
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [WIP][Spark-SQL] Optimize the Constant Folding...

2014-04-29 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/482#discussion_r12125494
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -223,6 +222,11 @@ abstract class Expression extends TreeNode[Expression] 
{
   }
 }
 
+/**
+ * Root class for rewritten two-operand UDF expressions. By default, we assume
+ * such an expression produces null if either of its operands is null. Exceptional
+ * cases require updating the optimization rule at [[optimizer.ConstantFolding ConstantFolding]].
+ */
 abstract class BinaryExpression extends Expression with 
trees.BinaryNode[Expression] {
--- End diff --

That's true; documenting restrictions for BinaryExpression /
UnaryExpression is quite confusing and error-prone. Besides BinaryArithmetic,
the rewritten UDFs RLike / Cast / BinaryComparison etc. will be considered too.

Probably enumerating all of the expression types in this rule makes more
sense.




[GitHub] spark pull request: [SPARK-1674] fix interrupted system call error...

2014-04-29 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/594#issuecomment-41750700
  
Merged this, thanks.




[GitHub] spark pull request: [WIP][Spark-SQL] Optimize the Constant Folding...

2014-04-29 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/482#discussion_r12125319
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -87,6 +88,56 @@ object ColumnPruning extends Rule[LogicalPlan] {
 
 /**
  * Replaces [[catalyst.expressions.Expression Expressions]] that can be 
statically evaluated with
+ * equivalent [[catalyst.expressions.Literal Literal]] values. This rule 
is more specific with 
+ * Null value propagation from bottom to top of the expression tree.
+ */
+object NullPropagation extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case q: LogicalPlan => q transformExpressionsUp {
+  // Skip redundant folding of literals.
+  case l: Literal => l
--- End diff --

I was thinking that putting the literal matching at the beginning might
help avoid further pattern matching against the rest of the rules. It is just
a tiny performance optimization for Literal.




[GitHub] spark pull request: [SPARK-1594][MLLIB] Cleaning up MLlib APIs and...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/524#issuecomment-41750423
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [SPARK-1594][MLLIB] Cleaning up MLlib APIs and...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/524#issuecomment-41750424
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14575/




[GitHub] spark pull request: [SPARK-1674] fix interrupted system call error...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/594#issuecomment-41750082
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [SPARK-1674] fix interrupted system call error...

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/594#issuecomment-41750083
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14574/




[GitHub] spark pull request: SPARK-1544 Add support for deep decision trees...

2014-04-29 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/475#issuecomment-41749478
  
@mengxr There should be one, but I am not sure. I will get back to you on this.




[GitHub] spark pull request: SPARK-1544 Add support for deep decision trees...

2014-04-29 Thread manishamde
GitHub user manishamde reopened a pull request:

https://github.com/apache/spark/pull/475

SPARK-1544 Add support for deep decision trees.

@etrain and I came up with a PR for arbitrarily deep decision trees, at the
cost of multiple passes over the data at deep tree levels.

To summarize:
1) We take a parameter that indicates the amount of memory users want to
reserve for computation on each worker (and 2x that at the driver).
2) Using that information, we calculate two things: the maximum depth to
which we train as usual (which is, implicitly, the maximum number of nodes we
want to train in parallel), and the size of the groups we should use in the
case where we exceed this depth (a rough sketch follows).
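A back-of-the-envelope sketch of step 2, with purely illustrative numbers and formulas (not the PR's actual code):

```scala
val maxMemoryMB = 128                  // user-supplied per-worker budget
val bytesPerNode = 1L << 16            // assumed aggregate-stats size per node
val maxNodesInParallel = maxMemoryMB * (1L << 20) / bytesPerNode
// A complete binary tree has 2^d nodes at depth d, so train level-by-level
// as usual until a level no longer fits in the budget.
val maxParallelDepth =
  (math.log(maxNodesInParallel.toDouble) / math.log(2.0)).toInt
val nodeGroupSize = maxNodesInParallel // group size once that depth is exceeded
```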

cc: @atalwalkar, @hirakendu, @mengxr

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/manishamde/spark deep_tree

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/475.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #475


commit 50b143a4385f209fbc1793f3e03134cab3ab9583
Author: Manish Amde 
Date:   2014-04-20T20:33:03Z

adding support for very deep trees

commit abc5a23bf80d792a345d723b44bff3ee217cd5ac
Author: Evan Sparks 
Date:   2014-04-22T01:41:36Z

Parameterizing max memory.

commit 2f6072c12a1466d783da258d4aa1bde789e7e875
Author: manishamde 
Date:   2014-04-22T03:43:47Z

Merge pull request #5 from etrain/deep_tree

Parameterizing max memory.

commit 2f1e093c5187a1ed532f9c19b25f8a2a6a46e27a
Author: Manish Amde 
Date:   2014-04-22T03:49:46Z

minor: added doc for maxMemory parameter

commit 02877721328a560f210a7906061108ce5dd4bbbe
Author: Evan Sparks 
Date:   2014-04-22T18:13:27Z

Fixing scalastyle issue.

commit fecf89a51d6efc9e2ff06e700338ea944a4dd580
Author: manishamde 
Date:   2014-04-22T18:15:57Z

Merge pull request #6 from etrain/deep_tree

Fixing scalastyle issue.

commit 719d0098bb08b50e523cec3e388115d5a206512b
Author: Manish Amde 
Date:   2014-04-24T00:04:05Z

updating user documentation

commit 9dbdabeeacc5fe5e0f1a729ce1ed8ab6ff399000
Author: Manish Amde 
Date:   2014-04-29T21:43:19Z

merge from master

commit 15171550fe83e42fcb707744c9035ed540fb78d1
Author: Manish Amde 
Date:   2014-04-29T21:45:34Z

updated documentation






[GitHub] spark pull request: SPARK-1544 Add support for deep decision trees...

2014-04-29 Thread manishamde
Github user manishamde closed the pull request at:

https://github.com/apache/spark/pull/475




[GitHub] spark pull request: [SPARK-1646] Micro-optimisation of ALS

2014-04-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/568#issuecomment-41749414
  
LGTM. Thanks!




[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41749231
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14573/




[GitHub] spark pull request: SPARK-1004. PySpark on YARN

2014-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/30#issuecomment-41749230
  
Merged build finished. All automated tests passed.



