[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/571#issuecomment-41766159 @techaddict It looks good to me.
[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/571#issuecomment-41765699 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14582/
[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/571#issuecomment-41765698 Build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/30
[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0
Github user techaddict commented on the pull request: https://github.com/apache/spark/pull/571#issuecomment-41765491 @mengxr does 47ef86c392badc58052a0414115e49c2970b31eb look good?
[GitHub] spark pull request: [SQL] Whitelist Hive Tests
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/596#issuecomment-41765479 Merged build started.
[GitHub] spark pull request: [SQL] Whitelist Hive Tests
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/596#issuecomment-41765473 Merged build triggered.
[GitHub] spark pull request: [SQL] Whitelist Hive Tests
GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/596

[SQL] Whitelist Hive Tests [WIP]

Includes #595. Just putting it here for Jenkins.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark moreTests

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/596.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #596

commit efa64024e2d8b3fad23f9773e2cbff88d9884048
Author: Michael Armbrust
Date: 2014-04-30T01:34:08Z

    Fix Hive serde_regex test.

commit a4dc6123e0d2bf030815c9a0528ac294141b8a24
Author: Michael Armbrust
Date: 2014-04-30T01:34:56Z

    Add files created by hive to gitignore.

commit 8ba8bc185327436a5ec214f0efd128c97ba25940
Author: Michael Armbrust
Date: 2014-04-30T05:56:28Z

    update whitelist

commit 1943ade74a760267e52d3990cb8f95d482af4c46
Author: Michael Armbrust
Date: 2014-04-30T06:37:48Z

    More hive gitignore

commit 8649e4467e459bc174e062ea54cbb37a5b840e4f
Author: Michael Armbrust
Date: 2014-04-30T06:38:32Z

    Add hive golden answers.
[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/571#issuecomment-41765289 I would suggest using `org.apache.spark.examples` for now, which requires little code change. Then we should try to avoid using `private[spark]` in examples. Maybe there is an automatic way to detect such usages.
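For illustration, here is a minimal sketch (all names hypothetical) of why the package choice matters: `private[spark]` members stay visible from `org.apache.spark.examples` because it sits inside the `org.apache.spark` package tree, but not from an unrelated package.

```scala
// Hypothetical names throughout; this only illustrates Scala's package-qualified access.
package org.apache.spark {
  private[spark] object InternalUtil {
    def now(): Long = System.nanoTime()
  }
}

package org.apache.spark.examples {
  object VisibilityExample {
    def main(args: Array[String]): Unit = {
      // Compiles: this package is under org.apache.spark, so private[spark] is in scope.
      println(org.apache.spark.InternalUtil.now())
    }
  }
}

package com.example {
  object Outside {
    // Would NOT compile if uncommented: InternalUtil is private[spark].
    // val t = org.apache.spark.InternalUtil.now()
  }
}
```

This is why moving the examples to a package outside the `org.apache.spark` tree would require first removing every use of `private[spark]` members.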
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/591#discussion_r12130313

--- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -64,7 +70,8 @@ private[spark] class EventLoggingListener(
   def start() {
     logInfo("Logging events to %s".format(logDir))
     if (shouldCompress) {
-      val codec = conf.get("spark.io.compression.codec", CompressionCodec.DEFAULT_COMPRESSION_CODEC)
--- End diff --

Ah, I see. I misread this as only a line-break change.
[GitHub] spark pull request: Update GradientDescentSuite.scala
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/588#issuecomment-41764996 Merged build finished.
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/591#discussion_r12130230

--- Diff: core/src/test/scala/org/apache/spark/util/FileLoggerSuite.scala ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import java.io.IOException
+
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.hadoop.fs.Path
+import org.scalatest.{BeforeAndAfter, FunSuite}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.io.CompressionCodec
+
+/**
+ * Test writing files through the FileLogger.
+ */
+class FileLoggerSuite extends FunSuite with BeforeAndAfter {
+  private val fileSystem = Utils.getHadoopFileSystem("/")
+  private val allCompressionCodecs = Seq[String](
+    "org.apache.spark.io.LZFCompressionCodec",
+    "org.apache.spark.io.SnappyCompressionCodec"
+  )
+  private val logDir = "/tmp/test-file-logger"
+  private val logDirPath = new Path(logDir)
+
+  after {
+    Try { fileSystem.delete(logDirPath, true) }
+  }
+
+  test("Simple logging") {
+    testSingleFile()
+  }
+
+  test("Simple logging with compression") {
+    allCompressionCodecs.foreach { codec =>
+      testSingleFile(Some(codec))
+    }
+  }
+
+  test("Logging multiple files") {
+    testMultipleFiles()
+  }
+
+  test("Logging multiple files with compression") {
+    allCompressionCodecs.foreach { codec =>
+      testMultipleFiles(Some(codec))
+    }
+  }
+
+  test("Logging when directory already exists") {
+    // Create the logging directory multiple times
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+
+    // If overwrite is not enabled, an exception should be thrown
+    intercept[IOException] { new FileLogger(logDir, new SparkConf, overwrite = false) }
+  }
+
+  /* ----------------- *
+   * Actual test logic *
+   * ----------------- */
+
+  /**
+   * Test logging to a single file.
+   */
+  private def testSingleFile(codecName: Option[String] = None) {
+    val conf = getLoggingConf(codecName)
+    val codec = codecName.map { c => CompressionCodec.createCodec(conf) }
+    val logger =
+      if (codecName.isDefined) {
+        new FileLogger(logDir, conf, compress = true)
+      } else {
+        new FileLogger(logDir, conf)
+      }
+    assert(fileSystem.exists(logDirPath))
+    assert(fileSystem.getFileStatus(logDirPath).isDir)
+    assert(fileSystem.listStatus(logDirPath).size === 0)
+
+    logger.newFile()
+    val files = fileSystem.listStatus(logDirPath)
+    assert(files.size === 1)
+    val firstFile = files.head
+    val firstFilePath = firstFile.getPath
+
+    logger.log("hello")
+    logger.flush()
+    assert(readFileContent(firstFilePath, codec) === "hello")
+
+    logger.log(" world")
+    logger.close()
+    assert(readFileContent(firstFilePath, codec) === "hello world")
+  }
+
+  /**
+   * Test logging to multiple files.
+   */
+  private def testMultipleFiles(codecName: Option[String] = None) {
+    val conf = getLoggingConf(codecName)
+    val codec = codecName.map { c => CompressionCodec.createCodec(conf) }
+    val logger =
+      if (codecName.isDefined) {
+        new FileLogger(logDir, conf, compress = true)
+      } else {
+        new FileLogger(logDir, conf)
+      }
+
+    logger.newFile("Jean_Valjean")
+    logger.logLine("Who am I?")
+    logger.logLine("Destiny?")
+    logger.newFile("John_Valjohn")
--- End diff --

Oops, my bad. It's hard to get this one right.
[GitHub] spark pull request: Update GradientDescentSuite.scala
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/588#issuecomment-41764997 Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14581/
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41764738 Also, we should definitely document how to set up PySpark on YARN, so the user doesn't have to jump through hoops to get a simple job running. The biggest thing is probably to emphasize that it only works if we build with Maven. Maybe we should also have a section that explains what to do when you run into the unhelpful `java.io.EOFException`. Or better still, throw a nicer exception message that prints out the PYTHONPATH and complains that it can't find pyspark.
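As a sketch of that friendlier failure mode (the check and message below are hypothetical, not Spark's actual code), the launcher could verify the Python side up front and report the PYTHONPATH it actually saw:

```scala
import java.io.File

// Hypothetical pre-flight check: fail fast with the PYTHONPATH in the message
// instead of letting an opaque java.io.EOFException surface later.
def checkPySparkOnPath(): Unit = {
  val pythonPath = sys.env.getOrElse("PYTHONPATH", "")
  val found = pythonPath.split(File.pathSeparator).exists { dir =>
    new File(dir, "pyspark/__init__.py").exists()
  }
  if (!found) {
    throw new IllegalStateException(
      s"Could not find the pyspark package on PYTHONPATH ('$pythonPath'). " +
      "On YARN this usually means Spark was not built with Maven, so the " +
      "Python files were never bundled into the assembly jar.")
  }
}
```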
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41764569 Thanks - merged!
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/30#discussion_r12129988

--- Diff: sbin/spark-config.sh ---
@@ -34,3 +34,6 @@ this="$config_bin/$script"
 export SPARK_PREFIX=`dirname "$this"`/..
 export SPARK_HOME=${SPARK_PREFIX}
 export SPARK_CONF_DIR="$SPARK_HOME/conf"
+# Add the PySpark classes to the PYTHONPATH:
+export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
--- End diff --

Looks like we never addressed this. Should we move this into `spark-submit`, now that we have that?
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41764337 Just confirmed that this works on a CDH cluster. This should be ready for merge.
[GitHub] spark pull request: SPARK-1556: bump jets3t version to 0.9.0
Github user darose commented on the pull request: https://github.com/apache/spark/pull/468#issuecomment-41764125 So @srowen, I think @mateiz is right: the CDH5 spark-core package (on Ubuntu, it's version 0.9.0+cdh5.0.0+31-1.cdh5.0.0.p0.31~precise-cdh5.0.0) won't function correctly due to this issue, and so would need to be rebuilt against jets3t 0.9.0.
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/591#discussion_r12129800

--- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -64,7 +70,8 @@ private[spark] class EventLoggingListener(
   def start() {
     logInfo("Logging events to %s".format(logDir))
     if (shouldCompress) {
-      val codec = conf.get("spark.io.compression.codec", CompressionCodec.DEFAULT_COMPRESSION_CODEC)
--- End diff --

I renamed it to sparkConf, because there's also a hadoopConf now.
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/591#issuecomment-41763679 Unfortunately I only had time to do a cursory glance. I mostly just sanity checked the changes to the non-test code to make sure there were no mistakes. Looks good to me pending some small comments about the tests.
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/591#discussion_r12129684

--- Diff: core/src/test/scala/org/apache/spark/util/FileLoggerSuite.scala ---
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+import java.io.IOException
+
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.hadoop.fs.Path
+import org.scalatest.{BeforeAndAfter, FunSuite}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.io.CompressionCodec
+
+/**
+ * Test writing files through the FileLogger.
+ */
+class FileLoggerSuite extends FunSuite with BeforeAndAfter {
+  private val fileSystem = Utils.getHadoopFileSystem("/")
+  private val allCompressionCodecs = Seq[String](
+    "org.apache.spark.io.LZFCompressionCodec",
+    "org.apache.spark.io.SnappyCompressionCodec"
+  )
+  private val logDir = "/tmp/test-file-logger"
+  private val logDirPath = new Path(logDir)
+
+  after {
+    Try { fileSystem.delete(logDirPath, true) }
+  }
+
+  test("Simple logging") {
+    testSingleFile()
+  }
+
+  test("Simple logging with compression") {
+    allCompressionCodecs.foreach { codec =>
+      testSingleFile(Some(codec))
+    }
+  }
+
+  test("Logging multiple files") {
+    testMultipleFiles()
+  }
+
+  test("Logging multiple files with compression") {
+    allCompressionCodecs.foreach { codec =>
+      testMultipleFiles(Some(codec))
+    }
+  }
+
+  test("Logging when directory already exists") {
+    // Create the logging directory multiple times
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+    new FileLogger(logDir, new SparkConf, overwrite = true)
+
+    // If overwrite is not enabled, an exception should be thrown
+    intercept[IOException] { new FileLogger(logDir, new SparkConf, overwrite = false) }
+  }
+
+  /* ----------------- *
+   * Actual test logic *
+   * ----------------- */
+
+  /**
+   * Test logging to a single file.
+   */
+  private def testSingleFile(codecName: Option[String] = None) {
+    val conf = getLoggingConf(codecName)
+    val codec = codecName.map { c => CompressionCodec.createCodec(conf) }
+    val logger =
+      if (codecName.isDefined) {
+        new FileLogger(logDir, conf, compress = true)
+      } else {
+        new FileLogger(logDir, conf)
+      }
+    assert(fileSystem.exists(logDirPath))
+    assert(fileSystem.getFileStatus(logDirPath).isDir)
+    assert(fileSystem.listStatus(logDirPath).size === 0)
+
+    logger.newFile()
+    val files = fileSystem.listStatus(logDirPath)
+    assert(files.size === 1)
+    val firstFile = files.head
+    val firstFilePath = firstFile.getPath
+
+    logger.log("hello")
+    logger.flush()
+    assert(readFileContent(firstFilePath, codec) === "hello")
+
+    logger.log(" world")
+    logger.close()
+    assert(readFileContent(firstFilePath, codec) === "hello world")
+  }
+
+  /**
+   * Test logging to multiple files.
+   */
+  private def testMultipleFiles(codecName: Option[String] = None) {
+    val conf = getLoggingConf(codecName)
+    val codec = codecName.map { c => CompressionCodec.createCodec(conf) }
+    val logger =
+      if (codecName.isDefined) {
+        new FileLogger(logDir, conf, compress = true)
+      } else {
+        new FileLogger(logDir, conf)
+      }
+
+    logger.newFile("Jean_Valjean")
+    logger.logLine("Who am I?")
+    logger.logLine("Destiny?")
+    logger.newFile("John_Valjohn")
--- End diff --

You spelled this `John_Valjohn` but the correct spelling is `Hugh Jackman`.
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/591#discussion_r12129638

--- Diff: core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala ---
@@ -0,0 +1,413 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.json4s.jackson.JsonMethods._
+import org.scalatest.{BeforeAndAfter, FunSuite}
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.io.CompressionCodec
+import org.apache.spark.util.{JsonProtocol, Utils}
+
+/**
+ * Test whether EventLoggingListener logs events properly.
+ *
+ * This tests whether EventLoggingListener actually creates special files while logging events,
+ * whether the parsing of these special files is correct, and whether the logged events can be
+ * read and deserialized into actual SparkListenerEvents.
+ */
+class EventLoggingListenerSuite extends FunSuite with BeforeAndAfter {
+  private val fileSystem = Utils.getHadoopFileSystem("/")
+  private val allCompressionCodecs = Seq[String](
+    "org.apache.spark.io.LZFCompressionCodec",
+    "org.apache.spark.io.SnappyCompressionCodec"
+  )
+
+  after {
+    Try { fileSystem.delete(new Path("/tmp/spark-events"), true) }
--- End diff --

It would be better if you manually specified the log directory for all of these tests, per the comment below. Ideally these tests should run even on an OS that doesn't have a `/tmp` directory.
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/591#discussion_r12129632

--- Diff: core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala ---
@@ -0,0 +1,413 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable
+import scala.io.Source
+import scala.util.Try
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.json4s.jackson.JsonMethods._
+import org.scalatest.{BeforeAndAfter, FunSuite}
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.io.CompressionCodec
+import org.apache.spark.util.{JsonProtocol, Utils}
+
+/**
+ * Test whether EventLoggingListener logs events properly.
+ *
+ * This tests whether EventLoggingListener actually creates special files while logging events,
+ * whether the parsing of these special files is correct, and whether the logged events can be
+ * read and deserialized into actual SparkListenerEvents.
+ */
+class EventLoggingListenerSuite extends FunSuite with BeforeAndAfter {
+  private val fileSystem = Utils.getHadoopFileSystem("/")
+  private val allCompressionCodecs = Seq[String](
+    "org.apache.spark.io.LZFCompressionCodec",
+    "org.apache.spark.io.SnappyCompressionCodec"
+  )
+
+  after {
+    Try { fileSystem.delete(new Path("/tmp/spark-events"), true) }
+    Try { fileSystem.delete(new Path("/tmp/spark-foo"), true) }
--- End diff --

You should use a utility for creating a temporary directory in these tests rather than hard-coding `/tmp`. Some platforms (e.g. Windows) won't have this directory. Check out the other test suites; we do this a lot in Spark.
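A sketch of the suggested fix, using the JDK's temp-directory utility (Spark's own `Utils.createTempDir` helper would serve the same purpose if it is accessible from the suite):

```scala
import java.nio.file.Files

import org.apache.hadoop.fs.Path

// Create a platform-appropriate scratch directory instead of hard-coding /tmp.
// Assumes the suite's Hadoop `fileSystem` handle is in scope, as above.
val testDir = Files.createTempDirectory("spark-events-test").toFile
val logDirPath = new Path(testDir.getAbsolutePath)
try {
  // ... point EventLoggingListener at logDirPath and run the assertions ...
} finally {
  fileSystem.delete(logDirPath, true) // recursive delete
}
```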
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/591#discussion_r12129472

--- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -64,7 +70,8 @@ private[spark] class EventLoggingListener(
   def start() {
     logInfo("Logging events to %s".format(logDir))
     if (shouldCompress) {
-      val codec = conf.get("spark.io.compression.codec", CompressionCodec.DEFAULT_COMPRESSION_CODEC)
--- End diff --

I don't think this was over the line limit before (?). Wouldn't it have failed our style checks?
[GitHub] spark pull request: Args for worker rather than master
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/587
[GitHub] spark pull request: Handle the vals that never used
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/565
[GitHub] spark pull request: [SPARK-1646] Micro-optimisation of ALS
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/568
[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/571#issuecomment-41762289 Build triggered.
[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/571#issuecomment-41762297 Build started.
[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0
Github user techaddict commented on the pull request: https://github.com/apache/spark/pull/571#issuecomment-41761434 @mengxr how do you suggest we resolve the `private[spark]` problem?
[GitHub] spark pull request: Handle the vals that never used
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/565#issuecomment-41761240 Thanks. I've merged this.
[GitHub] spark pull request: Args for worker rather than master
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/587#issuecomment-41761176 Thanks. I've merged this.
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41761129 @andrewor14 if you could do a final test on this just to double check, I think it's good to go.
[GitHub] spark pull request: [SPARK-1646] Micro-optimisation of ALS
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/568#issuecomment-41761125 Merged. Thanks!
[GitHub] spark pull request: Update GradientDescentSuite.scala
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/588#issuecomment-41761073 Merged build started.
[GitHub] spark pull request: Update GradientDescentSuite.scala
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/588#issuecomment-41761067 Merged build triggered.
[GitHub] spark pull request: Update GradientDescentSuite.scala
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/588#issuecomment-41761064 Jenkins, test this please.
[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/571#issuecomment-41761038 @techaddict I don't see any other necessary changes except for the ones using `private[spark]` or `private[streaming]` methods. For consistency, `org.apache.spark.examples` may be better because many examples are already under this package name.
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41759555 Merged build finished. All automated tests passed.
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41759556 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14580/
[GitHub] spark pull request: SPARK-1637: Clean up examples for 1.0
Github user techaddict commented on the pull request: https://github.com/apache/spark/pull/571#issuecomment-41758235 @mengxr so what changes should I make other than the streaming one suggested by @tdas?
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/591#issuecomment-41758189 Merged build finished. All automated tests passed.
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/591#issuecomment-41758190 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14579/
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41758004 Merged build started.
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41758002 Merged build triggered.
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41757854 Updated PR moves unzipping py4j to an earlier phase so that it gets included the first time around. Tested it out and saw it appear in the jar the first time.
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41757036 @sryza I also asked @ahirreddy to look into whether we can just publish a jar to maven central that contains the py4j python side. Then we can just depend on that jar and be done with it and not count on `unzip` being installed. He seemed to think it was possible...
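For reference, if such an artifact existed on Maven Central, depending on it would be a one-liner in the build. The coordinates below are only illustrative: py4j's Java side is published as `net.sf.py4j:py4j`, but a variant that also bundles the Python sources is the hypothetical part.

```scala
// sbt build definition: pull in a py4j jar that (hypothetically) also carries the Python files.
libraryDependencies += "net.sf.py4j" % "py4j" % "0.8.1"
```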
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41756809 Yeah, that's the issue. Looking into the best way to fix it.
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41756737 @andrewor14 So you are saying that if I just download this from a blank build it won't work... but if I happen to build twice it will work. I wonder if the issue might be that the maven-exec-plugin isn't guaranteed to execute before the packaging of the jar itself.
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/591#issuecomment-41756602 Merged build triggered.
[GitHub] spark pull request: Add tests for FileLogger, EventLoggingListener...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/591#issuecomment-41756609 Merged build started.
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/476#issuecomment-41755910

I would be happy to talk more about this after the OSDI deadline. As far as storing the model (or more precisely the counts and samples) as an RDD, I think this really is necessary. The model in this case should be on the order of the size of the data.

Essentially what you want is the ability to join the term-topic counts with the document-topic counts for each token in a given document. Given these two counts tables (along with the background distribution of topics in the entire corpus) you can compute the new topic assignment.

Here is an implementation of the collapsed Gibbs sampler for LDA using GraphX: https://github.com/amplab/graphx/pull/113
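To make the joined-counts update concrete, here is a minimal, non-distributed sketch of the collapsed Gibbs step (all names illustrative): given the three count tables for a single token, the new topic is drawn proportionally to `(n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)`.

```scala
// Illustrative single-token resampling step for collapsed Gibbs LDA.
def sampleTopic(
    docTopicCounts: Array[Int],   // n_dk: topic counts for this document
    termTopicCounts: Array[Int],  // n_kw: topic counts for this term
    topicTotals: Array[Int],      // n_k: topic counts over the whole corpus
    vocabSize: Int,
    alpha: Double,                // document-topic smoothing
    beta: Double,                 // topic-term smoothing
    rng: java.util.Random): Int = {
  val numTopics = topicTotals.length
  val weights = Array.tabulate(numTopics) { k =>
    (docTopicCounts(k) + alpha) * (termTopicCounts(k) + beta) /
      (topicTotals(k) + vocabSize * beta)
  }
  // Draw a topic index proportionally to the unnormalized weights.
  val u = rng.nextDouble() * weights.sum
  var acc = 0.0
  var k = 0
  while (k < numTopics - 1 && { acc += weights(k); acc < u }) { k += 1 }
  k
}
```

In the GraphX formulation linked above, the document-side and term-side count tables roughly live on the two endpoints of each document-term edge, which is what makes this per-token join cheap.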
[GitHub] spark pull request: [SQL] SPARK-1661 - Fix regex_serde test
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/595#issuecomment-41755913 Merged build finished. All automated tests passed.
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12127214

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
--- End diff --

You mean "termId: count"? Yes, it is a common way to do that. But I was just considering the trade-off between statistical efficiency and hardware efficiency. If we combine occurrences of the same term together in one document, it seems that the randomness gets worse. Anyway, I'll try to modify it to use SparseVector.
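For concreteness, the "termId: count" representation under discussion might look like this with MLlib's `SparseVector` (the vocabulary size and counts are made up):

```scala
import org.apache.spark.mllib.linalg.Vectors

val vocabSize = 1000
// A document containing terms {3, 3, 7, 42}: indices are termIds, values are counts.
val doc = Vectors.sparse(vocabSize, Seq((3, 2.0), (7, 1.0), (42, 1.0)))
```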
[GitHub] spark pull request: [SQL] SPARK-1661 - Fix regex_serde test
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/595#issuecomment-41755914 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14578/
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12127051

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala ---
@@ -0,0 +1,169 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import breeze.linalg.{DenseVector => BDV}
+
+import org.apache.spark.{AccumulableParam, Logging, SparkContext}
+import org.apache.spark.mllib.expectation.GibbsSampling
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+
+case class Document(docId: Int, content: Iterable[Int])
+
+case class LDAParams (
+    docCounts: Vector,
+    topicCounts: Vector,
+    docTopicCounts: Array[Vector],
+    topicTermCounts: Array[Vector])
--- End diff --

That makes sense. I think `docTopicCounts` could be sliced easily w.r.t. document partitions, but it's hard to slice `topicTermCounts`. I'll find a way to handle it.
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/476#issuecomment-41755358 Yep, I know @jegonzal from his paper *Parallel Gibbs Sampling*. But I only know the idea of the implementation on GraphLab and could not find the impl in GraphX. It would be great to have the chance to talk with Joseph offline. Besides, I will add a use case for the Reuters dataset and try to fix the issues raised above.
[GitHub] spark pull request: Update GradientDescentSuite.scala
Github user baishuo commented on the pull request: https://github.com/apache/spark/pull/588#issuecomment-41755330 I have made the modifications; thanks to @techaddict and @mengxr.
[GitHub] spark pull request: [WIP][Spark-SQL] Optimize the Constant Folding...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/482#issuecomment-41754201 Understood, thank you @marmbrus. I will add more unit tests based on the current implementation.
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/476#issuecomment-41753636 Also, speaking of @jegonzal, maybe this is a natural first point of integration between MLlib and GraphX - I know GraphX has an implementation of LDA built in, and maybe this is a chance for us to leverage that work.
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/476#issuecomment-41753574

Before I get too deep into this review, I want to step back and think about whether we expect the model in this case to be on the order of the size of the data. I think it is, and if so, we may want to consider representing the model as RDD[DocumentTopicFeatures] and RDD[TopicWordFeatures], similar to what we do with ALS. This may change the algorithm substantially.

Separately, maybe it makes sense to have a concrete use case to work with (the Reuters dataset or something) so that we can evaluate how much memory actually gets used given a reasonably sized corpus. Perhaps @mengxr or @jegonzal has a strong opinion on this.
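A sketch of what that ALS-style factored model might look like (these case classes are illustrative only; nothing like them exists in the PR yet):

```scala
import org.apache.spark.rdd.RDD

// Per-document and per-term topic count tables, each distributed as its own RDD,
// mirroring how ALS keeps user factors and product factors in separate RDDs.
case class DocumentTopicFeatures(docId: Int, topicCounts: Array[Double])
case class TopicWordFeatures(termId: Int, topicCounts: Array[Double])

class DistributedLDAModel(
    val docTopics: RDD[DocumentTopicFeatures],
    val wordTopics: RDD[TopicWordFeatures])
```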
[GitHub] spark pull request: [WIP][Spark-SQL] Optimize the Constant Folding...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/482#issuecomment-41753498

I think the problem with the previous implementation is that the change is much larger and therefore harder to understand. (It took me a while to look at it and find bugs, but there were quite a few and I'm sure I didn't find them all.) A lot of concepts _could_ be added to the expression hierarchy, but the goal of Catalyst is to instead allow self-contained rules to perform optimizations. By keeping these optimization rules self-contained it is much easier to look in one place and reason about their correctness. Making the rule simple while pushing all of the complex logic to many different places in the code is not the goal here. The only reason nullability is modeled in the expression hierarchy at all is because it is a fairly fundamental concept that is part of the schema of the data.

Regarding HiveGenericUdf, I'm not sure how the three-state solution allows us to do any extra folding. The UDF is still a black box and we have no idea how it's going to respond to null inputs.

Regarding your last point about code generation, we don't need three states. We already know whether we need to track a nullable bit when doing code generation with the current two-state implementation. If it is possible to statically determine a value to be null, that will have already been done by optimization rules and we will just have a null literal. All we need is a special rule that generates the AST for `Literal(null, _)`. So, actually, I think the code generation logic is simpler with this implementation.
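As a toy illustration of the "self-contained rule" style described above (these are simplified stand-ins for the real Catalyst types, not the actual implementation):

```scala
// Miniature expression tree with a two-state nullability model.
sealed trait Expression { def nullable: Boolean }
case class Literal(value: Any, nullable: Boolean = true) extends Expression
case class Add(left: Expression, right: Expression) extends Expression {
  def nullable: Boolean = left.nullable || right.nullable
}

// A self-contained optimization rule: anything added to a known null is null,
// so fold the whole expression to a null literal and leave other trees alone.
object NullPropagation {
  def apply(e: Expression): Expression = e match {
    case Add(Literal(null, _), _) => Literal(null)
    case Add(_, Literal(null, _)) => Literal(null)
    case other => other
  }
}
```

The correctness argument for such a rule lives entirely in its pattern matches, which is the point being made: nothing else in the expression hierarchy needs to change.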
[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/582#issuecomment-41753419 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14577/
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41753415 Update - with the latest commits from master I am able to run PySpark on YARN successfully on both CDH and HDP clusters. There is still an issue with the maven build, however. The first build produces a jar that does not include the `py4j/*.py` files, while the second build produces one that does include all needed files. This is because we try to include these python files before unzipping `py4j*zip`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/582#issuecomment-41753418 Merged build finished. All automated tests passed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126271 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala --- @@ -20,15 +20,17 @@ package org.apache.spark.mllib.util import scala.reflect.ClassTag import breeze.linalg.{Vector => BV, SparseVector => BSV, squaredDistance => breezeSquaredDistance} +import breeze.util.Index +import chalk.text.tokenize.JavaWordTokenizer --- End diff -- I think chalk for tokenization is probably a fine choice for starters, but can you say what it buys us over the regular Java StringTokenizer? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
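A rough illustration of the trade-off being asked about, assuming (unverified) that chalk's JavaWordTokenizer splits off punctuation while a plain whitespace tokenizer does not:

```scala
import java.util.StringTokenizer
import scala.collection.mutable.ArrayBuffer

// StringTokenizer splits on whitespace only, so punctuation stays attached.
def whitespaceTokens(text: String): Seq[String] = {
  val st = new StringTokenizer(text)
  val buf = ArrayBuffer.empty[String]
  while (st.hasMoreTokens) buf += st.nextToken()
  buf
}

whitespaceTokens("Topic models (like LDA) need tokens.")
// -> Seq("Topic", "models", "(like", "LDA)", "need", "tokens.")
// A word tokenizer would instead separate "(", "LDA", ")", and "." into
// their own tokens, which matters for vocabulary construction.
```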
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126247 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.expectation + +import java.util.Random + +import breeze.linalg.{DenseVector => BDV, sum} + +import org.apache.spark.Logging +import org.apache.spark.rdd.RDD +import org.apache.spark.mllib.clustering.{Document, LDAParams} +import org.apache.spark.mllib.linalg.{Vector, Vectors} + +/** + * Gibbs sampling from a given dataset and org.apache.spark.mllib.model. + * @param data Dataset, such as corpus. + * @param numOuterIterations Number of outer iteration. + * @param numInnerIterations Number of inner iteration, used in each partition. + * @param docTopicSmoothing Document-topic smoothing. + * @param topicTermSmoothing Topic-term smoothing. + */ +class GibbsSampling( +data: RDD[Document], +numOuterIterations: Int, +numInnerIterations: Int, +docTopicSmoothing: Double, +topicTermSmoothing: Double) + extends Logging with Serializable { + + import GibbsSampling._ + + /** + * Main function of running a Gibbs sampling method. It contains two phases of total Gibbs + * sampling: first is initialization, second is real sampling. + */ + def runGibbsSampling( + initParams: LDAParams, + data: RDD[Document] = data, + numOuterIterations: Int = numOuterIterations, + numInnerIterations: Int = numInnerIterations, + docTopicSmoothing: Double = docTopicSmoothing, + topicTermSmoothing: Double = topicTermSmoothing): LDAParams = { + +val numTerms = initParams.topicTermCounts.head.size +val numDocs = initParams.docCounts.size +val numTopics = initParams.topicCounts.size + +// Construct topic assignment RDD +logInfo("Start initialization") + +val cpInterval = System.getProperty("spark.gibbsSampling.checkPointInterval", "10").toInt +val sc = data.context +val (initialParams, initialChosenTopics) = sampleTermAssignment(initParams, data) + +// Gibbs sampling +val (params, _, _) = Iterator.iterate((sc.accumulable(initialParams), initialChosenTopics, 0)) { + case (lastParams, lastChosenTopics, i) => +logInfo("Start Gibbs sampling") + +val rand = new Random(42 + i * i) +val params = sc.accumulable(LDAParams(numDocs, numTopics, numTerms)) +val chosenTopics = data.zip(lastChosenTopics).map { + case (Document(docId, content), topics) => +content.zip(topics).map { case (term, topic) => + lastParams += (docId, term, topic, -1) + + val chosenTopic = lastParams.localValue.dropOneDistSampler( +docTopicSmoothing, topicTermSmoothing, term, docId, rand) + + lastParams += (docId, term, chosenTopic, 1) + params += (docId, term, chosenTopic, 1) + + chosenTopic +} +}.cache() + +if (i + 1 % cpInterval == 0) { + chosenTopics.checkpoint() +} + +// Trigger a job to collect accumulable LDA parameters. +chosenTopics.count() +lastChosenTopics.unpersist() + +(params, chosenTopics, i + 1) +}.drop(1 + numOuterIterations).next() + +params.value + } + + /** + * Model matrix Phi and Theta are inferred via LDAParams. + */ + def solvePhiAndTheta( + params: LDAParams, + docTopicSmoothing: Double = docTopicSmoothing, + topicTermSmoothing: Double = topicTermSmoothing): (Array[Vector], Array[Vector]) = { --- End diff -- Again, Phi and Theta might be too big. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126191 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.expectation + +import java.util.Random + +import breeze.linalg.{DenseVector => BDV, sum} + +import org.apache.spark.Logging +import org.apache.spark.rdd.RDD +import org.apache.spark.mllib.clustering.{Document, LDAParams} +import org.apache.spark.mllib.linalg.{Vector, Vectors} + +/** + * Gibbs sampling from a given dataset and org.apache.spark.mllib.model. + * @param data Dataset, such as corpus. + * @param numOuterIterations Number of outer iteration. + * @param numInnerIterations Number of inner iteration, used in each partition. + * @param docTopicSmoothing Document-topic smoothing. + * @param topicTermSmoothing Topic-term smoothing. + */ +class GibbsSampling( +data: RDD[Document], +numOuterIterations: Int, +numInnerIterations: Int, +docTopicSmoothing: Double, +topicTermSmoothing: Double) + extends Logging with Serializable { + + import GibbsSampling._ + + /** + * Main function of running a Gibbs sampling method. It contains two phases of total Gibbs + * sampling: first is initialization, second is real sampling. + */ + def runGibbsSampling( + initParams: LDAParams, + data: RDD[Document] = data, + numOuterIterations: Int = numOuterIterations, + numInnerIterations: Int = numInnerIterations, + docTopicSmoothing: Double = docTopicSmoothing, + topicTermSmoothing: Double = topicTermSmoothing): LDAParams = { + +val numTerms = initParams.topicTermCounts.head.size +val numDocs = initParams.docCounts.size +val numTopics = initParams.topicCounts.size + +// Construct topic assignment RDD +logInfo("Start initialization") + +val cpInterval = System.getProperty("spark.gibbsSampling.checkPointInterval", "10").toInt +val sc = data.context +val (initialParams, initialChosenTopics) = sampleTermAssignment(initParams, data) + +// Gibbs sampling +val (params, _, _) = Iterator.iterate((sc.accumulable(initialParams), initialChosenTopics, 0)) { --- End diff -- Why an accumulator and not an .aggregate()? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
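For comparison, a sketch of what the .aggregate() version could look like. data, chosenTopics, numDocs, numTopics, and numTerms are the values from the quoted runGibbsSampling; foldOne is a hypothetical per-record update:

```scala
// seqOp folds one (document, topics) pair into partition-local parameters;
// combOp merges partition results using the existing LDAParams.merge.
val aggregated: LDAParams = data.zip(chosenTopics).aggregate(
  LDAParams(numDocs, numTopics, numTerms))(
  seqOp = (acc, docAndTopics) => foldOne(acc, docAndTopics),
  combOp = (a, b) => a.merge(b))
```

One design difference: the accumulator lets the sampling closure publish updates as a side effect, while aggregate makes the merge explicit in the returned value.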
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126176 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/expectation/GibbsSampling.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.expectation + +import java.util.Random + +import breeze.linalg.{DenseVector => BDV, sum} + +import org.apache.spark.Logging +import org.apache.spark.rdd.RDD +import org.apache.spark.mllib.clustering.{Document, LDAParams} +import org.apache.spark.mllib.linalg.{Vector, Vectors} + +/** + * Gibbs sampling from a given dataset and org.apache.spark.mllib.model. + * @param data Dataset, such as corpus. + * @param numOuterIterations Number of outer iteration. + * @param numInnerIterations Number of inner iteration, used in each partition. + * @param docTopicSmoothing Document-topic smoothing. + * @param topicTermSmoothing Topic-term smoothing. + */ +class GibbsSampling( --- End diff -- Gibbs Sampling is a very useful general-purpose tool to have. Its interface should be something more generic than RDD[Document], and the parameters should be amenable to domains other than text. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
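A hypothetical shape for such a generic interface, purely for discussion:

```scala
import java.util.Random
import org.apache.spark.rdd.RDD

// Parameterize on observation, latent state, and parameter types instead of
// hard-coding RDD[Document]; LDA becomes one instance of the trait.
trait GibbsSampler[Obs, State, Params] {
  /** Draw a new latent state for one observation given current parameters. */
  def sampleOne(obs: Obs, state: State, params: Params, rand: Random): State

  /** Run numIterations sweeps over the data, threading parameters through. */
  def run(data: RDD[Obs], init: Params, numIterations: Int): Params
}
```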
[GitHub] spark pull request: [WIP][Spark-SQL] Optimize the Constant Folding...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/482#issuecomment-41752725 @marmbrus I am wondering whether the previous implementation (changing the nullability states from 2 to 3) is perhaps more convenient in some ways. * The Constant Folding rule stays quite simple: "case e if e.nullable == alwaysNull => Literal(null, e.dataType)" * We may not be able to enumerate the extended expressions (HiveGenericUdf, for example) in catalyst. * Nullability is a very helpful hint in expression codegen:

Nullability | eval code | expression null indicator variable | expression variable
--- | --- | --- | ---
Possibly be null | Yes | Yes | Yes
Always be null | No | Yes | No
Never be null | Yes | No | Yes

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
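A sketch of the three-state design the table describes (names hypothetical); with it, the folding rule really is the quoted one-liner:

```scala
// Three nullability states, matching the rows of the table above.
sealed trait Nullability
case object NeverNull    extends Nullability // eval code, no null indicator
case object PossiblyNull extends Nullability // eval code plus null indicator
case object AlwaysNull   extends Nullability // constant null, no eval code

// Hypothetical folding rule body:
//   case e if e.nullability == AlwaysNull => Literal(null, e.dataType)
```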
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126094 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import java.util.Random + +import breeze.linalg.{DenseVector => BDV} + +import org.apache.spark.{AccumulableParam, Logging, SparkContext} +import org.apache.spark.mllib.expectation.GibbsSampling +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.rdd.RDD + +case class Document(docId: Int, content: Iterable[Int]) + +case class LDAParams ( +docCounts: Vector, +topicCounts: Vector, +docTopicCounts: Array[Vector], +topicTermCounts: Array[Vector]) + extends Serializable { + + def update(docId: Int, term: Int, topic: Int, inc: Int) = { +docCounts.toBreeze(docId) += inc +topicCounts.toBreeze(topic) += inc +docTopicCounts(docId).toBreeze(topic) += inc +topicTermCounts(topic).toBreeze(term) += inc +this + } + + def merge(other: LDAParams) = { +docCounts.toBreeze += other.docCounts.toBreeze +topicCounts.toBreeze += other.topicCounts.toBreeze + +var i = 0 +while (i < docTopicCounts.length) { + docTopicCounts(i).toBreeze += other.docTopicCounts(i).toBreeze --- End diff -- more breeze conversion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126106 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import java.util.Random + +import breeze.linalg.{DenseVector => BDV} + +import org.apache.spark.{AccumulableParam, Logging, SparkContext} +import org.apache.spark.mllib.expectation.GibbsSampling +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.rdd.RDD + +case class Document(docId: Int, content: Iterable[Int]) + +case class LDAParams ( +docCounts: Vector, +topicCounts: Vector, +docTopicCounts: Array[Vector], +topicTermCounts: Array[Vector]) + extends Serializable { + + def update(docId: Int, term: Int, topic: Int, inc: Int) = { +docCounts.toBreeze(docId) += inc +topicCounts.toBreeze(topic) += inc +docTopicCounts(docId).toBreeze(topic) += inc +topicTermCounts(topic).toBreeze(term) += inc +this + } + + def merge(other: LDAParams) = { +docCounts.toBreeze += other.docCounts.toBreeze +topicCounts.toBreeze += other.topicCounts.toBreeze + +var i = 0 +while (i < docTopicCounts.length) { + docTopicCounts(i).toBreeze += other.docTopicCounts(i).toBreeze + i += 1 +} + +i = 0 +while (i < topicTermCounts.length) { + topicTermCounts(i).toBreeze += other.topicTermCounts(i).toBreeze + i += 1 +} +this + } + + /** + * This function used for computing the new distribution after drop one from current document, + * which is a really essential part of Gibbs sampling for LDA, you can refer to the paper: + * Parameter estimation for text analysis --- End diff -- Link to paper, please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SQL] SPARK-1661 - Fix regex_serde test
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/595#issuecomment-41752463 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SQL] SPARK-1661 - Fix regex_serde test
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/595#issuecomment-41752460 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1674] fix interrupted system call error...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/594 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126093 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import java.util.Random + +import breeze.linalg.{DenseVector => BDV} + +import org.apache.spark.{AccumulableParam, Logging, SparkContext} +import org.apache.spark.mllib.expectation.GibbsSampling +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.rdd.RDD + +case class Document(docId: Int, content: Iterable[Int]) + +case class LDAParams ( +docCounts: Vector, +topicCounts: Vector, +docTopicCounts: Array[Vector], +topicTermCounts: Array[Vector]) + extends Serializable { + + def update(docId: Int, term: Int, topic: Int, inc: Int) = { +docCounts.toBreeze(docId) += inc --- End diff -- Doing the breeze conversion on every update seems inefficient. These variables should be private and created as breeze variables at initialization; only the user-facing APIs need to be Vector, Array[Vector], etc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
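A sketch of the suggested fix: convert once at construction, mutate the cached breeze vectors, and convert back only at the user-facing boundary. Field and class names are illustrative:

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

class LDACounts(numDocs: Int, numTopics: Int) extends Serializable {
  // Created as breeze vectors once; no per-update toBreeze conversion.
  private val docCounts = BDV.zeros[Double](numDocs)
  private val topicCounts = BDV.zeros[Double](numTopics)

  def update(docId: Int, topic: Int, inc: Int): Unit = {
    docCounts(docId) += inc
    topicCounts(topic) += inc
  }

  // Only the public accessors pay the conversion cost.
  def docCountsVector: Vector = Vectors.dense(docCounts.toArray)
}
```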
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126088 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import java.util.Random + +import breeze.linalg.{DenseVector => BDV} + +import org.apache.spark.{AccumulableParam, Logging, SparkContext} +import org.apache.spark.mllib.expectation.GibbsSampling +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.rdd.RDD + +case class Document(docId: Int, content: Iterable[Int]) + +case class LDAParams ( +docCounts: Vector, +topicCounts: Vector, +docTopicCounts: Array[Vector], +topicTermCounts: Array[Vector]) + extends Serializable { + + def update(docId: Int, term: Int, topic: Int, inc: Int) = { +docCounts.toBreeze(docId) += inc +topicCounts.toBreeze(topic) += inc +docTopicCounts(docId).toBreeze(topic) += inc --- End diff -- Again, I think in this case the *model* might be really big - e.g. a billion documents in hundreds of topics. Or for the term side, millions of words in a vocabulary and hundreds of topics. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
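(Back-of-envelope, with illustrative numbers: 10^9 documents × 100 topics × 8 bytes per double is roughly 800 GB for docTopicCounts alone, which is why an Array[Vector] held on the driver cannot work at that scale.)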
[GitHub] spark pull request: [SQL] SPARK-1661 - Fix regex_serde test
GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/595 [SQL] SPARK-1661 - Fix regex_serde test The JIRA in question is actually reporting a bug with Shark, but I wanted to make sure Spark SQL did not have similar problems. This fixes a bug in our parsing code that was preventing the test from executing, but it looks like the RegexSerDe is working in Spark SQL. You can merge this pull request into a Git repository by running: $ git pull https://github.com/marmbrus/spark fixRegexSerdeTest Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/595.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #595 commit efa64024e2d8b3fad23f9773e2cbff88d9884048 Author: Michael Armbrust Date: 2014-04-30T01:34:08Z Fix Hive serde_regex test. commit a4dc6123e0d2bf030815c9a0528ac294141b8a24 Author: Michael Armbrust Date: 2014-04-30T01:34:56Z Add files created by hive to gitignore. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126007 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import java.util.Random + +import breeze.linalg.{DenseVector => BDV} + +import org.apache.spark.{AccumulableParam, Logging, SparkContext} +import org.apache.spark.mllib.expectation.GibbsSampling +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.rdd.RDD + +case class Document(docId: Int, content: Iterable[Int]) + +case class LDAParams ( --- End diff -- Also - I don't think it should be a case class. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12126002 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import java.util.Random + +import breeze.linalg.{DenseVector => BDV} + +import org.apache.spark.{AccumulableParam, Logging, SparkContext} +import org.apache.spark.mllib.expectation.GibbsSampling +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.rdd.RDD + +case class Document(docId: Int, content: Iterable[Int]) + +case class LDAParams ( --- End diff -- This should be called just "LDA" since it's the class that fits the model. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
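A hypothetical sketch of the split these comments point toward: an LDA class that owns the hyperparameters and runs the fit, returning a separate model object rather than a parameter case class:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

class LDA(
    numTopics: Int,
    docTopicSmoothing: Double,
    topicTermSmoothing: Double) {
  // Fitting lives here; the Gibbs sampler would be invoked inside run().
  def run(data: RDD[Document]): LDAModel = ???
}

// Plain class (not a case class), holding only the fitted state.
class LDAModel(
    val docTopicCounts: Array[Vector],
    val topicTermCounts: Array[Vector])
```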
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12125991 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import java.util.Random + +import breeze.linalg.{DenseVector => BDV} + +import org.apache.spark.{AccumulableParam, Logging, SparkContext} +import org.apache.spark.mllib.expectation.GibbsSampling +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.rdd.RDD + +case class Document(docId: Int, content: Iterable[Int]) --- End diff -- Hmm... if documents are just an ID and a list of token IDs, maybe something like a SparseVector is a better representation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
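For illustration, what the SparseVector representation could look like with MLlib's existing Vectors.sparse; the vocabulary size and term ids are made up:

```scala
import org.apache.spark.mllib.linalg.Vectors

val vocabSize = 100000
val termIndices = Array(3, 17, 4202)     // distinct term ids in the document
val termCounts  = Array(2.0, 1.0, 5.0)   // counts for those terms
val doc = Vectors.sparse(vocabSize, termIndices, termCounts)
```

One consequence of this representation: token order is lost, which is fine for LDA's bag-of-words assumption but would not suit order-sensitive models.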
[GitHub] spark pull request: JIRA issue: [SPARK-1405] Gibbs sampling based ...
Github user etrain commented on a diff in the pull request: https://github.com/apache/spark/pull/476#discussion_r12125916 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,169 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import java.util.Random + +import breeze.linalg.{DenseVector => BDV} + +import org.apache.spark.{AccumulableParam, Logging, SparkContext} +import org.apache.spark.mllib.expectation.GibbsSampling +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.rdd.RDD + +case class Document(docId: Int, content: Iterable[Int]) + +case class LDAParams ( +docCounts: Vector, +topicCounts: Vector, +docTopicCounts: Array[Vector], +topicTermCounts: Array[Vector]) --- End diff -- I expect that this will be *really* big - maybe the last two variables should be RDDs - similar to what we do with ALS. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/582#issuecomment-41751530 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/582#issuecomment-41751535 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1157][MLlib] Bug fix: lossHistory shoul...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/582#issuecomment-41751464 Make sense from the inverse of hessian point of view. Just remove it! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1672][WIP] Separate partitioning in ALS
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/593#issuecomment-41751187 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14576/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1672][WIP] Separate partitioning in ALS
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/593#issuecomment-41751186 Merged build finished. All automated tests passed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [WIP][Spark-SQL] Optimize the Constant Folding...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/482#discussion_r12125494 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala --- @@ -223,6 +222,11 @@ abstract class Expression extends TreeNode[Expression] { } } +/** + * Root class for rewritten 2 operands UDF expression. By default, we assume it produces Null if + * either one of its operands is null. Exceptional case requires to update the optimization rule + * at [[optimizer.ConstantFolding ConstantFolding]] + */ abstract class BinaryExpression extends Expression with trees.BinaryNode[Expression] { --- End diff -- That's true; adding documented restrictions for BinaryExpression / UnaryExpression is quite confusing and error-prone. Besides BinaryArithmetic, the rewritten UDFs RLike / Cast / BinaryComparison etc. will be considered, too. Probably enumerating all of the expression types in this rule makes more sense. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
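A sketch of what enumerating the expression types could look like inside the same rule; the cases shown are an illustrative subset, not an exhaustive list:

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object EnumeratedNullFoldingSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan => q transformExpressionsUp {
      // Spell out each null-propagating expression type explicitly rather
      // than documenting a contract on all of BinaryExpression.
      case e @ Cast(Literal(null, _), _)     => Literal(null, e.dataType)
      case e @ LessThan(Literal(null, _), _) => Literal(null, e.dataType)
      case e @ LessThan(_, Literal(null, _)) => Literal(null, e.dataType)
    }
  }
}
```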
[GitHub] spark pull request: [SPARK-1674] fix interrupted system call error...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/594#issuecomment-41750700 Merged this, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [WIP][Spark-SQL] Optimize the Constant Folding...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/482#discussion_r12125319 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -87,6 +88,56 @@ object ColumnPruning extends Rule[LogicalPlan] { /** * Replaces [[catalyst.expressions.Expression Expressions]] that can be statically evaluated with + * equivalent [[catalyst.expressions.Literal Literal]] values. This rule is more specific with + * Null value propagation from bottom to top of the expression tree. + */ +object NullPropagation extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case q: LogicalPlan => q transformExpressionsUp { + // Skip redundant folding of literals. + case l: Literal => l --- End diff -- I was thinking that putting the literal matching at the beginning might help avoid further pattern matching by the rest of the rules. Just a tiny performance optimization for Literal. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1594][MLLIB] Cleaning up MLlib APIs and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/524#issuecomment-41750423 Merged build finished. All automated tests passed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1594][MLLIB] Cleaning up MLlib APIs and...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/524#issuecomment-41750424 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14575/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1674] fix interrupted system call error...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/594#issuecomment-41750082 Merged build finished. All automated tests passed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1674] fix interrupted system call error...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/594#issuecomment-41750083 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14574/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1544 Add support for deep decision trees...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/475#issuecomment-41749478 @mengxr There should be one, but I am not sure. I will get back to you on this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1544 Add support for deep decision trees...
GitHub user manishamde reopened a pull request: https://github.com/apache/spark/pull/475 SPARK-1544 Add support for deep decision trees. @etrain and I came up with a PR for arbitrarily deep decision trees at the cost of multiple passes over the data at deep tree levels. To summarize: 1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver). 2) Using that information, we calculate two things - the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth. cc: @atalwalkar, @hirakendu, @mengxr You can merge this pull request into a Git repository by running: $ git pull https://github.com/manishamde/spark deep_tree Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/475.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #475 commit 50b143a4385f209fbc1793f3e03134cab3ab9583 Author: Manish Amde Date: 2014-04-20T20:33:03Z adding support for very deep trees commit abc5a23bf80d792a345d723b44bff3ee217cd5ac Author: Evan Sparks Date: 2014-04-22T01:41:36Z Parameterizing max memory. commit 2f6072c12a1466d783da258d4aa1bde789e7e875 Author: manishamde Date: 2014-04-22T03:43:47Z Merge pull request #5 from etrain/deep_tree Parameterizing max memory. commit 2f1e093c5187a1ed532f9c19b25f8a2a6a46e27a Author: Manish Amde Date: 2014-04-22T03:49:46Z minor: added doc for maxMemory parameter commit 02877721328a560f210a7906061108ce5dd4bbbe Author: Evan Sparks Date: 2014-04-22T18:13:27Z Fixing scalastyle issue. commit fecf89a51d6efc9e2ff06e700338ea944a4dd580 Author: manishamde Date: 2014-04-22T18:15:57Z Merge pull request #6 from etrain/deep_tree Fixing scalastyle issue. commit 719d0098bb08b50e523cec3e388115d5a206512b Author: Manish Amde Date: 2014-04-24T00:04:05Z updating user documentation commit 9dbdabeeacc5fe5e0f1a729ce1ed8ab6ff399000 Author: Manish Amde Date: 2014-04-29T21:43:19Z merge from master commit 15171550fe83e42fcb707744c9035ed540fb78d1 Author: Manish Amde Date: 2014-04-29T21:45:34Z updated documentation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
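A back-of-envelope version of the two derived quantities described above, with made-up sizes (bytesPerNode stands in for the aggregate statistics each node needs):

```scala
val maxMemoryBytes = 128L * 1024 * 1024  // memory the user reserves per worker
val bytesPerNode   = 64L * 1024          // hypothetical stats size per node

// 1) Nodes at depth d is 2^d, so the usual level-wise training is capped at:
val maxNodesInParallel = maxMemoryBytes / bytesPerNode              // 2048
val maxParallelDepth =
  (math.log(maxNodesInParallel.toDouble) / math.log(2.0)).toInt     // 11

// 2) Past that depth, nodes are trained in groups of this size per pass:
val groupSize = maxNodesInParallel
```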
[GitHub] spark pull request: SPARK-1544 Add support for deep decision trees...
Github user manishamde closed the pull request at: https://github.com/apache/spark/pull/475 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1646] Micro-optimisation of ALS
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/568#issuecomment-41749414 LGTM. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41749231 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14573/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1004. PySpark on YARN
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/30#issuecomment-41749230 Merged build finished. All automated tests passed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---