[GitHub] spark pull request: SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` does...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4193#issuecomment-72318533
  
  [Test build #26463 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26463/consoleFull) for PR 4193 at commit [`5b33f66`](https://github.com/apache/spark/commit/5b33f66424155bc4d3d58315a4ef2168177b802f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class IsotonicRegressionModel (`






[GitHub] spark pull request: SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` does...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4193#issuecomment-72318539
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26463/
Test PASSed.





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72319231
  
  [Test build #26464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26464/consoleFull) for PR 3976 at commit [`d60bc60`](https://github.com/apache/spark/commit/d60bc6069cf65637622472ef1cd2715df53c).
 * This patch merges cleanly.





[GitHub] spark pull request: [Minor][SQL] Little refactor DataFrame related...

2015-01-31 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/4298#issuecomment-72319354
  
I think we can relax the constraint of `toDataFrame`, so that we can directly 
create a DataFrame from an RDD of tuples using only a subset of its columns, 
with no need to call an extra `select`.
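
For illustration, a rough sketch of the difference (the relaxed call at the end is the proposal, not current behavior):

~~~
// Assumes a SQLContext's implicits are in scope, as in the 1.3-era API.
val rdd = sc.parallelize(Seq((1, "a", true), (2, "b", false)))

// Today: the column list must cover every tuple field, so projecting a
// subset takes an extra step.
val df = rdd.toDataFrame("id", "name", "flag").select("id", "name")

// Proposed: name the subset directly, no extra select.
// val df2 = rdd.toDataFrame("id", "name")
~~~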





[GitHub] spark pull request: [Minor][SQL] Little refactor DataFrame related...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4298#issuecomment-72319380
  
  [Test build #26465 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26465/consoleFull) for PR 4298 at commit [`f36efb5`](https://github.com/apache/spark/commit/f36efb56b40fd9e191587ed79d842efdb2358b39).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-31 Thread dibbhatt
Github user dibbhatt commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23889636
  
--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka
+
+import scala.reflect.{classTag, ClassTag}
+
+import org.apache.spark.{Logging, Partition, SparkContext, SparkException, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.NextIterator
+
+import java.util.Properties
+import kafka.api.{FetchRequestBuilder, FetchResponse}
+import kafka.common.{ErrorMapping, TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+import kafka.message.{MessageAndMetadata, MessageAndOffset}
+import kafka.serializer.Decoder
+import kafka.utils.VerifiableProperties
+
+/**
+ * A batch-oriented interface for consuming from Kafka.
+ * Starting and ending offsets are specified in advance,
+ * so that you can control exactly-once semantics.
+ * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+ * configuration parameters</a>.
+ *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+ *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+ * @param batch Each KafkaRDDPartition in the batch corresponds to a
+ *   range of offsets for a given Kafka topic/partition
+ * @param messageHandler function for translating each message into the desired type
+ */
+private[spark]
+class KafkaRDD[
+  K: ClassTag,
+  V: ClassTag,
+  U <: Decoder[_]: ClassTag,
+  T <: Decoder[_]: ClassTag,
+  R: ClassTag] private[spark] (
+    sc: SparkContext,
+    kafkaParams: Map[String, String],
+    private[spark] val batch: Array[KafkaRDDPartition],
+    messageHandler: MessageAndMetadata[K, V] => R
+  ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges {
+
+  def offsetRanges: Array[OffsetRange] = batch.asInstanceOf[Array[OffsetRange]]
+
+  override def getPartitions: Array[Partition] = batch.asInstanceOf[Array[Partition]]
+
+  override def getPreferredLocations(thePart: Partition): Seq[String] = {
+    val part = thePart.asInstanceOf[KafkaRDDPartition]
+    // TODO is additional hostname resolution necessary here
+    Seq(part.host)
+  }
+
+  private def errBeginAfterEnd(part: KafkaRDDPartition): String =
+    s"Beginning offset ${part.fromOffset} is after the ending offset ${part.untilOffset} " +
+      s"for topic ${part.topic} partition ${part.partition}. " +
+      "You either provided an invalid fromOffset, or the Kafka topic has been damaged"
+
+  private def errRanOutBeforeEnd(part: KafkaRDDPartition): String =
+    s"Ran out of messages before reaching ending offset ${part.untilOffset} " +
+    s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
+    " This should not happen, and indicates that messages may have been lost"
+
+  private def errOvershotEnd(itemOffset: Long, part: KafkaRDDPartition): String =
+    s"Got ${itemOffset} > ending offset ${part.untilOffset} " +
+    s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
+    " This should not happen, and indicates a message may have been skipped"
+
+  override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
+    val part = thePart.asInstanceOf[KafkaRDDPartition]
+    assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
+    if (part.fromOffset == part.untilOffset) {
+      log.warn(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
+        s"skipping ${part.topic} ${part.partition}")
+      Iterator.empty
+    } else {
+      new KafkaRDDIterator(part, context)
+    }
+  }
+
+  private class KafkaRDDIterator(
+      part: 

[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-31 Thread lianhuiwang
Github user lianhuiwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/4292#discussion_r2348
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -820,7 +822,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
       kClass: Class[K],
       vClass: Class[V]): RDD[(K, V)] = {
     assertNotStopped()
-    new NewHadoopRDD(this, fClass, kClass, vClass, conf)
+    // Add necessary security credentials to the JobConf. Required to access secure HDFS.
+    val jconf = new JobConf(conf)
+    SparkHadoopUtil.get.addCredentials(jconf)
--- End diff --

If the mode is not YARN, `SparkHadoopUtil.addCredentials` does not do anything, 
so this change does not solve the problem for non-YARN modes.
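
For context, a simplified paraphrase of the two implementations as I read the Spark 1.x sources (not a verbatim quote):

~~~
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.security.UserGroupInformation

// Base implementation: a no-op, so non-YARN deployments gain nothing here.
class SparkHadoopUtil {
  def addCredentials(conf: JobConf): Unit = {}
}

// YARN override: merges the current user's tokens into the JobConf.
class YarnSparkHadoopUtil extends SparkHadoopUtil {
  override def addCredentials(conf: JobConf): Unit = {
    val jobCreds = conf.getCredentials()
    jobCreds.mergeAll(UserGroupInformation.getCurrentUser().getCredentials())
  }
}
~~~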





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-31 Thread dibbhatt
Github user dibbhatt commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23889938
  
--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala ---
(quotes the same KafkaRDD.scala hunk shown in full above)

[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...

2015-01-31 Thread lianhuiwang
Github user lianhuiwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/4155#discussion_r23888903
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -808,6 +810,7 @@ class DAGScheduler(
     // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
     // event.
     stage.latestInfo = StageInfo.fromStage(stage, Some(partitionsToCompute.size))
+    outputCommitCoordinator.stageStart(stage.id)
--- End diff --

Yes, I agree with @vanzin's opinion.





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72321243
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26464/
Test FAILed.





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72321234
  
  [Test build #26464 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26464/consoleFull) for PR 3976 at commit [`d60bc60`](https://github.com/apache/spark/commit/d60bc6069cf65637622472ef1cd2715df53c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [Minor][SQL] Little refactor DataFrame related...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4298#issuecomment-72321964
  
  [Test build #26465 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26465/consoleFull) for PR 4298 at commit [`f36efb5`](https://github.com/apache/spark/commit/f36efb56b40fd9e191587ed79d842efdb2358b39).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `val elem = s"array (class $`
  * `val elem = s"externalizable object (class $`
  * `val elem = s"object (class $`
  * `  implicit class ObjectStreamClassMethods(val desc: ObjectStreamClass) extends AnyVal `
  * `class IsotonicRegressionModel (`






[GitHub] spark pull request: [Minor][SQL] Little refactor DataFrame related...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4298#issuecomment-72321965
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26465/
Test PASSed.





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-31 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23889820
  
--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala ---
(quotes the same KafkaRDD.scala hunk shown in full above)

[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread lianhuiwang
Github user lianhuiwang commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72323941
  
@JoshRosen can you help me? I have now added a unit test for a Python application 
in yarn-cluster mode, but it fails. I think the reason is that the PySpark 
environment is not set up. Can you tell me how to set PySpark's environment in 
my Java unit test? Thanks.





[GitHub] spark pull request: [Minor][SQL] Little refactor DataFrame related...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4298#issuecomment-72317547
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26461/
Test PASSed.





[GitHub] spark pull request: [Minor][SQL] Little refactor DataFrame related...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4298#issuecomment-72317545
  
  [Test build #26461 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26461/consoleFull) for PR 4298 at commit [`2c9f370`](https://github.com/apache/spark/commit/2c9f370b312d241e6c72abe812614ac1590a032f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72317452
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26462/
Test FAILed.





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72317446
  
  [Test build #26462 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26462/consoleFull) for PR 3976 at commit [`5b30064`](https://github.com/apache/spark/commit/5b300648fe53d9de604e8afce7580fddfe6bbaef).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...

2015-01-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3519





[GitHub] spark pull request: [SPARK-3975] Added support for BlockMatrix add...

2015-01-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4274#issuecomment-72309957
  
LGTM. Merged into master. Thanks!





[GitHub] spark pull request: [SPARK-3975] Added support for BlockMatrix add...

2015-01-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4274





[GitHub] spark pull request: SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` does...

2015-01-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4193#discussion_r23887726
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -152,7 +152,7 @@ class RowMatrix(
    * storing the right singular vectors, is computed via matrix multiplication as
    * U = A * (V * S^-1^), if requested by user. The actual method to use is determined
    * automatically based on the cost:
-   *  - If n is small (n &lt; 100) or k is large compared with n (k > n / 2), we compute the Gramian
+   *  - If n is small (n &lt; 100) or k is large compared with n (k &gt; n / 2), we compute the Gramian
--- End diff --

Is this line too wide?





[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...

2015-01-31 Thread jacek-lewandowski
Github user jacek-lewandowski commented on the pull request:

https://github.com/apache/spark/pull/4220#issuecomment-72309821
  
Agreed, I'll make the changes and add the clarifying comments.





[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...

2015-01-31 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/4294#discussion_r23887546
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala ---
@@ -28,10 +32,10 @@ private[sql] class DefaultSource extends RelationProvider with SchemaRelationPro
   override def createRelation(
       sqlContext: SQLContext,
       parameters: Map[String, String]): BaseRelation = {
-    val fileName = parameters.getOrElse("path", sys.error("Option 'path' not specified"))
+    val path = parameters.getOrElse("path", sys.error("Option 'path' not specified"))
--- End diff --

I think we should define common data source options like `path` in a single 
place. But we can leave it to another PR.
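
A sketch of the kind of centralization being suggested (the object and field names are illustrative, not from the codebase):

~~~
// Hypothetical single home for common data source option keys.
private[sql] object DataSourceOptions {
  val Path = "path"
}

// Call sites would then read:
// parameters.getOrElse(DataSourceOptions.Path,
//   sys.error(s"Option '${DataSourceOptions.Path}' not specified"))
~~~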





[GitHub] spark pull request: [MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square...

2015-01-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1484#discussion_r23887545
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.feature
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors, Vector}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.stat.Statistics
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: Experimental ::
+ * Chi Squared selector model.
+ *
+ * @param indices list of indices to select (filter)
+ */
+@Experimental
+class ChiSqSelectorModel(indices: Array[Int]) extends VectorTransformer {
+  /**
+   * Applies transformation on a vector.
+   *
+   * @param vector vector to be transformed.
+   * @return transformed vector.
+   */
+  override def transform(vector: Vector): Vector = {
+    Compress(vector, indices)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Creates a ChiSquared feature selector.
+ * @param numTopFeatures number of features that selector will select
+ *   (ordered by statistic value descending)
+ */
+@Experimental
+class ChiSqSelector (val numTopFeatures: Int) {
+
+  /**
+   * Returns a ChiSquared feature selector.
+   *
+   * @param data data used to compute the Chi Squared statistic.
+   */
+  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+    val indices = Statistics.chiSqTest(data)
+      .zipWithIndex.sortBy { case (res, _) => -res.statistic }
+      .take(numTopFeatures)
+      .map { case (_, indices) => indices }
+    new ChiSqSelectorModel(indices)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Filters features in a given vector
+ */
+@Experimental
+object Compress {
+  /**
+   * Returns a vector with features filtered.
+   * Preserves the order of filtered features the same as their indices are stored.
+   * @param features vector
+   * @param filterIndices indices of features to filter
+   */
+  def apply(features: Vector, filterIndices: Array[Int]): Vector = {
+    features match {
+      case SparseVector(size, indices, values) =>
+        val filterMap = filterIndices.zipWithIndex.toMap
--- End diff --

Please see my new inline comments about the constructor. Basically, if we 
expose the constructor, we should check the ordering of `indices`.
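
For instance, a minimal guard of the sort being suggested, assuming ascending order is the intended invariant (illustrative, not from the PR):

~~~
// Fail fast if a caller constructs the model with unsorted indices.
require(indices.sameElements(indices.sorted),
  "ChiSqSelectorModel requires indices to be sorted in ascending order")
~~~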





[GitHub] spark pull request: [SPARK-5307] Add a config option for Serializa...

2015-01-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4297





[GitHub] spark pull request: [SPARK-5307] Add a config option for Serializa...

2015-01-31 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4297#issuecomment-72309051
  
cc @pwendell we can add this to the release notes. If the debugger 
malfunctions, it can be turned off with this flag.
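
If I remember the patch correctly, the opt-out is a boolean conf key along these lines; treat the exact name as an assumption and check the merged diff:

~~~
import org.apache.spark.SparkConf

// Assumed key name from SPARK-5307; verify against the merged patch.
val conf = new SparkConf().set("spark.serializer.extraDebugInfo", "false")
~~~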





[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3519#issuecomment-72309076
  
  [Test build #26458 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26458/consoleFull) for PR 3519 at commit [`5a54ea4`](https://github.com/apache/spark/commit/5a54ea45334b9ef1273a7303be0ef20b97896c92).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class IsotonicRegressionModel (`






[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3519#issuecomment-72309078
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26458/
Test PASSed.





[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...

2015-01-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/3519#issuecomment-72309894
  
LGTM. Merged into master. Thanks!!





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23887814
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
--- End diff --

Should each model have its own version? For example, we might add some 
statistics to LRModel and save them during model export. If the version is 
global, we might have trouble loading it back.

With the new DataFrame API, this could be done by

~~~
sc.parallelize(Seq((clazz, version))).toDataFrame("clazz", "version")
~~~

and hence we don't need the case class, and it is easier to use `"class"` 
instead of `clazz`.

For the JSON format, we can use json4s directly and save as `RDD[String]`. 
The code will be cleaner and have less dependency.
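
A rough sketch of that json4s route (assuming json4s-jackson on the classpath; `Exportable.latestVersion` is from this PR):

~~~
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Render the metadata as one JSON string and save it as a single-record
// text file; no case class or SQLContext needed.
val metadataJson: String = compact(render(
  ("class" -> this.getClass.getName) ~ ("version" -> Exportable.latestVersion)))
sc.parallelize(Seq(metadataJson), 1).saveAsTextFile(path + "/metadata")
~~~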





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23887815
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
+    val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+    metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+    // Create Parquet data.
+    val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Load JSON metadata.
+    val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

Same here. We can load it as RDD[String] and use json4s to parse it. The 
question is whether we want to treat metadata as a single-record table. It 
might have nested fields, which doesn't look like a table to me.
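
A corresponding parse sketch with json4s (same assumptions as above; the field names follow the PR's `Metadata` case class):

~~~
import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats = DefaultFormats
// Read the single-record metadata file back and pull out its fields.
val json = parse(sc.textFile(path + "/metadata").first())
val clazz = (json \ "clazz").extract[String]
val version = (json \ "version").extract[String]
~~~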





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23887813
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
--- End diff --

`import sqlContext._` is no longer needed due to recent API change. 
`implicit val sqlContext = new SQLContext(sc)` should work.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23887816
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
+    val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+    metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+    // Create Parquet data.
+    val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Load JSON metadata.
+    val metadataRDD = sqlContext.jsonFile(path + "/metadata")
+    val metadataArray = metadataRDD.select("clazz", "version").take(1)
+    assert(metadataArray.size == 1,
+      s"Unable to load LogisticRegressionModel metadata from: ${path + "/metadata"}")
+    metadataArray(0) match {
--- End diff --

The loading part could be more generic. For each model and each version, we 
need to maintain a loader that reads saved models into the current version, or 
declares them incompatible. For example,

~~~
object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
  def load(sc: SparkContext, path: String): LogisticRegressionModel = {
    val metadata = ...
    val importer = Importers.get(version)
    importer.load(sc, path)
  }
  object Importers {
    def get(version: String): Importer[LogisticRegressionModel] = {
      version match {
        case "1.0" => new V1()
        case "2.0" => ...
        case _ => throw new Exception(...)
      }
    }
    class V1 extends Importer[LogisticRegressionModel] {
      ...
    }
  }
}
~~~





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-31 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23890348
  
--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala ---
(quotes the same KafkaRDD.scala hunk shown in full above)

[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-31 Thread dibbhatt
Github user dibbhatt commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23890423
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala 
---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka
+
+import scala.reflect.{classTag, ClassTag}
+
+import org.apache.spark.{Logging, Partition, SparkContext, SparkException, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.NextIterator
+
+import java.util.Properties
+import kafka.api.{FetchRequestBuilder, FetchResponse}
+import kafka.common.{ErrorMapping, TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+import kafka.message.{MessageAndMetadata, MessageAndOffset}
+import kafka.serializer.Decoder
+import kafka.utils.VerifiableProperties
+
+/**
+ * A batch-oriented interface for consuming from Kafka.
+ * Starting and ending offsets are specified in advance,
+ * so that you can control exactly-once semantics.
+ * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+ * configuration parameters</a>.
+ *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+ *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+ * @param batch Each KafkaRDDPartition in the batch corresponds to a
+ *   range of offsets for a given Kafka topic/partition
+ * @param messageHandler function for translating each message into the desired type
+ */
+private[spark]
+class KafkaRDD[
+  K: ClassTag,
+  V: ClassTag,
+  U <: Decoder[_]: ClassTag,
+  T <: Decoder[_]: ClassTag,
+  R: ClassTag] private[spark] (
+    sc: SparkContext,
+    kafkaParams: Map[String, String],
+    private[spark] val batch: Array[KafkaRDDPartition],
+    messageHandler: MessageAndMetadata[K, V] => R
+  ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges {
+
+  def offsetRanges: Array[OffsetRange] = batch.asInstanceOf[Array[OffsetRange]]
+
+  override def getPartitions: Array[Partition] = batch.asInstanceOf[Array[Partition]]
+
+  override def getPreferredLocations(thePart: Partition): Seq[String] = {
+    val part = thePart.asInstanceOf[KafkaRDDPartition]
+    // TODO is additional hostname resolution necessary here
+    Seq(part.host)
+  }
+
+  private def errBeginAfterEnd(part: KafkaRDDPartition): String =
+    s"Beginning offset ${part.fromOffset} is after the ending offset ${part.untilOffset} " +
+      s"for topic ${part.topic} partition ${part.partition}. " +
+      "You either provided an invalid fromOffset, or the Kafka topic has been damaged"
+
+  private def errRanOutBeforeEnd(part: KafkaRDDPartition): String =
+    s"Ran out of messages before reaching ending offset ${part.untilOffset} " +
+    s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
+    " This should not happen, and indicates that messages may have been lost"
+
+  private def errOvershotEnd(itemOffset: Long, part: KafkaRDDPartition): String =
+    s"Got ${itemOffset} > ending offset ${part.untilOffset} " +
+    s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
+    " This should not happen, and indicates a message may have been skipped"
+
+  override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
+    val part = thePart.asInstanceOf[KafkaRDDPartition]
+    assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
+    if (part.fromOffset == part.untilOffset) {
+      log.warn(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
+        s"skipping ${part.topic} ${part.partition}")
+      Iterator.empty
+    } else {
+      new KafkaRDDIterator(part, context)
+    }
+  }
+
+  private class KafkaRDDIterator(
+      part:
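
For readers following the thread, a self-contained sketch of the guard logic in
`compute()` above; `PartitionRange` is a simplified stand-in for
`KafkaRDDPartition`, not the PR's API:

    case class PartitionRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

    def messages(part: PartitionRange): Iterator[Long] = {
      // A range that ends before it begins is a caller error, not an empty result.
      assert(part.fromOffset <= part.untilOffset,
        s"Beginning offset ${part.fromOffset} is after the ending offset ${part.untilOffset}")
      if (part.fromOffset == part.untilOffset) {
        Iterator.empty  // empty range: log and skip, as the PR does
      } else {
        (part.fromOffset until part.untilOffset).iterator  // stand-in for the Kafka fetch
      }
    }

Because both endpoints are fixed before the job runs, recomputing a failed
partition replays exactly the same offsets, which is the building block for the
exactly-once semantics described in the scaladoc.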

[GitHub] spark pull request: SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` does...

2015-01-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4193





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-01-31 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23890594
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala 
---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka
+
+import scala.reflect.{classTag, ClassTag}
+
+import org.apache.spark.{Logging, Partition, SparkContext, SparkException, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.NextIterator
+
+import java.util.Properties
+import kafka.api.{FetchRequestBuilder, FetchResponse}
+import kafka.common.{ErrorMapping, TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+import kafka.message.{MessageAndMetadata, MessageAndOffset}
+import kafka.serializer.Decoder
+import kafka.utils.VerifiableProperties
+
+/**
+ * A batch-oriented interface for consuming from Kafka.
+ * Starting and ending offsets are specified in advance,
+ * so that you can control exactly-once semantics.
+ * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+ * configuration parameters</a>.
+ *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+ *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+ * @param batch Each KafkaRDDPartition in the batch corresponds to a
+ *   range of offsets for a given Kafka topic/partition
+ * @param messageHandler function for translating each message into the desired type
+ */
+private[spark]
+class KafkaRDD[
+  K: ClassTag,
+  V: ClassTag,
+  U <: Decoder[_]: ClassTag,
+  T <: Decoder[_]: ClassTag,
+  R: ClassTag] private[spark] (
+    sc: SparkContext,
+    kafkaParams: Map[String, String],
+    private[spark] val batch: Array[KafkaRDDPartition],
+    messageHandler: MessageAndMetadata[K, V] => R
+  ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges {
+
+  def offsetRanges: Array[OffsetRange] = batch.asInstanceOf[Array[OffsetRange]]
+
+  override def getPartitions: Array[Partition] = batch.asInstanceOf[Array[Partition]]
+
+  override def getPreferredLocations(thePart: Partition): Seq[String] = {
+    val part = thePart.asInstanceOf[KafkaRDDPartition]
+    // TODO is additional hostname resolution necessary here
+    Seq(part.host)
+  }
+
+  private def errBeginAfterEnd(part: KafkaRDDPartition): String =
+    s"Beginning offset ${part.fromOffset} is after the ending offset ${part.untilOffset} " +
+      s"for topic ${part.topic} partition ${part.partition}. " +
+      "You either provided an invalid fromOffset, or the Kafka topic has been damaged"
+
+  private def errRanOutBeforeEnd(part: KafkaRDDPartition): String =
+    s"Ran out of messages before reaching ending offset ${part.untilOffset} " +
+    s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
+    " This should not happen, and indicates that messages may have been lost"
+
+  private def errOvershotEnd(itemOffset: Long, part: KafkaRDDPartition): String =
+    s"Got ${itemOffset} > ending offset ${part.untilOffset} " +
+    s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
+    " This should not happen, and indicates a message may have been skipped"
+
+  override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
+    val part = thePart.asInstanceOf[KafkaRDDPartition]
+    assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
+    if (part.fromOffset == part.untilOffset) {
+      log.warn(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
+        s"skipping ${part.topic} ${part.partition}")
+      Iterator.empty
+    } else {
+      new KafkaRDDIterator(part, context)
+    }
+  }
+
+  private class KafkaRDDIterator(
+      part:

[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72324906
  
  [Test build #26466 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26466/consoleFull)
 for   PR 3976 at commit 
[`2adc8f5`](https://github.com/apache/spark/commit/2adc8f591ddd0f253496c18d32b1910d29e04c8d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5341] Use maven coordinates as dependen...

2015-01-31 Thread brkyvz
Github user brkyvz commented on the pull request:

https://github.com/apache/spark/pull/4215#issuecomment-72326441
  
Interesting... The tests pass on my local computer but fail in Jenkins... The 
end-to-end test that downloads spark-avro and spark-csv succeeds, which is nice. 
Searching for artifacts at other repositories looks like it failed, but it 
actually says: "Test succeeded, but ended abruptly."





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72327242
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26466/
Test FAILed.





[GitHub] spark pull request: SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` does...

2015-01-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4193#issuecomment-72328501
  
@srowen The changes look good to me. Are you going to push more changes? 
I'm thinking of merging this in to avoid merge conflicts with other PRs.





[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-31 Thread GenTang
Github user GenTang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3986#discussion_r23890734
  
--- Diff: ec2/spark_ec2.py ---
@@ -569,6 +569,8 @@ def launch_cluster(conn, opts, cluster_name):
 master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)
 
+# Wait for the information of the just-launched instances to be 
propagated within AWS
+time.sleep(5)
--- End diff --

@nchammas 
Maybe we should insert a print here to tell the client that we are waiting 
for the information to be propagated, since there will be 5 seconds of idle 
time between `launch master in` and `wait for the cluster to enter`. 
What do you think?





[GitHub] spark pull request: SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` does...

2015-01-31 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/4193#issuecomment-72328806
  
@mengxr That's all I've got for now. It fixes javadoc 8 errors due to the 
code, but not due to the output of unidoc.





[GitHub] spark pull request: SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` does...

2015-01-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4193#issuecomment-72330127
  
Merged into master. Thanks! I left the JIRA open.





[GitHub] spark pull request: [SPARK-4983]exception handling about adding ta...

2015-01-31 Thread nchammas
Github user nchammas commented on a diff in the pull request:

https://github.com/apache/spark/pull/3986#discussion_r23891135
  
--- Diff: ec2/spark_ec2.py ---
@@ -569,6 +569,8 @@ def launch_cluster(conn, opts, cluster_name):
 master_nodes = master_res.instances
     print "Launched master in %s, regid = %s" % (zone, master_res.id)
 
+# Wait for the information of the just-launched instances to be 
propagated within AWS
+time.sleep(5)
--- End diff --

Sure. Also, in the comments, reference SPARK-4983.





[GitHub] spark pull request: [SPARK-3381] [MLlib] Eliminate bins for unorde...

2015-01-31 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/4231#issuecomment-72331831
  
Thanks :) Looking forward to your reviews. I will work on other stuff till 
then.





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-31 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r23891545
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/DeveloperApiExample.scala 
---
@@ -0,0 +1,181 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml
+
+import org.apache.spark.{SparkConf, SparkContext}
+import org.apache.spark.ml.classification.{Classifier, ClassifierParams, ClassificationModel}
+import org.apache.spark.ml.param.{Params, IntParam, ParamMap}
+import org.apache.spark.mllib.linalg.{BLAS, Vector, Vectors}
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.sql.{DataFrame, Row, SQLContext}
+
+
+/**
+ * A simple example demonstrating how to write your own learning algorithm using Estimator,
+ * Transformer, and other abstractions.
+ * This mimics [[org.apache.spark.ml.classification.LogisticRegression]].
+ * Run with
+ * {{{
+ * bin/run-example ml.DeveloperApiExample
+ * }}}
+ */
+object DeveloperApiExample {
+
+  def main(args: Array[String]) {
+    val conf = new SparkConf().setAppName("DeveloperApiExample")
+    val sc = new SparkContext(conf)
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Prepare training data.
+    val training = sparkContext.parallelize(Seq(
+      LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
+      LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
+      LabeledPoint(0.0, Vectors.dense(2.0, 1.3, 1.0)),
+      LabeledPoint(1.0, Vectors.dense(0.0, 1.2, -0.5))))
+
+    // Create a LogisticRegression instance.  This instance is an Estimator.
+    val lr = new MyLogisticRegression()
+    // Print out the parameters, documentation, and any default values.
+    println("MyLogisticRegression parameters:\n" + lr.explainParams() + "\n")
+
+    // We may set parameters using setter methods.
+    lr.setMaxIter(10)
+
+    // Learn a LogisticRegression model.  This uses the parameters stored in lr.
+    val model = lr.fit(training)
+
+    // Prepare test data.
+    val test = sparkContext.parallelize(Seq(
+      LabeledPoint(1.0, Vectors.dense(-1.0, 1.5, 1.3)),
+      LabeledPoint(0.0, Vectors.dense(3.0, 2.0, -0.1)),
+      LabeledPoint(1.0, Vectors.dense(0.0, 2.2, -1.5))))
+
+    // Make predictions on test data.
+    val sumPredictions: Double = model.transform(test)
+      .select("features", "label", "prediction")
+      .collect()
+      .map { case Row(features: Vector, label: Double, prediction: Double) =>
+        prediction
+      }.sum
+    assert(sumPredictions == 0.0,
+      "MyLogisticRegression predicted something other than 0, even though all weights are 0!")
+
+    sc.stop()
+  }
+}
+
+/**
+ * Example of defining a parameter trait for a user-defined type of [[Classifier]].
+ *
+ * NOTE: This is private since it is an example.  In practice, you may not want it to be private.
+ */
+private trait MyLogisticRegressionParams extends ClassifierParams {
+
+  /**
+   * Param for max number of iterations
+   *
+   * NOTE: The usual way to add a parameter to a model or algorithm is to include:
+   *   - val myParamName: ParamType
+   *   - def getMyParamName
+   *   - def setMyParamName
--- End diff --

Will do





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-31 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r23891546
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/Regressor.scala ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import org.apache.spark.annotation.{DeveloperApi, AlphaComponent}
+import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, 
PredictorParams}
+
+/**
+ * :: DeveloperApi ::
+ * Params for regression.
+ * Currently empty, but may add functionality later.
+ */
+@DeveloperApi
+trait RegressorParams extends PredictorParams
+
+/**
+ * :: AlphaComponent ::
+ *
+ * Single-label regression
+ *
+ * @tparam FeaturesType  Type of input features.  E.g., 
[[org.apache.spark.mllib.linalg.Vector]]
+ * @tparam Learner  Concrete Estimator type
+ * @tparam M  Concrete Model type
+ */
+@AlphaComponent
+abstract class Regressor[
+    FeaturesType,
+    Learner <: Regressor[FeaturesType, Learner, M],
--- End diff --

What about E for Estimator since Learner is not really a concept?  (The 
only weird thing is that the type parameters are E and M, which reminds me of 
EM.)





[GitHub] spark pull request: Disabling Utils.chmod700 for Windows

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4299#issuecomment-72333480
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-31 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r23891575
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -80,69 +50,157 @@ class LogisticRegression extends 
Estimator[LogisticRegressionModel] with Logisti
 
   def setRegParam(value: Double): this.type = set(regParam, value)
   def setMaxIter(value: Int): this.type = set(maxIter, value)
-  def setLabelCol(value: String): this.type = set(labelCol, value)
   def setThreshold(value: Double): this.type = set(threshold, value)
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
-  def setScoreCol(value: String): this.type = set(scoreCol, value)
-  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
 
   override def fit(dataset: SchemaRDD, paramMap: ParamMap): 
LogisticRegressionModel = {
+// Check schema
 transformSchema(dataset.schema, paramMap, logging = true)
-import dataset.sqlContext._
+
+// Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
+val oldDataset = extractLabeledPoints(dataset, paramMap)
 val map = this.paramMap ++ paramMap
-    val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
-      .map { case Row(label: Double, features: Vector) =>
-        LabeledPoint(label, features)
-      }.persist(StorageLevel.MEMORY_AND_DISK)
+val handlePersistence = dataset.getStorageLevel == StorageLevel.NONE
+if (handlePersistence) {
+  oldDataset.persist(StorageLevel.MEMORY_AND_DISK)
+}
+
+// Train model
 val lr = new LogisticRegressionWithLBFGS
 lr.optimizer
   .setRegParam(map(regParam))
   .setNumIterations(map(maxIter))
-val lrm = new LogisticRegressionModel(this, map, 
lr.run(instances).weights)
-instances.unpersist()
+val oldModel = lr.run(oldDataset)
+val lrm = new LogisticRegressionModel(this, map, oldModel.weights, 
oldModel.intercept)
+
+if (handlePersistence) {
+  oldDataset.unpersist()
+}
+
 // copy model params
 Params.inheritValues(map, this, lrm)
 lrm
   }
 
-  private[ml] override def transformSchema(schema: StructType, paramMap: 
ParamMap): StructType = {
-validateAndTransformSchema(schema, paramMap, fitting = true)
-  }
+  override protected def featuresDataType: DataType = new VectorUDT
 }
 
+
 /**
  * :: AlphaComponent ::
+ *
  * Model produced by [[LogisticRegression]].
  */
 @AlphaComponent
 class LogisticRegressionModel private[ml] (
 override val parent: LogisticRegression,
 override val fittingParamMap: ParamMap,
-weights: Vector)
-  extends Model[LogisticRegressionModel] with LogisticRegressionParams {
+val weights: Vector,
+val intercept: Double)
+  extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]
+  with LogisticRegressionParams {
+
+  setThreshold(0.5)
 
   def setThreshold(value: Double): this.type = set(threshold, value)
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
-  def setScoreCol(value: String): this.type = set(scoreCol, value)
-  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
 
-  private[ml] override def transformSchema(schema: StructType, paramMap: 
ParamMap): StructType = {
-validateAndTransformSchema(schema, paramMap, fitting = false)
+  private val margin: Vector => Double = (features) => {
+    BLAS.dot(features, weights) + intercept
+  }
+
+  private val score: Vector => Double = (features) => {
+    val m = margin(features)
+    1.0 / (1.0 + math.exp(-m))
   }
 
   override def transform(dataset: SchemaRDD, paramMap: ParamMap): 
SchemaRDD = {
+// Check schema
 transformSchema(dataset.schema, paramMap, logging = true)
+
 import dataset.sqlContext._
 val map = this.paramMap ++ paramMap
-    val score: Vector => Double = (v) => {
-      val margin = BLAS.dot(v, weights)
-      1.0 / (1.0 + math.exp(-margin))
+
+// Output selected columns only.
+// This is a bit complicated since it tries to avoid repeated 
computation.
--- End diff --

I'll add a TODO for now...but hope I have time to make it part of this PR.
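
On the caching in `fit()` above: the pattern generalizes nicely. A minimal
sketch using only the public RDD API (the `withCaching` helper name is made up
for illustration):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    def withCaching[T, R](input: RDD[T])(train: RDD[T] => R): R = {
      // Only cache when the caller has not already done so.
      val handlePersistence = input.getStorageLevel == StorageLevel.NONE
      if (handlePersistence) input.persist(StorageLevel.MEMORY_AND_DISK)
      try train(input)
      // Undo exactly the caching this method added, nothing more.
      finally if (handlePersistence) input.unpersist()
    }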




[GitHub] spark pull request: Disabling Utils.chmod700 for Windows

2015-01-31 Thread MartinWeindel
Github user MartinWeindel commented on the pull request:

https://github.com/apache/spark/pull/4299#issuecomment-72333732
  
Yes, it's a regression. It worked with 1.2.0.





[GitHub] spark pull request: Disabling Utils.chmod700 for Windows

2015-01-31 Thread MartinWeindel
GitHub user MartinWeindel opened a pull request:

https://github.com/apache/spark/pull/4299

Disabling Utils.chmod700 for Windows

This patch makes Spark 1.2.1rc2 work again on Windows.

Without it you get the following log output when creating a Spark context:
INFO  org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in 
 Ignoring this directory.
ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any 
local dir.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MartinWeindel/spark branch-1.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4299.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4299


commit ac4749c1a2ef180b3ed2b01ff6397c60173a7b33
Author: mweindel m.wein...@usu-software.de
Date:   2015-01-30T16:09:25Z

fixed chmod700 for Windows







[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-31 Thread tomerk
Github user tomerk commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r23891560
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/Regressor.scala ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import org.apache.spark.annotation.{DeveloperApi, AlphaComponent}
+import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, 
PredictorParams}
+
+/**
+ * :: DeveloperApi ::
+ * Params for regression.
+ * Currently empty, but may add functionality later.
+ */
+@DeveloperApi
+trait RegressorParams extends PredictorParams
+
+/**
+ * :: AlphaComponent ::
+ *
+ * Single-label regression
+ *
+ * @tparam FeaturesType  Type of input features.  E.g., 
[[org.apache.spark.mllib.linalg.Vector]]
+ * @tparam Learner  Concrete Estimator type
+ * @tparam M  Concrete Model type
+ */
+@AlphaComponent
+abstract class Regressor[
+    FeaturesType,
+    Learner <: Regressor[FeaturesType, Learner, M],
--- End diff --

Maybe it would also make sense to have FeaturesType be a single letter for 
consistency?





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-31 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-72333635
  
@shivaram  Thanks for the comments!  As far as adding a strongly typed 
protected API later on, we might be able to, but I don't see it as a high 
priority, especially since the DataFrame changes are making it easier to work 
with schema.  What are your thoughts on it?





[GitHub] spark pull request: Disabling Utils.chmod700 for Windows

2015-01-31 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/4299#issuecomment-72333666
  
(This needs a JIRA.) Did this work in 1.2.0? The question will be whether 
it's a regression or not.





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-31 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/3637#discussion_r23891605
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/Regressor.scala ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import org.apache.spark.annotation.{DeveloperApi, AlphaComponent}
+import org.apache.spark.ml.impl.estimator.{PredictionModel, Predictor, 
PredictorParams}
+
+/**
+ * :: DeveloperApi ::
+ * Params for regression.
+ * Currently empty, but may add functionality later.
+ */
+@DeveloperApi
+trait RegressorParams extends PredictorParams
+
+/**
+ * :: AlphaComponent ::
+ *
+ * Single-label regression
+ *
+ * @tparam FeaturesType  Type of input features.  E.g., 
[[org.apache.spark.mllib.linalg.Vector]]
+ * @tparam Learner  Concrete Estimator type
+ * @tparam M  Concrete Model type
+ */
+@AlphaComponent
+abstract class Regressor[
+    FeaturesType,
+    Learner <: Regressor[FeaturesType, Learner, M],
--- End diff --

E is fine by me. And we could change FeaturesType, but that seems pretty 
descriptive, so I'm okay with leaving that in.
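
For concreteness, the rename under discussion would look roughly like this
(the bound on `M` and the `RegressionModel` stand-in are assumed from the
surrounding PR, included only to make the snippet compile):

    abstract class RegressionModel[FeaturesType, M <: RegressionModel[FeaturesType, M]]

    abstract class Regressor[
        FeaturesType,                        // kept as-is: descriptive, per the comments above
        E <: Regressor[FeaturesType, E, M],  // was: Learner
        M <: RegressionModel[FeaturesType, M]]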





[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-31 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-72335560
  
I guess the more usable DataFrame does make it better. Agree that we can 
revisit this after we write a few more estimators and see how they look.





[GitHub] spark pull request: Disabling Utils.chmod700 for Windows

2015-01-31 Thread zhzhan
Github user zhzhan commented on the pull request:

https://github.com/apache/spark/pull/4299#issuecomment-72335760
  
I didn't try 1.2.1, but in 1.2.0 I didn't run into such problems on Windows.





[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4294#issuecomment-72335894
  
  [Test build #26467 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26467/consoleFull)
 for   PR 4294 at commit 
[`d37b19c`](https://github.com/apache/spark/commit/d37b19c3d1891d077205772c97534fdff04830a1).
 * This patch merges cleanly.





[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4294#issuecomment-72337568
  
  [Test build #26467 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26467/consoleFull)
 for   PR 4294 at commit 
[`d37b19c`](https://github.com/apache/spark/commit/d37b19c3d1891d077205772c97534fdff04830a1).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `protected[sql] class DDLException(message: String) extends 
Exception(message)`
  * `trait TableScan extends BaseRelation `
  * `trait PrunedScan extends BaseRelation `
  * `trait PrunedFilteredScan extends BaseRelation `
  * `trait CatalystScan extends BaseRelation `
  * `trait InsertableRelation extends BaseRelation `
  * `case class CreateMetastoreDataSourceAsSelect(`






[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4294#issuecomment-72337572
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26467/
Test FAILed.





[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4294#issuecomment-72337823
  
  [Test build #26468 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26468/consoleFull)
 for   PR 4294 at commit 
[`95a7c71`](https://github.com/apache/spark/commit/95a7c716b8c6864fb4e0c73ec01df854b8475e9f).
 * This patch merges cleanly.





[GitHub] spark pull request: Disabling Utils.chmod700 for Windows

2015-01-31 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/4299#discussion_r23892404
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -255,12 +255,17 @@ private[spark] object Utils extends Logging {
* @return true if the permissions were successfully changed, false 
otherwise.
*/
   def chmod700(file: File): Boolean = {
-    file.setReadable(false, false) &&
-    file.setReadable(true, true) &&
-    file.setWritable(false, false) &&
-    file.setWritable(true, true) &&
-    file.setExecutable(false, false) &&
-    file.setExecutable(true, true)
+    if (!isWindows) {
+      file.setReadable(false, false) &&
+      file.setReadable(true, true) &&
+      file.setWritable(false, false) &&
+      file.setWritable(true, true) &&
+      file.setExecutable(false, false) &&
+      file.setExecutable(true, true)
+    } else {
+      // this logic does not work for Windows
--- End diff --

Doesn't this comment belong after `if (!isWindows)`, and not in the `else` 
clause that presumably does work for Windows?
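
For reference, the underlying issue is that `java.io.File` cannot revoke the
non-owner read/execute bits on Windows, so part of the call chain always
returns false and `chmod700` as a whole reports failure. A minimal sketch of
the guarded version with the comment moved as suggested (the inline
`isWindows` check is an assumption; Spark has its own helper for this):

    import java.io.File

    val isWindows = System.getProperty("os.name").toLowerCase.startsWith("windows")

    def chmod700(file: File): Boolean = {
      if (!isWindows) {
        file.setReadable(false, false) &&
        file.setReadable(true, true) &&
        file.setWritable(false, false) &&
        file.setWritable(true, true) &&
        file.setExecutable(false, false) &&
        file.setExecutable(true, true)
      } else {
        // This permission model cannot be expressed through java.io.File on
        // Windows, so accept the directory with its default ACLs.
        true
      }
    }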





[GitHub] spark pull request: [SPARK-5422] Add support for sending Graphite ...

2015-01-31 Thread hammer
Github user hammer commented on the pull request:

https://github.com/apache/spark/pull/4218#issuecomment-72352426
  
@pwendell @rxin we'd love to get this patch into 1.3 to improve the 
scalability of metrics collection. It's a pretty straightforward patch. Could 
you guys give it some Databricks love this week? Thanks!





[GitHub] spark pull request: [Minor][SQL] Little refactor DataFrame related...

2015-01-31 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4298#discussion_r23895621
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -139,12 +139,13 @@ class DataFrame protected[sql](
   *   val rdd: RDD[(Int, String)] = ...
   *   rdd.toDataFrame  // this implicit conversion creates a DataFrame with column name _1 and _2
   *   rdd.toDataFrame("id", "name")  // this creates a DataFrame with column name "id" and "name"
+   *   rdd.toDataFrame("id")  // this creates a DataFrame with only column name "id"
* }}}
*/
   @scala.annotation.varargs
   def toDataFrame(colName: String, colNames: String*): DataFrame = {
 val newNames = colName +: colNames
-require(schema.size == newNames.size,
+    require(schema.size >= newNames.size,
--- End diff --

I'm worried that silently dropping fields here can cause trouble without 
the user knowing. Can you revert this line?

If we see a lot of requests to make it less strict, we can always relax it 
in the future. But we cannot go the other way.
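
The failure mode in miniature, as plain Scala with a hypothetical column-name
list standing in for a real schema: the strict `==` check rejects an arity
mismatch outright, whereas `>=` would quietly drop the trailing columns.

    def rename(schema: Seq[String], newNames: Seq[String]): Seq[String] = {
      require(schema.size == newNames.size,
        s"Expected ${schema.size} column names but got ${newNames.size}")
      newNames
    }

    rename(Seq("_1", "_2", "_3"), Seq("id", "name"))  // throws, rather than silently losing "_3"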





[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...

2015-01-31 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/4294#discussion_r23892984
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala ---
@@ -145,6 +145,11 @@ abstract class PrunedFilteredScan extends BaseRelation 
{
  * for experimentation.
  */
 @Experimental
-abstract class CatalystScan extends BaseRelation {
+trait CatalystScan extends BaseRelation {
   def buildScan(requiredColumns: Seq[Attribute], filters: 
Seq[Expression]): RDD[Row]
 }
+
+@DeveloperApi
+trait InsertableRelation extends BaseRelation {
+  def insertInto(data: DataFrame, overwrite: Boolean): Unit
--- End diff --

The method name looks a bit weird when you put them together: 
`relation.insertInto(dataFrame)`; it reads as if we are inserting a relation into a 
data frame... Maybe just `insert`?





[GitHub] spark pull request: [SPARK-5422] Add support for sending Graphite ...

2015-01-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4218





[GitHub] spark pull request: [SPARK-5413] bump metrics dependency to v3.1.0

2015-01-31 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4209#issuecomment-72355239
  
Actually I merged https://github.com/apache/spark/pull/4218 which contains 
this patch.





[GitHub] spark pull request: [SPARK-5422] Add support for sending Graphite ...

2015-01-31 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4218#issuecomment-72355227
  
Merging in master. Thanks.





[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4294#issuecomment-72340638
  
  [Test build #26468 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26468/consoleFull)
 for   PR 4294 at commit 
[`95a7c71`](https://github.com/apache/spark/commit/95a7c716b8c6864fb4e0c73ec01df854b8475e9f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `val elem = s"array (class $`
  * `val elem = s"externalizable object (class $`
  * `val elem = s"object (class $`
  * `  implicit class ObjectStreamClassMethods(val desc: ObjectStreamClass) 
extends AnyVal `
  * `class IsotonicRegressionModel (`
  * `protected[sql] class DDLException(message: String) extends 
Exception(message)`
  * `trait TableScan extends BaseRelation `
  * `trait PrunedScan extends BaseRelation `
  * `trait PrunedFilteredScan extends BaseRelation `
  * `trait CatalystScan extends BaseRelation `
  * `trait InsertableRelation extends BaseRelation `
  * `case class CreateMetastoreDataSourceAsSelect(`






[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4294#issuecomment-72346268
  
  [Test build #26469 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26469/consoleFull)
 for   PR 4294 at commit 
[`1a719a5`](https://github.com/apache/spark/commit/1a719a54edbf891f80ea2283ac17a8dd5482e81e).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72349756
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26470/
Test FAILed.





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72349753
  
  [Test build #26470 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26470/consoleFull)
 for   PR 3976 at commit 
[`47d2fc3`](https://github.com/apache/spark/commit/47d2fc35e53a8851790607085bc67e94736358d6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4294#issuecomment-72340640
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26468/
Test PASSed.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-31 Thread harishreedharan
Github user harishreedharan commented on a diff in the pull request:

https://github.com/apache/spark/pull/4292#discussion_r23893987
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -820,7 +822,10 @@ class SparkContext(config: SparkConf) extends Logging 
with ExecutorAllocationCli
   kClass: Class[K],
   vClass: Class[V]): RDD[(K, V)] = {
 assertNotStopped()
-new NewHadoopRDD(this, fClass, kClass, vClass, conf)
+// Add necessary security credentials to the JobConf. Required to 
access secure HDFS.
+val jconf = new JobConf(conf)
+SparkHadoopUtil.get.addCredentials(jconf)
--- End diff --

Since security is supported only in Yarn mode, this should be fine. 
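
A sketch of the pattern the hunk above applies, assuming the same
`SparkHadoopUtil` helper it calls:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.deploy.SparkHadoopUtil

    def withCredentials(conf: Configuration): JobConf = {
      val jconf = new JobConf(conf)
      // Copies the current user's delegation tokens into the job configuration;
      // effectively a no-op on deployments without Kerberos-secured HDFS.
      SparkHadoopUtil.get.addCredentials(jconf)
      jconf
    }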





[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4294#issuecomment-72348214
  
  [Test build #26469 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26469/consoleFull)
 for   PR 4294 at commit 
[`1a719a5`](https://github.com/apache/spark/commit/1a719a54edbf891f80ea2283ac17a8dd5482e81e).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `val elem = s"array (class $`
  * `val elem = s"externalizable object (class $`
  * `val elem = s"object (class $`
  * `implicit class ObjectStreamClassMethods(val desc: ObjectStreamClass) extends AnyVal `
  * `class IsotonicRegressionModel (`
  * `protected[sql] class DDLException(message: String) extends Exception(message)`
  * `trait TableScan extends BaseRelation `
  * `trait PrunedScan extends BaseRelation `
  * `trait PrunedFilteredScan extends BaseRelation `
  * `trait CatalystScan extends BaseRelation `
  * `trait InsertableRelation extends BaseRelation `
  * `case class CreateMetastoreDataSourceAsSelect(`






[GitHub] spark pull request: [WIP][SPARK-5501][SQL] Write support for the d...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4294#issuecomment-72348215
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26469/
Test PASSed.





[GitHub] spark pull request: [SPARK-4382] Add locations parameter to Twitte...

2015-01-31 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/3246#issuecomment-72351744
  
@srowen Does @tdas have time to look at this? If not, maybe someone else can?









[GitHub] spark pull request: [SPARK-5413] bump metrics dependency to v3.1.0

2015-01-31 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4209#issuecomment-72355173
  
@andrewor14 are you going to merge this?





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72348373
  
  [Test build #26470 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26470/consoleFull)
 for   PR 3976 at commit 
[`47d2fc3`](https://github.com/apache/spark/commit/47d2fc35e53a8851790607085bc67e94736358d6).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-31 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4047#issuecomment-72348800
  
*Update on tests*

Summary:
* On a small dataset (20 newsgroups), it seems to work fine (on my laptop).
* On a big dataset (Wikipedia dump with close to 1 billion tokens), it's 
been hard to get it to run for more than 10 or 20 iterations (on a 16-node EC2 
cluster).

Details:

Small dataset: You can see the output here: 
[https://github.com/jkbradley/spark/blob/lda-tmp/20news.lda.out].  The log 
likelihood improves with each iteration, and iteration running times stay about 
the same throughout training.  The topics are really nicely divided among the 
newsgroups.  (But I did run this using 20 topics.)  I used 100 iterations and 
the stopwords mentioned above.

Large dataset: Even with checkpointing, it has been hard to run for many 
iterations, mainly because of shuffle files and checkpoint files building up.  
I need to spend some more time running tests.  Currently, the results on the 
Wikipedia dump do not look good; topics are pretty much all the same.  It is 
unclear if this is because of poor convergence, a need for parameter tuning, a 
need for supporting sparsity as mentioned above (which might help to force 
topics to differentiate), or a need for better initialization (since EM can 
have lots of trouble with LDA's many local minima).
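
As a side note on the shuffle/checkpoint file buildup: the pattern referred to above is periodic checkpointing, which truncates RDD lineage every few iterations. A hedged sketch follows; the path, interval, and the mapValues step standing in for one EM iteration are all illustrative, not the PR's code:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
sc.setCheckpointDir("hdfs:///tmp/lda-checkpoints")  // assumed location

val checkpointInterval = 10  // illustrative choice
var state: RDD[(Long, Double)] = sc.parallelize(0L until 1000L).map(i => (i, 0.0))

for (iter <- 1 to 100) {
  state = state.mapValues(_ + 1.0)  // stand-in for one EM iteration
  if (iter % checkpointInterval == 0) {
    state.checkpoint()  // truncate the lineage at this point
    state.count()       // force the checkpoint to materialize
  }
}
```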





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread lianhuiwang
Github user lianhuiwang closed the pull request at:

https://github.com/apache/spark/pull/3976





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread lianhuiwang
GitHub user lianhuiwang reopened a pull request:

https://github.com/apache/spark/pull/3976

[SPARK-5173] Support Python applications running in yarn-cluster mode

Currently, spark-submit does not support running a Python application in 
yarn-cluster mode, so this patch modifies the submit code and YARN's 
ApplicationMaster (AM) to support it.
By specifying the .py file or primaryResource file via spark-submit, we can 
make PySpark run in yarn-cluster mode, for example:
spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g 
--executor-memory 1g xx.py --primaryResource yy.conf
This configuration is the same as for PySpark in yarn-client mode.
First, we put the local paths of the .py file and primaryResource into YARN's 
dist.files so they can be distributed to the slave nodes. Then spark-submit 
passes --py-files and --primaryResource on to yarn.Client, which forwards 
them to the ApplicationMaster. There, the user class is 
org.apache.spark.deploy.PythonRunner and the user args are the primaryResource 
and --py-files, which is what lets PySpark run on the ApplicationMaster (see 
the sketch after this description).
@JoshRosen @tgravescs @sryza
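
A minimal sketch of the user-class substitution described above; all names here are illustrative assumptions, not the patch's actual code:

```
// Sketch: on the ApplicationMaster, substitute PythonRunner as the user
// class when the primary resource is a .py file.
object UserClassSelection {
  def select(
      userClass: String,
      primaryResource: String,
      pyFiles: Seq[String],
      userArgs: Seq[String]): (String, Seq[String]) = {
    if (primaryResource.endsWith(".py")) {
      // PythonRunner receives the primary .py file, the comma-separated
      // --py-files list, and then the application's own arguments.
      ("org.apache.spark.deploy.PythonRunner",
        Seq(primaryResource, pyFiles.mkString(",")) ++ userArgs)
    } else {
      (userClass, userArgs)  // ordinary JVM application path
    }
  }
}
```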

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lianhuiwang/spark SPARK-5173

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3976.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3976


commit 9c941bc59527e594ee1d155c00cb8e55d7c40fe8
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2015-01-09T12:58:24Z

support python application running on yarn cluster mode

commit 172eec10b9daaf9ed838e821474d28871ab63462
Author: Wang Lianhui lianhuiwan...@gmail.com
Date:   2015-01-09T15:01:52Z

fix a min submit's bug

commit f1f55b6eb4b65499be8e182e857d89a158873234
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2015-01-29T11:13:35Z

when yarn-cluster, all python files can be non-local

commit 905a10610532578c774e58d12b927597330fb9ff
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2015-01-31T03:29:09Z

update with sryza and andrewor 's comments

commit 097a5ec37456bf9d13a952f4108a750b9f9f84d0
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2015-01-31T03:59:06Z

fix line length exceeds 100

commit 5b300648fe53d9de604e8afce7580fddfe6bbaef
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2015-01-31T12:18:22Z

add test

commit d60bc6069cf65637622472ef1cd2715df53c
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2015-01-31T14:07:03Z

fix test

commit 2adc8f591ddd0f253496c18d32b1910d29e04c8d
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2015-01-31T16:35:01Z

add spark.test.home

commit 47d2fc35e53a8851790607085bc67e94736358d6
Author: lianhuiwang lianhuiwan...@gmail.com
Date:   2015-02-01T02:40:25Z

fix test







[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23888056
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
       case None => score
     }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
--- End diff --

I tried that, and it fails to compile.  (I tried removing sqlContext from 
save(), as well as making it an implicit val without the import.  Neither 
worked.)  Is there another import I need?
```
[error] 
/Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala:89:
 type mismatch;
[error]  found   : 
org.apache.spark.rdd.RDD[org.apache.spark.mllib.classification.LogisticRegressionModel.Metadata]
[error]  required: org.apache.spark.sql.DataFrame
[error] Error occurred in an application involving default arguments.
[error] val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
[error]^
[error] 
/Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala:94:
 type mismatch;
[error]  found   : 
org.apache.spark.rdd.RDD[org.apache.spark.mllib.classification.LogisticRegressionModel.Data]
[error]  required: org.apache.spark.sql.DataFrame
[error] Error occurred in an application involving default arguments.
[error] val dataRDD: DataFrame = sc.parallelize(Seq(data))
[error]^
[error] two errors found
[error] (mllib/compile:compile) Compilation failed
```
It may not be an issue though, with the other change you suggested below.  
I'll see.
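
For comparison, a sketch of the explicit alternative, assuming the 1.3-era `SQLContext.createDataFrame` API (the API was in flux at the time, and `Metadata` here is a hypothetical stand-in for the case class being saved):

```
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical stand-in for the metadata case class.
case class Metadata(clazz: String, version: String)

// Build the DataFrame explicitly instead of relying on the implicit
// RDD-to-DataFrame conversion pulled in by `import sqlContext._`.
def saveMetadata(sc: SparkContext, path: String): Unit = {
  val sqlContext = new SQLContext(sc)
  val metadata = Metadata("LogisticRegressionModel", "1.0")
  val metadataDF: DataFrame = sqlContext.createDataFrame(sc.parallelize(Seq(metadata)))
  metadataDF.toJSON.saveAsTextFile(path)  // illustrative persistence step
}
```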





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72313229
  
  [Test build #26459 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26459/consoleFull)
 for   PR 4233 at commit 
[`638fa81`](https://github.com/apache/spark/commit/638fa81e40ef64558a6de70b4ad669fb63c46ada).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait Exportable `
  * `trait Importable[Model <: Exportable] `






[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72313232
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26459/
Test FAILed.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72311863
  
  [Test build #26459 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26459/consoleFull)
 for   PR 4233 at commit 
[`638fa81`](https://github.com/apache/spark/commit/638fa81e40ef64558a6de70b4ad669fb63c46ada).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` does...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4193#issuecomment-72313729
  
  [Test build #26460 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26460/consoleFull)
 for   PR 4193 at commit 
[`6ffb27f`](https://github.com/apache/spark/commit/6ffb27fde0224f08c7679d9e645ae1cbca1b1e98).
 * This patch merges cleanly.





[GitHub] spark pull request: [Minor][SQL] Refactor DataFrame related codes

2015-01-31 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/4298

[Minor][SQL] Refactor DataFrame-related code

Simplify some code related to DataFrame:

*  Call `toAttributes` instead of a `map`.
*  The original `createDataFrame` creates the `StructType` and its attributes 
in a redundant way; refactor it to create the `StructType` and call 
`toAttributes` on it directly (see the sketch after this list).
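
A sketch of the refactor, assuming the Catalyst `StructType.toAttributes` helper and the 1.3-era package layout:

```
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val fields = Seq(StructField("name", StringType), StructField("city", StringType))
val schema = StructType(fields)

// Before: build each AttributeReference by hand with a map over the fields.
val attrsViaMap: Seq[AttributeReference] =
  fields.map(f => AttributeReference(f.name, f.dataType, f.nullable)())

// After: create the StructType once and derive the attributes directly.
val attrsViaSchema: Seq[AttributeReference] = schema.toAttributes
```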

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 refactor_df

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4298.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4298


commit 2c9f370b312d241e6c72abe812614ac1590a032f
Author: Liang-Chi Hsieh vii...@gmail.com
Date:   2015-01-31T11:58:36Z

Just refactor DataFrame codes.







[GitHub] spark pull request: SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` does...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4193#issuecomment-72315407
  
  [Test build #26460 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26460/consoleFull)
 for   PR 4193 at commit 
[`6ffb27f`](https://github.com/apache/spark/commit/6ffb27fde0224f08c7679d9e645ae1cbca1b1e98).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` does...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4193#issuecomment-72315410
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26460/
Test FAILed.





[GitHub] spark pull request: [Minor][SQL] Little refactor DataFrame related...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4298#issuecomment-72315508
  
  [Test build #26461 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26461/consoleFull)
 for   PR 4298 at commit 
[`2c9f370`](https://github.com/apache/spark/commit/2c9f370b312d241e6c72abe812614ac1590a032f).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5212][SQL] Add support of schema-less, ...

2015-01-31 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/4014#issuecomment-72315535
  
@rxin I have added the explanation for this feature. Would you have time to 
review this PR and see if it is OK to merge? Thanks!





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3976#issuecomment-72315747
  
  [Test build #26462 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26462/consoleFull)
 for   PR 3976 at commit 
[`5b30064`](https://github.com/apache/spark/commit/5b300648fe53d9de604e8afce7580fddfe6bbaef).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5173]support python application running...

2015-01-31 Thread lianhuiwang
Github user lianhuiwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/3976#discussion_r23888565
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -430,6 +430,10 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments,
   private def startUserClass(): Thread = {
     logInfo("Starting the user JAR in a separate Thread")
--- End diff --


https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L451
@andrewor14 this is not outdated; it is the same as master's code.





[GitHub] spark pull request: SPARK-3359 [CORE] [DOCS] `sbt/sbt unidoc` does...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4193#issuecomment-72316297
  
  [Test build #26463 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26463/consoleFull)
 for   PR 4193 at commit 
[`5b33f66`](https://github.com/apache/spark/commit/5b33f66424155bc4d3d58315a4ef2168177b802f).
 * This patch merges cleanly.

