[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38653629
  
@yinxusen It looks good to me now. @rxin , could you please make a final 
scan and merge it if it looks good to you? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10965746
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java 
---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.ByteStreams;
+import com.google.common.io.Closeables;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * An {@link org.apache.hadoop.mapreduce.RecordReader} for reading a 
single whole text file out in a
+ * key-value pair, where the key is the file path and the value is the 
entire content of the file.
+ */
+public class WholeTextFileRecordReader extends RecordReaderString, Text {
+  private Path path;
+
+  private String key = null;
+  private Text value = null;
+
+  private boolean processed = false;
+
+  private FileSystem fs;
+
+  public WholeTextFileRecordReader(
+  CombineFileSplit split,
+  TaskAttemptContext context,
+  Integer index)
+throws IOException {
+path = split.getPath(index);
+fs = path.getFileSystem(context.getConfiguration());
+  }
+
+  @Override
+  public void initialize(InputSplit arg0, TaskAttemptContext arg1)
+throws IOException, InterruptedException {
+  }
+
+  @Override
+  public void close() throws IOException {
+  }
+
+  @Override
+  public float getProgress() throws IOException {
+return processed ? 1.0f : 0.0f;
+  }
+
+  @Override
+  public String getCurrentKey() throws IOException, InterruptedException {
+return key;
+  }
+
+  @Override
+  public Text getCurrentValue() throws IOException, InterruptedException{
--- End diff --

add a space after InterruptedException


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10965749
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java 
---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.ByteStreams;
+import com.google.common.io.Closeables;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * An {@link org.apache.hadoop.mapreduce.RecordReader} for reading a 
single whole text file out in a
+ * key-value pair, where the key is the file path and the value is the 
entire content of the file.
+ */
+public class WholeTextFileRecordReader extends RecordReaderString, Text {
+  private Path path;
+
+  private String key = null;
+  private Text value = null;
+
+  private boolean processed = false;
--- End diff --

can u add some inline comment to explain what processed means? You don't 
want the reader to go through nextKeyValue to find out what it does ..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10965782
  
--- Diff: project/SparkBuild.scala ---
@@ -358,7 +358,7 @@ object SparkBuild extends Build {
   def mllibSettings = sharedSettings ++ Seq(
 name := spark-mllib,
 libraryDependencies ++= Seq(
-  org.jblas % jblas % 1.2.3
+  org.jblas% jblas  % 1.2.3
--- End diff --

Let's not change this line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10965793
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala ---
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib
+
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.mllib.input.WholeTextFileInputFormat
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext
+import org.apache.spark.SparkContext._
+
+/**
+ * Extra functions available on SparkContext of mllib through an implicit 
conversion. Import
+ * `org.apache.spark.mllib.MLContext._` at the top of your program to use 
these functions.
+ */
+class MLContext(self: SparkContext) {
+
+  /**
+   * Read a directory of text files from HDFS, a local file system 
(available on all nodes), or any
--- End diff --

An example here would be great.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10965949
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala
 ---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util
+
+import java.io.DataOutputStream
+import java.io.FileOutputStream
+import java.nio.file.Files
--- End diff --

Does nio.file.Files exist on JDK 6? If not... let's change it. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10966132
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java 
---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
--- End diff --

Never mind, I'd like to do that. Let me try to change them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38667186
  
 Build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38667187
  
Build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38671709
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13468/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38671708
  
Build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10993561
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala ---
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib
+
+import org.apache.spark.mllib.input.WholeTextFileInputFormat
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext
+
+/**
+ * Extra functions available on SparkContext of mllib through an implicit 
conversion. Import
+ * `org.apache.spark.mllib.MLContext._` at the top of your program to use 
these functions.
+ */
+class MLContext(self: SparkContext) {
+
+  /**
+   * Read a directory of text files from HDFS, a local file system 
(available on all nodes), or any
+   * Hadoop-supported file system URI. For example,
+   *   {{{
+   *   hdfs://a-hdfs-path/part-0
--- End diff --

Try this ...
```
/* If you have the following files:
hdfs://a-hdfs-path/part-0
hdfs://a-hdfs-path/part-1
...
hdfs://a-hdfs-path/part-n
*/

val rdd = mlContext.wholeTextFile(hdfs://a-hdfs-path)
// RDD contains
//     explain what the content looks like
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10993628
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/input/WholeTextFileRecordReader.scala
 ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input
+
+import com.google.common.io.{ByteStreams, Closeables}
+
+import org.apache.hadoop.io.Text
+import org.apache.hadoop.mapreduce.InputSplit
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
+import org.apache.hadoop.mapreduce.RecordReader
+import org.apache.hadoop.mapreduce.TaskAttemptContext
+
+/**
+ * A [[org.apache.hadoop.mapreduce.RecordReader RecordReader]] for reading 
a single whole text file
+ * out in a key-value pair, where the key is the file path and the value 
is the entire content of
+ * the file.
+ */
+private[mllib] class WholeTextFileRecordReader(
+split: CombineFileSplit,
+context: TaskAttemptContext,
+index: Integer)
+  extends RecordReader[String, String] {
+
+  private val path = split.getPath(index)
+  private val fs = path.getFileSystem(context.getConfiguration)
+
+  // True means the current file has been processed, then skip it.
+  private var processed = false
+
+  private val key = path.toString
+  private var value: String = null
+
+  override def initialize(split: InputSplit, context: TaskAttemptContext) 
= {}
+
+  override def close() = {}
+
+  override def getProgress = if(processed) 1.0f else 0.0f
+
+  override def getCurrentKey = key
+
+  override def getCurrentValue = value
+
+  override def nextKeyValue = {
+if(!processed) {
--- End diff --

add a space after if


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38727263
  
lgtm other than the couple minor comments


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38756255
  
 Build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38756257
  
Build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38761355
  
Somehow this is no longer mergeable. Do you mind bring it up to date? 
Everything looks good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread yinxusen
Github user yinxusen commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38761534
  
Sure, let me update it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38765336
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread yinxusen
Github user yinxusen commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38765347
  
Oh... Is that OK? That's strange...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38769629
  
Yea - perhaps apply it to trunk. Feel free to create another PR if needed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38769753
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13493/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-26 Thread yinxusen
Github user yinxusen closed the pull request at:

https://github.com/apache/spark/pull/164


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38562620
  
 Build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38562621
  
Build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38568166
  
Build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38568167
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13432/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10954776
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala ---
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib
+
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.input.WholeTextFileInputFormat
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext
+import org.apache.spark.SparkContext._
+
+/**
+ * Extra functions available on SparkContext of mllib through an implicit 
conversion. Import
+ * `org.apache.spark.mllib.MLContext._` at the top of your program to use 
these functions.
+ */
+class MLContext(self: SparkContext) extends Logging {
+
+  /**
+   * Read a directory of text files from HDFS, a local file system 
(available on all nodes), or any
+   * Hadoop-supported file system URI. Each file is read as a single 
record and returned in a
+   * key-value pair, where the key is the path of each file, and the value 
is the content of each
+   * file.
+   */
+  def wholeTextFile(path: String): RDD[(String, String)] = {
+self.newAPIHadoopFile(
+  path,
+  classOf[WholeTextFileInputFormat],
+  classOf[String],
+  classOf[Text]).mapValues(_.toString)
+  }
+}
+
+
+/**
+ * The MLContext object contains a number of implicit conversions and 
parameters for use with
+ * various mllib features.
+ */
+object MLContext extends Logging {
+  implicit def mlContextToSparkContext(sc: SparkContext) = new 
MLContext(sc)
--- End diff --

sparkContextToMLContext


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-25 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10955063
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala ---
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib
+
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.Logging
+import org.apache.spark.mllib.input.WholeTextFileInputFormat
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext
+import org.apache.spark.SparkContext._
+
+/**
+ * Extra functions available on SparkContext of mllib through an implicit 
conversion. Import
+ * `org.apache.spark.mllib.MLContext._` at the top of your program to use 
these functions.
+ */
+class MLContext(self: SparkContext) extends Logging {
--- End diff --

Logging is not used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38643350
  
Build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10873319
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java 
---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * The specific InputFormat reads files in HDFS or local disk into pair 
(filename, content) format.
+ * It will be called by HadoopRDD to generate new 
WholeTextFileRecordReader.
--- End diff --

Sorry I forgot this is Java. Then you should use `@link` instead of `[[]]`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10877199
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java 
---
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.Closeables;
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * An codeorg.apache.hadoop.mapreduce.RecordReader/code for reading 
whole text file out in
+ * (filename, content) format. Each element in split is an record of a 
unique, whole file. File name
+ * is the full path name for easy deduplicate.
+ */
+public class WholeTextFileRecordReader extends RecordReaderString, Text {
+  private Path path;
+
+  private String key = null;
+  private Text value = null;
+
+  private boolean processed = false;
+
+  private FileSystem fs;
+
+  public WholeTextFileRecordReader(
+  CombineFileSplit split,
+  TaskAttemptContext context,
+  Integer index)
+throws IOException {
+path = split.getPath(index);
+fs = path.getFileSystem(context.getConfiguration());
+  }
+
+  @Override
+  public void initialize(InputSplit arg0, TaskAttemptContext arg1)
+throws IOException, InterruptedException {
+  }
+
+  @Override
+  public void close() throws IOException {
+  }
+
+  @Override
+  public float getProgress() throws IOException {
+return processed ? 1.0f : 0.0f;
+  }
+
+  @Override
+  public String getCurrentKey() throws IOException, InterruptedException {
+return key;
+  }
+
+  @Override
+  public Text getCurrentValue() throws IOException, InterruptedException{
+return value;
+  }
+
+  @Override
+  public boolean nextKeyValue() throws IOException {
+if (!processed) {
+  if (key == null) {
+key = path.toString();
+  }
+  if (value == null) {
+value = new Text();
+  }
+
+  FSDataInputStream fileIn = null;
+  try {
+fileIn = fs.open(path);
+byte[] innerBuffer = IOUtils.toByteArray(fileIn);
--- End diff --

@mengxr @yinxusen PS I am happy to take on removal of use of commons-io in 
favor of equivalents in Guava. There is a bit more than this usage, but it's 
easy stuff. Commons IO is fine but it's not necessary to use it here and it is 
one of those dependencies that could collide with other versions in Hadoop. If 
anyone nods I'll open a separate PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38436606
  
Build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38436605
  
 Build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38441593
  
Build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10912844
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java 
---
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.Closeables;
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * An codeorg.apache.hadoop.mapreduce.RecordReader/code for reading 
whole text file out in
--- End diff --

change `code` to `{@link `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10912852
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java 
---
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.Closeables;
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * An codeorg.apache.hadoop.mapreduce.RecordReader/code for reading 
whole text file out in
+ * (filename, content) format. Each element in split is an record of a 
unique, whole file. File name
--- End diff --

(path, content)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10912943
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -120,4 +122,17 @@ object MLUtils {
 }
 sum
   }
+
+  /**
+   * Read a directory of text files from HDFS, a local file system 
(available on all nodes), or any
+   * Hadoop-supported file system URI, and return it as an RDD of 
(filename, content) both in String
--- End diff --

(path, content)

Also need to mention the entire file is read as a single record. See 
`WholeTextInputFormat` for reference.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38518467
  
@yinxusen I think the code looks good to me except a few docs. I'm not sure 
whether `MLUtils` is a good place for `wholeTextFile`. We can pimp 
`SparkContext` and put this method there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread yinxusen
Github user yinxusen commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38519654
  
@mengxr I talked to @liancheng about the placement of WholeTextFiles
interface, we have no idea of whether it is a commonly used interface or
not at that time, so we choose the MLUtils instead of SparkContexts in
cautious.

If the community thinks it is good to put it in SparkContexts, I am glad to
do that. :)

@yinxusen I think the code looks good to me except a few docs. I'm not sure
whether MLUtils is a good place for wholeTextFile. We can pimp SparkContext
and put this method there.

--
Reply to this email directly or view it on GitHub.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38520669
  
@yinxusen I didn't mean putting it under `SparkContext` directly, but 
something like `MLContext`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-24 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38524328
  
Yes, we can put it in MLContext and add an implicit conversion from 
SparkContext to MLContext, so that the API can be added in a non-intrusive 
manner.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10865876
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -17,12 +17,14 @@
 
 package org.apache.spark.mllib.util
 
-import org.apache.spark.SparkContext
-import org.apache.spark.rdd.RDD
-import org.apache.spark.SparkContext._
-
+import org.apache.hadoop.io.Text
--- End diff --

okay


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10865885
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -120,4 +122,21 @@ object MLUtils {
 }
 sum
   }
+
+  /**
+   * Reads a bunch of whole files from HDFS, or a local file system 
(available on all nodes), or any
+   * Hadoop-supported file system URI, and return an RDD[(String, String)].
+   *
+   * @param path The directory you should specified, such as
+   * hdfs://[address]:[port]/[dir]
+   *
+   * @return The first String is a file name, the second one is its 
content.
--- End diff --

Need to improve the doc. See my comment on `WholeTextInputFormat` and the 
doc of `textFile` as a reference.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10865893
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala
 ---
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util
+
+
+import java.io.DataOutputStream
+import java.io.FileOutputStream
+import java.nio.file.Files
+import java.nio.file.Path
+import java.nio.file.Paths
+
+import scala.collection.immutable.IndexedSeq
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.mllib.util.MLUtils._
+import org.apache.spark.SparkContext
+
+/**
+ * Tests the correctness of 
[[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary
+ * directory is created as fake input. Temporal storage would be deleted 
in the end.
+ */
+class WholeTextFileRecordReaderSuite extends FunSuite with 
BeforeAndAfterAll {
+  private var sc: SparkContext = _
+
+  override def beforeAll() {
+sc = new SparkContext(local, test)
+  }
+
+  override def afterAll() {
+sc.stop()
+  }
+
+  private def createNativeFile(inputDir: Path, fileName: String, contents: 
Array[Byte]) = {
+val out = new DataOutputStream(new 
FileOutputStream(s${inputDir.toString}/$fileName))
+out.write(contents, 0, contents.length)
+out.close()
+println(Wrote native file)
--- End diff --

remove println


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10865898
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala
 ---
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util
+
+
+import java.io.DataOutputStream
+import java.io.FileOutputStream
+import java.nio.file.Files
+import java.nio.file.Path
+import java.nio.file.Paths
+
+import scala.collection.immutable.IndexedSeq
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.mllib.util.MLUtils._
+import org.apache.spark.SparkContext
+
+/**
+ * Tests the correctness of 
[[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary
+ * directory is created as fake input. Temporal storage would be deleted 
in the end.
+ */
+class WholeTextFileRecordReaderSuite extends FunSuite with 
BeforeAndAfterAll {
+  private var sc: SparkContext = _
+
+  override def beforeAll() {
+sc = new SparkContext(local, test)
+  }
+
+  override def afterAll() {
+sc.stop()
+  }
+
+  private def createNativeFile(inputDir: Path, fileName: String, contents: 
Array[Byte]) = {
+val out = new DataOutputStream(new 
FileOutputStream(s${inputDir.toString}/$fileName))
+out.write(contents, 0, contents.length)
+out.close()
+println(Wrote native file)
+  }
+
+  /**
+   * This code will test the behaviors of WholeTextFileRecordReader based 
on local disk. There are
+   * three aspects to check:
+   *   1) is all files are read.
+   *   2) is the fileNames are read correctly.
--- End diff --

whether paths are read correctly


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10865907
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala
 ---
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util
+
+
+import java.io.DataOutputStream
+import java.io.FileOutputStream
+import java.nio.file.Files
+import java.nio.file.Path
+import java.nio.file.Paths
+
+import scala.collection.immutable.IndexedSeq
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.mllib.util.MLUtils._
+import org.apache.spark.SparkContext
+
+/**
+ * Tests the correctness of 
[[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary
+ * directory is created as fake input. Temporal storage would be deleted 
in the end.
+ */
+class WholeTextFileRecordReaderSuite extends FunSuite with 
BeforeAndAfterAll {
+  private var sc: SparkContext = _
+
+  override def beforeAll() {
+sc = new SparkContext(local, test)
+  }
+
+  override def afterAll() {
+sc.stop()
+  }
+
+  private def createNativeFile(inputDir: Path, fileName: String, contents: 
Array[Byte]) = {
+val out = new DataOutputStream(new 
FileOutputStream(s${inputDir.toString}/$fileName))
+out.write(contents, 0, contents.length)
+out.close()
+println(Wrote native file)
+  }
+
+  /**
+   * This code will test the behaviors of WholeTextFileRecordReader based 
on local disk. There are
+   * three aspects to check:
+   *   1) is all files are read.
+   *   2) is the fileNames are read correctly.
+   *   3) is the contents must be the same.
+   */
+  test(Correctness of WholeTextFileRecordReader.) {
+
+val dir = Files.createTempDirectory(wholefiles)
+println(snative disk address is ${dir.toString})
+
+WholeTextFileRecordReaderSuite.fileNames
+  .zip(WholeTextFileRecordReaderSuite.filesContents)
+  .foreach { case (fname, contents) =
+createNativeFile(dir, fname, contents)
+}
+
+val res = wholeTextFile(sc, dir.toString).collect()
+
+assert(res.size === WholeTextFileRecordReaderSuite.fileNames.size,
+  Number of files read out do not fit with the actual value)
--- End diff --

put res.size and fileNames.size into error message


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10865913
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala
 ---
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util
+
+
+import java.io.DataOutputStream
+import java.io.FileOutputStream
+import java.nio.file.Files
+import java.nio.file.Path
+import java.nio.file.Paths
+
+import scala.collection.immutable.IndexedSeq
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.mllib.util.MLUtils._
+import org.apache.spark.SparkContext
+
+/**
+ * Tests the correctness of 
[[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary
+ * directory is created as fake input. Temporal storage would be deleted 
in the end.
+ */
+class WholeTextFileRecordReaderSuite extends FunSuite with 
BeforeAndAfterAll {
+  private var sc: SparkContext = _
+
+  override def beforeAll() {
+sc = new SparkContext(local, test)
+  }
+
+  override def afterAll() {
+sc.stop()
+  }
+
+  private def createNativeFile(inputDir: Path, fileName: String, contents: 
Array[Byte]) = {
+val out = new DataOutputStream(new 
FileOutputStream(s${inputDir.toString}/$fileName))
+out.write(contents, 0, contents.length)
+out.close()
+println(Wrote native file)
+  }
+
+  /**
+   * This code will test the behaviors of WholeTextFileRecordReader based 
on local disk. There are
+   * three aspects to check:
+   *   1) is all files are read.
+   *   2) is the fileNames are read correctly.
+   *   3) is the contents must be the same.
+   */
+  test(Correctness of WholeTextFileRecordReader.) {
+
+val dir = Files.createTempDirectory(wholefiles)
+println(snative disk address is ${dir.toString})
+
+WholeTextFileRecordReaderSuite.fileNames
+  .zip(WholeTextFileRecordReaderSuite.filesContents)
+  .foreach { case (fname, contents) =
+createNativeFile(dir, fname, contents)
+}
+
+val res = wholeTextFile(sc, dir.toString).collect()
+
+assert(res.size === WholeTextFileRecordReaderSuite.fileNames.size,
+  Number of files read out do not fit with the actual value)
+
+for ((fname, contents) - res) {
+  val shortName = fname.split('/').last
+  assert(WholeTextFileRecordReaderSuite.fileNames.contains(shortName),
+sMissing file name $fname.)
+  assert(contents.hashCode === 
WholeTextFileRecordReaderSuite.hashCodeOfContents(shortName),
--- End diff --

verify text directly instead of hashcode


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10871469
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java 
---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * The specific InputFormat reads files in HDFS or local disk into pair 
(filename, content) format.
+ * It will be called by HadoopRDD to generate new 
WholeTextFileRecordReader.
--- End diff --

\code\org.apache.hadoop.mapreduce.lib.CombineFileInputFormat\/code\ 
should be used, I think. Java code file could not use [[]].


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38409137
  
Build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38409136
  
 Build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38410992
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13379/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38410991
  
Build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38346619
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38346620
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38347601
  
Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38253432
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10828986
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java 
---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * The specific InputFormat reads files in HDFS or local disk. It will be 
called by
+ * HadoopRDD to generate new WholeTextFileRecordReader.
--- End diff --

Need definition of `key` and `value`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10829033
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java 
---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.Closeables;
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in filename, content format.
+ */
+
+public class WholeTextFileRecordReader extends RecordReaderString, Text {
+  private Path path;
+
+  private String key = null;
+  private Text value = null;
+
+  private boolean processed = false;
+
+  private FileSystem fs;
+
+  public WholeTextFileRecordReader(
+  CombineFileSplit split,
+  TaskAttemptContext context,
+  Integer index)
+throws IOException {
+path = split.getPath(index);
+fs = path.getFileSystem(context.getConfiguration());
+  }
+
+  @Override
+  public void initialize(InputSplit arg0, TaskAttemptContext arg1)
+throws IOException, InterruptedException {
+  }
+
+  @Override
+  public void close() throws IOException {
+  }
+
+  @Override
+  public float getProgress() throws IOException {
+return processed ? 1.0f : 0.0f;
+  }
+
+  @Override
+  public String getCurrentKey() throws IOException, InterruptedException {
+return key;
--- End diff --

fix indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10829067
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java 
---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.Closeables;
+import org.apache.commons.io.IOUtils;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in filename, content format.
+ */
+
+public class WholeTextFileRecordReader extends RecordReaderString, Text {
+  private Path path;
+
+  private String key = null;
+  private Text value = null;
+
+  private boolean processed = false;
+
+  private FileSystem fs;
+
+  public WholeTextFileRecordReader(
--- End diff --

Need doc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10829200
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -120,4 +122,22 @@ object MLUtils {
 }
 sum
   }
+
+  /**
+   * Reads a bunch of whole files from HDFS, or a local file system 
(available on all nodes), or any
+   * Hadoop-supported file system URI, and return an RDD[(String, String)].
+   *
+   * @param path The directory you should specified, such as
+   * hdfs://[address]:[port]/[dir]
+   *
+   * @return RDD[(fileName: String, content: String)]
+   * i.e. the first is the file name of a file, the second one is 
its content.
--- End diff --

do not need to repeat `RDD[(fileName: String, content: String)]`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-21 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10829741
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileSuite.scala ---
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util
+
+
+import java.io.{BufferedReader, DataOutputStream, FileOutputStream, 
InputStreamReader}
+import java.nio.file.Files
+import java.nio.file.{Path = JPath}
+import java.nio.file.{Paths = JPaths}
+
+import scala.collection.immutable.IndexedSeq
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hdfs.MiniDFSCluster
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.mllib.util.MLUtils._
+import org.apache.spark.SparkContext
+
+/**
+ * Tests HDFS IO and local disk IO of [[wholeTextFile]] in MLutils. HDFS 
tests create a mock DFS in
+ * memory, while local disk test create a temp directory. All these 
temporal storages are deleted
+ * in the end.
+ */
+
+class WholeTextFileSuite extends FunSuite with BeforeAndAfterAll {
--- End diff --

Indeed, we only need to write a unit test for `WholeTextFileRecordReader`. 
It is not necessary to verify that  `wholeTextFile` can load files from HDFS. 
We are actually testing `newAPIHadoopFile` here. Maybe we can get some 
suggestions from others.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-21 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10829857
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileSuite.scala ---
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util
+
+
+import java.io.{BufferedReader, DataOutputStream, FileOutputStream, 
InputStreamReader}
+import java.nio.file.Files
+import java.nio.file.{Path = JPath}
+import java.nio.file.{Paths = JPaths}
+
+import scala.collection.immutable.IndexedSeq
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.hdfs.MiniDFSCluster
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.mllib.util.MLUtils._
+import org.apache.spark.SparkContext
+
+/**
+ * Tests HDFS IO and local disk IO of [[wholeTextFile]] in MLutils. HDFS 
tests create a mock DFS in
+ * memory, while local disk test create a temp directory. All these 
temporal storages are deleted
+ * in the end.
+ */
+
+class WholeTextFileSuite extends FunSuite with BeforeAndAfterAll {
--- End diff --

Yep, it will be much easy if we just test `WholeTextFileRecordReader` 
alone. I can construct a `CombineFileSplit` and test it. The correctness of 
reading files from HDFS and native disk is guaranteed by other tests, I think. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38143992
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10785909
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.Closeables;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in bytes format in filename, content format.
+ */
+
+public class BatchFilesRecordReader extends RecordReaderString, Text {
+private long startOffset;
+private int length;
+private Path path;
+
+private String key = null;
+private Text value = null;
+
+private boolean processed = false;
+
+private FileSystem fs;
+
+public BatchFilesRecordReader(
+CombineFileSplit split,
+TaskAttemptContext context,
+Integer index)
+throws IOException {
+path = split.getPath(index);
+startOffset = split.getOffset(index);
--- End diff --

You are reading the entire file, which is not splitable. Should offset 
always be 0?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10785949
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.Closeables;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in bytes format in filename, content format.
+ */
+
+public class BatchFilesRecordReader extends RecordReaderString, Text {
+private long startOffset;
+private int length;
+private Path path;
+
+private String key = null;
+private Text value = null;
+
+private boolean processed = false;
+
+private FileSystem fs;
+
+public BatchFilesRecordReader(
+CombineFileSplit split,
+TaskAttemptContext context,
+Integer index)
+throws IOException {
+path = split.getPath(index);
+startOffset = split.getOffset(index);
+length = (int) split.getLength(index);
+fs = path.getFileSystem(context.getConfiguration());
+}
+
+@Override
+public void initialize(InputSplit arg0, TaskAttemptContext arg1)
+throws IOException, InterruptedException {
+}
+
+@Override
+public void close() throws IOException {
+}
+
+@Override
+public float getProgress() throws IOException {
+return processed ? 1.0f : 0.0f;
--- End diff --

This is not correct because there might be multiple files.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10785965
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.Closeables;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in bytes format in filename, content format.
+ */
+
+public class BatchFilesRecordReader extends RecordReaderString, Text {
+private long startOffset;
+private int length;
+private Path path;
+
+private String key = null;
+private Text value = null;
+
+private boolean processed = false;
+
+private FileSystem fs;
+
+public BatchFilesRecordReader(
+CombineFileSplit split,
+TaskAttemptContext context,
+Integer index)
+throws IOException {
+path = split.getPath(index);
+startOffset = split.getOffset(index);
+length = (int) split.getLength(index);
+fs = path.getFileSystem(context.getConfiguration());
+}
+
+@Override
+public void initialize(InputSplit arg0, TaskAttemptContext arg1)
+throws IOException, InterruptedException {
+}
+
+@Override
+public void close() throws IOException {
+}
+
+@Override
+public float getProgress() throws IOException {
+return processed ? 1.0f : 0.0f;
+}
+
+@Override
+public String getCurrentKey() throws IOException, InterruptedException 
{
+return key;
+}
+
+@Override
+public Text getCurrentValue() throws IOException, InterruptedException{
+return value;
+}
+
+@Override
+public boolean nextKeyValue() throws IOException {
+if (!processed) {
--- End diff --

So it only reads the first file and throws away the rest?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10786005
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.Closeables;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in bytes format in filename, content format.
+ */
+
+public class BatchFilesRecordReader extends RecordReaderString, Text {
+private long startOffset;
+private int length;
+private Path path;
+
+private String key = null;
+private Text value = null;
+
+private boolean processed = false;
+
+private FileSystem fs;
+
+public BatchFilesRecordReader(
+CombineFileSplit split,
+TaskAttemptContext context,
+Integer index)
+throws IOException {
+path = split.getPath(index);
+startOffset = split.getOffset(index);
+length = (int) split.getLength(index);
+fs = path.getFileSystem(context.getConfiguration());
+}
+
+@Override
+public void initialize(InputSplit arg0, TaskAttemptContext arg1)
+throws IOException, InterruptedException {
+}
+
+@Override
+public void close() throws IOException {
+}
+
+@Override
+public float getProgress() throws IOException {
+return processed ? 1.0f : 0.0f;
+}
+
+@Override
+public String getCurrentKey() throws IOException, InterruptedException 
{
+return key;
+}
+
+@Override
+public Text getCurrentValue() throws IOException, InterruptedException{
+return value;
+}
+
+@Override
+public boolean nextKeyValue() throws IOException {
+if (!processed) {
+if (key == null) {
+key = path.getName();
+}
+if (value == null) {
+value = new Text();
+}
+
+FSDataInputStream fileIn = null;
+try {
+fileIn = fs.open(path);
+fileIn.seek(startOffset);
+byte[] innerBuffer = new byte[length];
+IOUtils.readFully(fileIn, innerBuffer, 0, length);
--- End diff --

use `IOUtils.toByteArray(fileIn)` for simplicity.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10786181
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -120,4 +122,22 @@ object MLUtils {
 }
 sum
   }
+
+  /**
+   * Reads a bunch of small files from HDFS, or a local file system 
(available on all nodes), or any
+   * Hadoop-supported file system URI, and return an RDD[(String, String)].
+   *
+   * @param path The directory you should specified, such as
+   * hdfs://[address]:[port]/[dir]
+   *
+   * @return RDD[(fileName: String, content: String)]
+   * i.e. the first is the file name of a file, the second one is 
its content.
+   */
+  def smallTextFiles(sc: SparkContext, path: String): RDD[(String, 
String)] = {
--- End diff --

The current API uses `textFile` to load text files. I suggest using 
`wholeTextFile` to load the entire content of each file. `small` is not a good 
name here because it is not defined.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10786231
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.Closeables;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in bytes format in filename, content format.
+ */
+
+public class BatchFilesRecordReader extends RecordReaderString, Text {
+private long startOffset;
+private int length;
+private Path path;
+
+private String key = null;
+private Text value = null;
+
+private boolean processed = false;
+
+private FileSystem fs;
+
+public BatchFilesRecordReader(
+CombineFileSplit split,
+TaskAttemptContext context,
+Integer index)
+throws IOException {
+path = split.getPath(index);
+startOffset = split.getOffset(index);
+length = (int) split.getLength(index);
+fs = path.getFileSystem(context.getConfiguration());
+}
+
+@Override
+public void initialize(InputSplit arg0, TaskAttemptContext arg1)
+throws IOException, InterruptedException {
+}
+
+@Override
+public void close() throws IOException {
+}
+
+@Override
+public float getProgress() throws IOException {
+return processed ? 1.0f : 0.0f;
+}
+
+@Override
+public String getCurrentKey() throws IOException, InterruptedException 
{
+return key;
+}
+
+@Override
+public Text getCurrentValue() throws IOException, InterruptedException{
+return value;
+}
+
+@Override
+public boolean nextKeyValue() throws IOException {
+if (!processed) {
--- End diff --

No, it doesn't. One instance of `BatchFilesRecordReader` just processes 
only one file. In the constructor of this class, it passes an `Integer index` 
in, which represents the current file in the file sets. Same reason for the 
above question, i.e. the `getProcess()` function..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10786403
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/SmallTextFilesSuite.scala ---
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util
+
+
+import java.io.{InputStreamReader, BufferedReader, DataOutputStream, 
FileOutputStream}
+import java.nio.file.Files
+import java.nio.file.{Path = JPath}
+import java.nio.file.{Paths = JPaths}
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+
+import org.apache.hadoop.hdfs.MiniDFSCluster
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.mllib.util.MLUtils._
+import org.apache.spark.SparkContext
+
+/**
+ * Tests HDFS IO and local disk IO of [[smallTextFiles]] in MLutils. HDFS 
tests create a mock DFS in
+ * memory, while local disk test create a temp directory. All these 
temporal storages are deleted
+ * in the end.
+ */
+
+class SmallTextFilesSuite extends FunSuite with BeforeAndAfterAll {
+  private var sc: SparkContext = _
+  private var dfs: MiniDFSCluster = _
+
+  override def beforeAll() {
+sc = new SparkContext(local, test)
+sc.hadoopConfiguration.set(dfs.datanode.data.dir.perm, 
SmallTextFilesSuite.dirPermission())
+dfs = new MiniDFSCluster(sc.hadoopConfiguration, 4, true,
+ Array(/rack0, /rack0, /rack1, /rack1),
+ Array(host0, host1, host2, host3))
+  }
+
+  override def afterAll() {
+if (dfs != null) dfs.shutdown()
+sc.stop()
+System.clearProperty(spark.driver.port)
+  }
+
+
+  private def createHDFSFile(
+  fs: FileSystem,
+  inputDir: Path,
+  fileName: String,
+  contents: Array[Byte]) = {
+val out: DataOutputStream = fs.create(new Path(inputDir, fileName), 
true, 4096, 2, 512, null)
+out.write(contents, 0, contents.length)
+out.close()
+System.out.println(Wrote HDFS file)
+  }
+
+  /**
+   * This code will test the behaviors on HDFS. There are three aspects to 
test:
+   *1) is all files are read.
+   *2) is the fileNames are read correctly.
+   *3) is the contents must be the same.
+   */
+  test(Small file input || HDFS IO) {
+val fs: FileSystem = dfs.getFileSystem
+val dir = /foo/
+val inputDir: Path = new Path(dir)
+
+
SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach {
+  case (fname, contents) =
+createHDFSFile(fs, inputDir, fname, contents)
+}
+
+println(sname node is 
${dfs.getNameNode.getNameNodeAddress.getHostName})
+println(sname node port is ${dfs.getNameNodePort})
+
+val hdfsAddressDir =
+  
shdfs://${dfs.getNameNode.getNameNodeAddress.getHostName}:${dfs.getNameNodePort}$dir
+println(sHDFS address dir is $hdfsAddressDir)
+
+val res = smallTextFiles(sc, hdfsAddressDir).collect()
+
+assert(res.size === SmallTextFilesSuite.fileNames.size,
+  Number of files read out do not fit with the actual value)
+
+for ((fname, contents) - res) {
+  assert(SmallTextFilesSuite.fileNames.contains(fname),
+sMissing file name $fname.)
+  assert(contents.hashCode === 
SmallTextFilesSuite.hashCodeOfContents(fname),
+sfile $fname contents can not match)
+}
+  }
+
+  private def createNativeFile(inputDir: JPath, fileName: String, 
contents: Array[Byte]) = {
+val out = new DataOutputStream(new 
FileOutputStream(s${inputDir.toString}/$fileName))
+out.write(contents, 0, contents.length)
+out.close()
+println(Wrote native file)
+  }
+
+  /**
+   * This code will test the behaviors on native file 

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10786514
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.input;
+
+import java.io.IOException;
+
+import com.google.common.io.Closeables;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in bytes format in filename, content format.
+ */
+
+public class BatchFilesRecordReader extends RecordReaderString, Text {
+private long startOffset;
+private int length;
+private Path path;
+
+private String key = null;
+private Text value = null;
+
+private boolean processed = false;
+
+private FileSystem fs;
+
+public BatchFilesRecordReader(
+CombineFileSplit split,
+TaskAttemptContext context,
+Integer index)
+throws IOException {
+path = split.getPath(index);
+startOffset = split.getOffset(index);
+length = (int) split.getLength(index);
+fs = path.getFileSystem(context.getConfiguration());
+}
+
+@Override
+public void initialize(InputSplit arg0, TaskAttemptContext arg1)
+throws IOException, InterruptedException {
+}
+
+@Override
+public void close() throws IOException {
+}
+
+@Override
+public float getProgress() throws IOException {
+return processed ? 1.0f : 0.0f;
+}
+
+@Override
+public String getCurrentKey() throws IOException, InterruptedException 
{
+return key;
+}
+
+@Override
+public Text getCurrentValue() throws IOException, InterruptedException{
+return value;
+}
+
+@Override
+public boolean nextKeyValue() throws IOException {
+if (!processed) {
--- End diff --

Okay. I misread the type of BatchFilesRecordReader. Its name is a little 
confusing because it only reads one file per instance. I still recommend using 
`WholeTextFileRecordReader`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38146292
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38146291
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38146374
  
Btw, please fix style issues. We use 2-space indentation in Spark, and 
there are more at the Spark Code Style page.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10786751
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/SmallTextFilesSuite.scala ---
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util
+
+
+import java.io.{InputStreamReader, BufferedReader, DataOutputStream, 
FileOutputStream}
+import java.nio.file.Files
+import java.nio.file.{Path = JPath}
+import java.nio.file.{Paths = JPaths}
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.FunSuite
+
+import org.apache.hadoop.hdfs.MiniDFSCluster
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.io.Text
+
+import org.apache.spark.mllib.util.MLUtils._
+import org.apache.spark.SparkContext
+
+/**
+ * Tests HDFS IO and local disk IO of [[smallTextFiles]] in MLutils. HDFS 
tests create a mock DFS in
+ * memory, while local disk test create a temp directory. All these 
temporal storages are deleted
+ * in the end.
+ */
+
+class SmallTextFilesSuite extends FunSuite with BeforeAndAfterAll {
+  private var sc: SparkContext = _
+  private var dfs: MiniDFSCluster = _
+
+  override def beforeAll() {
+sc = new SparkContext(local, test)
+sc.hadoopConfiguration.set(dfs.datanode.data.dir.perm, 
SmallTextFilesSuite.dirPermission())
+dfs = new MiniDFSCluster(sc.hadoopConfiguration, 4, true,
+ Array(/rack0, /rack0, /rack1, /rack1),
+ Array(host0, host1, host2, host3))
+  }
+
+  override def afterAll() {
+if (dfs != null) dfs.shutdown()
+sc.stop()
+System.clearProperty(spark.driver.port)
+  }
+
+
+  private def createHDFSFile(
+  fs: FileSystem,
+  inputDir: Path,
+  fileName: String,
+  contents: Array[Byte]) = {
+val out: DataOutputStream = fs.create(new Path(inputDir, fileName), 
true, 4096, 2, 512, null)
+out.write(contents, 0, contents.length)
+out.close()
+System.out.println(Wrote HDFS file)
+  }
+
+  /**
+   * This code will test the behaviors on HDFS. There are three aspects to 
test:
+   *1) is all files are read.
+   *2) is the fileNames are read correctly.
+   *3) is the contents must be the same.
+   */
+  test(Small file input || HDFS IO) {
+val fs: FileSystem = dfs.getFileSystem
+val dir = /foo/
+val inputDir: Path = new Path(dir)
+
+
SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach {
+  case (fname, contents) =
+createHDFSFile(fs, inputDir, fname, contents)
+}
+
+println(sname node is 
${dfs.getNameNode.getNameNodeAddress.getHostName})
+println(sname node port is ${dfs.getNameNodePort})
+
+val hdfsAddressDir =
+  
shdfs://${dfs.getNameNode.getNameNodeAddress.getHostName}:${dfs.getNameNodePort}$dir
+println(sHDFS address dir is $hdfsAddressDir)
+
+val res = smallTextFiles(sc, hdfsAddressDir).collect()
+
+assert(res.size === SmallTextFilesSuite.fileNames.size,
+  Number of files read out do not fit with the actual value)
+
+for ((fname, contents) - res) {
+  assert(SmallTextFilesSuite.fileNames.contains(fname),
+sMissing file name $fname.)
+  assert(contents.hashCode === 
SmallTextFilesSuite.hashCodeOfContents(fname),
+sfile $fname contents can not match)
+}
+  }
+
+  private def createNativeFile(inputDir: JPath, fileName: String, 
contents: Array[Byte]) = {
+val out = new DataOutputStream(new 
FileOutputStream(s${inputDir.toString}/$fileName))
+out.write(contents, 0, contents.length)
+out.close()
+println(Wrote native file)
+  }
+
+  /**
+   * This code will test the behaviors on native file 

[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread yinxusen
Github user yinxusen commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38245425
  
@mengxr There are 2 java files in my PR, and another 2 scala files - the 
MLUtils.scala and the test suite. I just find the scala code style in the 
[style 
page](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide).
 Does java code also use 2 space indentation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-20 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38248858
  
Yes. If you're not sure about the right style for something, try to follow 
the style of the existing codebase.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-19 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38025512
  
If I understand the purpose correctly, this PR is for reading small text 
files. But most of the code is to handle the corner case when a file's size is 
greater than 2GB. You mentioned Mahout hit this problem. What was the use case 
there? If someone needs to concat several 2GB bytes buffers to create a single 
Text record, very likely he/she are not doing the right thing, IMHO.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38128543
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-19 Thread yinxusen
Github user yinxusen commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38128624
  
@mengxr Your advise makes sense. I remove the merge process away from 
`smallTextFiles()`, and rewrite the reading logic in `RecoderReader`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38128995
  
Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38128996
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13287/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-19 Thread yinxusen
Github user yinxusen commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-38129529
  
Ah... It seems that Jenkins causes problem. The last two commits test 
failed due to this error:

 Fetching upstream changes from https://github.com/apache/spark.git
 ERROR: Timeout after 10 minutes
 FATAL: Failed to fetch from https://github.com/apache/spark.git

Who can help to have a look? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-18 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10692101
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java ---
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in bytes format in filename, content format.
--- End diff --

`CombineFileInputFormat` is to combine splits but not records. If the file 
is long and you cut it at `MAX_BYTES_ALLOCATION`, how do you guarantee this 
would be a valid cut for UTF-8 text?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-18 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10692570
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java ---
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in bytes format in filename, content format.
--- End diff --

Here is really a mistake that I ignore.

I use `MAX_BYTES_ALLOCATION` here to avoid the scenario that a file length 
is bigger than what an `Int.maxValue` can represent. Mahout implementation just 
ignore the case, they convert `long` directly into `int`. Though the semantic 
of the interface is small file, but we should not limit the input file length 
for end-user.

I just treat it as bytes array which encapsulated in a `Text`, I will join 
the slices of a single file together in `smallTextFiles()` interface. Due to 
the locality of split, I can merge them together without shuffle. But I forget 
it in this version, I'll fix it now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-37910847
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13234/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-37910845
  
Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-18 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-37970513
  
@yinxusen I'm still confused by the implementation. You want to make the 
entire content of a text file as the value returned by smallTextFiles. In the 
BatchFileRecordReader, the buffer size is limited by `MAX_BYTES_ALLOCATION`, 
and you merge the buffers from the same file in smallTextFiles to assemble the 
entire content. Is it necessary to take this approach? How does it differ from 
a simple `CombineWholeTextFileInputFormat`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-37899280
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10690775
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util;
--- End diff --

Instead of `util`, how about `io` or `input`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10690798
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * The specific InputFormat reads files in HDFS or local disk. It will be 
called by
+ * HadoopRDD to generate new BatchFileRecordReader.
+ */
+public class BatchFileInputFormat
--- End diff --

This is basically the `WholeFileInputFormat` for `Text`:


https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java

Shall we call it `WholeTextFileInputFormat`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10690809
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * The specific InputFormat reads files in HDFS or local disk. It will be 
called by
+ * HadoopRDD to generate new BatchFileRecordReader.
+ */
+public class BatchFileInputFormat
+extends CombineFileInputFormatString, Text {
--- End diff --

The indentation you used is not consistent with Spark code style. This line 
may fit the line above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10690831
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * The specific InputFormat reads files in HDFS or local disk. It will be 
called by
+ * HadoopRDD to generate new BatchFileRecordReader.
+ */
+public class BatchFileInputFormat
+extends CombineFileInputFormatString, Text {
+
+@Override
+protected boolean isSplitable(JobContext context, Path file) {
+return false;
+}
+@Override
+public RecordReaderString, Text createRecordReader(
+InputSplit split,
+TaskAttemptContext context) throws IOException {
+return new CombineFileRecordReaderString, Text(
+(CombineFileSplit)split,
--- End diff --

put a space after `)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10690953
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java ---
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in bytes format in filename, content format.
--- End diff --

Are you reading the entire file in bytes?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10691012
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util;
--- End diff --

I think `input` is better, because input of machine learning algorithm is 
usually datasets, batch/streaming files, in different format. Meantime, the 
`output` is usually models or prediction results, which has a different 
semantic meaning.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10691105
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * The specific InputFormat reads files in HDFS or local disk. It will be 
called by
+ * HadoopRDD to generate new BatchFileRecordReader.
+ */
+public class BatchFileInputFormat
+extends CombineFileInputFormatString, Text {
--- End diff --

Sorry for my carelessness. I'll fix it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/164#discussion_r10691283
  
--- Diff: 
mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java ---
@@ -0,0 +1,117 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.util;
+
+import java.io.IOException;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+/**
+ * Reads an entire file out in bytes format in filename, content format.
--- End diff --

In the life cycle of an instance of this class, the answer is yes. You 
know, the `split(index)` here indicates a single file. Because I set 
`isSplitable()` in `BatchFileInputFormat` to false.

The `nextKeyValue()` could be called several times, until not more bytes 
left in the `split(index)`, by `nextKeyValue()` function in 
`CombineFileInputFormat`, and eventually the `compute()` in `HadoopRDD`. It 
will read an entire file in bytes, and I encapsulate these bytes into `Text`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib

2014-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/164#issuecomment-37901339
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13224/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---