[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38653629 @yinxusen It looks good to me now. @rxin , could you please make a final scan and merge it if it looks good to you? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10965746 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.ByteStreams; +import com.google.common.io.Closeables; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * An {@link org.apache.hadoop.mapreduce.RecordReader} for reading a single whole text file out in a + * key-value pair, where the key is the file path and the value is the entire content of the file. + */ +public class WholeTextFileRecordReader extends RecordReaderString, Text { + private Path path; + + private String key = null; + private Text value = null; + + private boolean processed = false; + + private FileSystem fs; + + public WholeTextFileRecordReader( + CombineFileSplit split, + TaskAttemptContext context, + Integer index) +throws IOException { +path = split.getPath(index); +fs = path.getFileSystem(context.getConfiguration()); + } + + @Override + public void initialize(InputSplit arg0, TaskAttemptContext arg1) +throws IOException, InterruptedException { + } + + @Override + public void close() throws IOException { + } + + @Override + public float getProgress() throws IOException { +return processed ? 1.0f : 0.0f; + } + + @Override + public String getCurrentKey() throws IOException, InterruptedException { +return key; + } + + @Override + public Text getCurrentValue() throws IOException, InterruptedException{ --- End diff -- add a space after InterruptedException --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10965749 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.ByteStreams; +import com.google.common.io.Closeables; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * An {@link org.apache.hadoop.mapreduce.RecordReader} for reading a single whole text file out in a + * key-value pair, where the key is the file path and the value is the entire content of the file. + */ +public class WholeTextFileRecordReader extends RecordReaderString, Text { + private Path path; + + private String key = null; + private Text value = null; + + private boolean processed = false; --- End diff -- can u add some inline comment to explain what processed means? You don't want the reader to go through nextKeyValue to find out what it does .. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10965782 --- Diff: project/SparkBuild.scala --- @@ -358,7 +358,7 @@ object SparkBuild extends Build { def mllibSettings = sharedSettings ++ Seq( name := spark-mllib, libraryDependencies ++= Seq( - org.jblas % jblas % 1.2.3 + org.jblas% jblas % 1.2.3 --- End diff -- Let's not change this line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10965793 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala --- @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib + +import org.apache.hadoop.io.Text + +import org.apache.spark.mllib.input.WholeTextFileInputFormat +import org.apache.spark.rdd.RDD +import org.apache.spark.SparkContext +import org.apache.spark.SparkContext._ + +/** + * Extra functions available on SparkContext of mllib through an implicit conversion. Import + * `org.apache.spark.mllib.MLContext._` at the top of your program to use these functions. + */ +class MLContext(self: SparkContext) { + + /** + * Read a directory of text files from HDFS, a local file system (available on all nodes), or any --- End diff -- An example here would be great. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10965949 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala --- @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + +import java.io.DataOutputStream +import java.io.FileOutputStream +import java.nio.file.Files --- End diff -- Does nio.file.Files exist on JDK 6? If not... let's change it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10966132 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more --- End diff -- Never mind, I'd like to do that. Let me try to change them. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38667186 Build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38667187 Build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38671709 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13468/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38671708 Build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10993561 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib + +import org.apache.spark.mllib.input.WholeTextFileInputFormat +import org.apache.spark.rdd.RDD +import org.apache.spark.SparkContext + +/** + * Extra functions available on SparkContext of mllib through an implicit conversion. Import + * `org.apache.spark.mllib.MLContext._` at the top of your program to use these functions. + */ +class MLContext(self: SparkContext) { + + /** + * Read a directory of text files from HDFS, a local file system (available on all nodes), or any + * Hadoop-supported file system URI. For example, + * {{{ + * hdfs://a-hdfs-path/part-0 --- End diff -- Try this ... ``` /* If you have the following files: hdfs://a-hdfs-path/part-0 hdfs://a-hdfs-path/part-1 ... hdfs://a-hdfs-path/part-n */ val rdd = mlContext.wholeTextFile(hdfs://a-hdfs-path) // RDD contains // explain what the content looks like ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10993628 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/input/WholeTextFileRecordReader.scala --- @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input + +import com.google.common.io.{ByteStreams, Closeables} + +import org.apache.hadoop.io.Text +import org.apache.hadoop.mapreduce.InputSplit +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit +import org.apache.hadoop.mapreduce.RecordReader +import org.apache.hadoop.mapreduce.TaskAttemptContext + +/** + * A [[org.apache.hadoop.mapreduce.RecordReader RecordReader]] for reading a single whole text file + * out in a key-value pair, where the key is the file path and the value is the entire content of + * the file. + */ +private[mllib] class WholeTextFileRecordReader( +split: CombineFileSplit, +context: TaskAttemptContext, +index: Integer) + extends RecordReader[String, String] { + + private val path = split.getPath(index) + private val fs = path.getFileSystem(context.getConfiguration) + + // True means the current file has been processed, then skip it. + private var processed = false + + private val key = path.toString + private var value: String = null + + override def initialize(split: InputSplit, context: TaskAttemptContext) = {} + + override def close() = {} + + override def getProgress = if(processed) 1.0f else 0.0f + + override def getCurrentKey = key + + override def getCurrentValue = value + + override def nextKeyValue = { +if(!processed) { --- End diff -- add a space after if --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38727263 lgtm other than the couple minor comments --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38756255 Build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38756257 Build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38761355 Somehow this is no longer mergeable. Do you mind bring it up to date? Everything looks good. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38761534 Sure, let me update it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38765336 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38765347 Oh... Is that OK? That's strange... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38769629 Yea - perhaps apply it to trunk. Feel free to create another PR if needed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38769753 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13493/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen closed the pull request at: https://github.com/apache/spark/pull/164 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38562620 Build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38562621 Build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38568166 Build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38568167 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13432/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10954776 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala --- @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib + +import org.apache.hadoop.io.Text + +import org.apache.spark.Logging +import org.apache.spark.mllib.input.WholeTextFileInputFormat +import org.apache.spark.rdd.RDD +import org.apache.spark.SparkContext +import org.apache.spark.SparkContext._ + +/** + * Extra functions available on SparkContext of mllib through an implicit conversion. Import + * `org.apache.spark.mllib.MLContext._` at the top of your program to use these functions. + */ +class MLContext(self: SparkContext) extends Logging { + + /** + * Read a directory of text files from HDFS, a local file system (available on all nodes), or any + * Hadoop-supported file system URI. Each file is read as a single record and returned in a + * key-value pair, where the key is the path of each file, and the value is the content of each + * file. + */ + def wholeTextFile(path: String): RDD[(String, String)] = { +self.newAPIHadoopFile( + path, + classOf[WholeTextFileInputFormat], + classOf[String], + classOf[Text]).mapValues(_.toString) + } +} + + +/** + * The MLContext object contains a number of implicit conversions and parameters for use with + * various mllib features. + */ +object MLContext extends Logging { + implicit def mlContextToSparkContext(sc: SparkContext) = new MLContext(sc) --- End diff -- sparkContextToMLContext --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10955063 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/MLContext.scala --- @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib + +import org.apache.hadoop.io.Text + +import org.apache.spark.Logging +import org.apache.spark.mllib.input.WholeTextFileInputFormat +import org.apache.spark.rdd.RDD +import org.apache.spark.SparkContext +import org.apache.spark.SparkContext._ + +/** + * Extra functions available on SparkContext of mllib through an implicit conversion. Import + * `org.apache.spark.mllib.MLContext._` at the top of your program to use these functions. + */ +class MLContext(self: SparkContext) extends Logging { --- End diff -- Logging is not used. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38643350 Build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10873319 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.JobContext; +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat; +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * The specific InputFormat reads files in HDFS or local disk into pair (filename, content) format. + * It will be called by HadoopRDD to generate new WholeTextFileRecordReader. --- End diff -- Sorry I forgot this is Java. Then you should use `@link` instead of `[[]]`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10877199 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java --- @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.commons.io.IOUtils; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * An codeorg.apache.hadoop.mapreduce.RecordReader/code for reading whole text file out in + * (filename, content) format. Each element in split is an record of a unique, whole file. File name + * is the full path name for easy deduplicate. + */ +public class WholeTextFileRecordReader extends RecordReaderString, Text { + private Path path; + + private String key = null; + private Text value = null; + + private boolean processed = false; + + private FileSystem fs; + + public WholeTextFileRecordReader( + CombineFileSplit split, + TaskAttemptContext context, + Integer index) +throws IOException { +path = split.getPath(index); +fs = path.getFileSystem(context.getConfiguration()); + } + + @Override + public void initialize(InputSplit arg0, TaskAttemptContext arg1) +throws IOException, InterruptedException { + } + + @Override + public void close() throws IOException { + } + + @Override + public float getProgress() throws IOException { +return processed ? 1.0f : 0.0f; + } + + @Override + public String getCurrentKey() throws IOException, InterruptedException { +return key; + } + + @Override + public Text getCurrentValue() throws IOException, InterruptedException{ +return value; + } + + @Override + public boolean nextKeyValue() throws IOException { +if (!processed) { + if (key == null) { +key = path.toString(); + } + if (value == null) { +value = new Text(); + } + + FSDataInputStream fileIn = null; + try { +fileIn = fs.open(path); +byte[] innerBuffer = IOUtils.toByteArray(fileIn); --- End diff -- @mengxr @yinxusen PS I am happy to take on removal of use of commons-io in favor of equivalents in Guava. There is a bit more than this usage, but it's easy stuff. Commons IO is fine but it's not necessary to use it here and it is one of those dependencies that could collide with other versions in Hadoop. If anyone nods I'll open a separate PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38436606 Build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38436605 Build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38441593 Build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10912844 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java --- @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.commons.io.IOUtils; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * An codeorg.apache.hadoop.mapreduce.RecordReader/code for reading whole text file out in --- End diff -- change `code` to `{@link ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10912852 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java --- @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.commons.io.IOUtils; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * An codeorg.apache.hadoop.mapreduce.RecordReader/code for reading whole text file out in + * (filename, content) format. Each element in split is an record of a unique, whole file. File name --- End diff -- (path, content) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10912943 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala --- @@ -120,4 +122,17 @@ object MLUtils { } sum } + + /** + * Read a directory of text files from HDFS, a local file system (available on all nodes), or any + * Hadoop-supported file system URI, and return it as an RDD of (filename, content) both in String --- End diff -- (path, content) Also need to mention the entire file is read as a single record. See `WholeTextInputFormat` for reference. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38518467 @yinxusen I think the code looks good to me except a few docs. I'm not sure whether `MLUtils` is a good place for `wholeTextFile`. We can pimp `SparkContext` and put this method there. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38519654 @mengxr I talked to @liancheng about the placement of WholeTextFiles interface, we have no idea of whether it is a commonly used interface or not at that time, so we choose the MLUtils instead of SparkContexts in cautious. If the community thinks it is good to put it in SparkContexts, I am glad to do that. :) @yinxusen I think the code looks good to me except a few docs. I'm not sure whether MLUtils is a good place for wholeTextFile. We can pimp SparkContext and put this method there. -- Reply to this email directly or view it on GitHub. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38520669 @yinxusen I didn't mean putting it under `SparkContext` directly, but something like `MLContext`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38524328 Yes, we can put it in MLContext and add an implicit conversion from SparkContext to MLContext, so that the API can be added in a non-intrusive manner. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10865876 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala --- @@ -17,12 +17,14 @@ package org.apache.spark.mllib.util -import org.apache.spark.SparkContext -import org.apache.spark.rdd.RDD -import org.apache.spark.SparkContext._ - +import org.apache.hadoop.io.Text --- End diff -- okay --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10865885 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala --- @@ -120,4 +122,21 @@ object MLUtils { } sum } + + /** + * Reads a bunch of whole files from HDFS, or a local file system (available on all nodes), or any + * Hadoop-supported file system URI, and return an RDD[(String, String)]. + * + * @param path The directory you should specified, such as + * hdfs://[address]:[port]/[dir] + * + * @return The first String is a file name, the second one is its content. --- End diff -- Need to improve the doc. See my comment on `WholeTextInputFormat` and the doc of `textFile` as a reference. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10865893 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala --- @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + + +import java.io.DataOutputStream +import java.io.FileOutputStream +import java.nio.file.Files +import java.nio.file.Path +import java.nio.file.Paths + +import scala.collection.immutable.IndexedSeq + +import org.scalatest.BeforeAndAfterAll +import org.scalatest.FunSuite + +import org.apache.hadoop.io.Text + +import org.apache.spark.mllib.util.MLUtils._ +import org.apache.spark.SparkContext + +/** + * Tests the correctness of [[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary + * directory is created as fake input. Temporal storage would be deleted in the end. + */ +class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll { + private var sc: SparkContext = _ + + override def beforeAll() { +sc = new SparkContext(local, test) + } + + override def afterAll() { +sc.stop() + } + + private def createNativeFile(inputDir: Path, fileName: String, contents: Array[Byte]) = { +val out = new DataOutputStream(new FileOutputStream(s${inputDir.toString}/$fileName)) +out.write(contents, 0, contents.length) +out.close() +println(Wrote native file) --- End diff -- remove println --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10865898 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala --- @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + + +import java.io.DataOutputStream +import java.io.FileOutputStream +import java.nio.file.Files +import java.nio.file.Path +import java.nio.file.Paths + +import scala.collection.immutable.IndexedSeq + +import org.scalatest.BeforeAndAfterAll +import org.scalatest.FunSuite + +import org.apache.hadoop.io.Text + +import org.apache.spark.mllib.util.MLUtils._ +import org.apache.spark.SparkContext + +/** + * Tests the correctness of [[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary + * directory is created as fake input. Temporal storage would be deleted in the end. + */ +class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll { + private var sc: SparkContext = _ + + override def beforeAll() { +sc = new SparkContext(local, test) + } + + override def afterAll() { +sc.stop() + } + + private def createNativeFile(inputDir: Path, fileName: String, contents: Array[Byte]) = { +val out = new DataOutputStream(new FileOutputStream(s${inputDir.toString}/$fileName)) +out.write(contents, 0, contents.length) +out.close() +println(Wrote native file) + } + + /** + * This code will test the behaviors of WholeTextFileRecordReader based on local disk. There are + * three aspects to check: + * 1) is all files are read. + * 2) is the fileNames are read correctly. --- End diff -- whether paths are read correctly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10865907 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala --- @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + + +import java.io.DataOutputStream +import java.io.FileOutputStream +import java.nio.file.Files +import java.nio.file.Path +import java.nio.file.Paths + +import scala.collection.immutable.IndexedSeq + +import org.scalatest.BeforeAndAfterAll +import org.scalatest.FunSuite + +import org.apache.hadoop.io.Text + +import org.apache.spark.mllib.util.MLUtils._ +import org.apache.spark.SparkContext + +/** + * Tests the correctness of [[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary + * directory is created as fake input. Temporal storage would be deleted in the end. + */ +class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll { + private var sc: SparkContext = _ + + override def beforeAll() { +sc = new SparkContext(local, test) + } + + override def afterAll() { +sc.stop() + } + + private def createNativeFile(inputDir: Path, fileName: String, contents: Array[Byte]) = { +val out = new DataOutputStream(new FileOutputStream(s${inputDir.toString}/$fileName)) +out.write(contents, 0, contents.length) +out.close() +println(Wrote native file) + } + + /** + * This code will test the behaviors of WholeTextFileRecordReader based on local disk. There are + * three aspects to check: + * 1) is all files are read. + * 2) is the fileNames are read correctly. + * 3) is the contents must be the same. + */ + test(Correctness of WholeTextFileRecordReader.) { + +val dir = Files.createTempDirectory(wholefiles) +println(snative disk address is ${dir.toString}) + +WholeTextFileRecordReaderSuite.fileNames + .zip(WholeTextFileRecordReaderSuite.filesContents) + .foreach { case (fname, contents) = +createNativeFile(dir, fname, contents) +} + +val res = wholeTextFile(sc, dir.toString).collect() + +assert(res.size === WholeTextFileRecordReaderSuite.fileNames.size, + Number of files read out do not fit with the actual value) --- End diff -- put res.size and fileNames.size into error message --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10865913 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileRecordReaderSuite.scala --- @@ -0,0 +1,114 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + + +import java.io.DataOutputStream +import java.io.FileOutputStream +import java.nio.file.Files +import java.nio.file.Path +import java.nio.file.Paths + +import scala.collection.immutable.IndexedSeq + +import org.scalatest.BeforeAndAfterAll +import org.scalatest.FunSuite + +import org.apache.hadoop.io.Text + +import org.apache.spark.mllib.util.MLUtils._ +import org.apache.spark.SparkContext + +/** + * Tests the correctness of [[org.apache.spark.mllib.input.WholeTextFileRecordReader]]. A temporary + * directory is created as fake input. Temporal storage would be deleted in the end. + */ +class WholeTextFileRecordReaderSuite extends FunSuite with BeforeAndAfterAll { + private var sc: SparkContext = _ + + override def beforeAll() { +sc = new SparkContext(local, test) + } + + override def afterAll() { +sc.stop() + } + + private def createNativeFile(inputDir: Path, fileName: String, contents: Array[Byte]) = { +val out = new DataOutputStream(new FileOutputStream(s${inputDir.toString}/$fileName)) +out.write(contents, 0, contents.length) +out.close() +println(Wrote native file) + } + + /** + * This code will test the behaviors of WholeTextFileRecordReader based on local disk. There are + * three aspects to check: + * 1) is all files are read. + * 2) is the fileNames are read correctly. + * 3) is the contents must be the same. + */ + test(Correctness of WholeTextFileRecordReader.) { + +val dir = Files.createTempDirectory(wholefiles) +println(snative disk address is ${dir.toString}) + +WholeTextFileRecordReaderSuite.fileNames + .zip(WholeTextFileRecordReaderSuite.filesContents) + .foreach { case (fname, contents) = +createNativeFile(dir, fname, contents) +} + +val res = wholeTextFile(sc, dir.toString).collect() + +assert(res.size === WholeTextFileRecordReaderSuite.fileNames.size, + Number of files read out do not fit with the actual value) + +for ((fname, contents) - res) { + val shortName = fname.split('/').last + assert(WholeTextFileRecordReaderSuite.fileNames.contains(shortName), +sMissing file name $fname.) + assert(contents.hashCode === WholeTextFileRecordReaderSuite.hashCodeOfContents(shortName), --- End diff -- verify text directly instead of hashcode --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10871469 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java --- @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.JobContext; +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat; +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * The specific InputFormat reads files in HDFS or local disk into pair (filename, content) format. + * It will be called by HadoopRDD to generate new WholeTextFileRecordReader. --- End diff -- \code\org.apache.hadoop.mapreduce.lib.CombineFileInputFormat\/code\ should be used, I think. Java code file could not use [[]]. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38409137 Build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38409136 Build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38410992 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13379/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38410991 Build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38346619 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38346620 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38347601 Merged build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38253432 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10828986 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileInputFormat.java --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.JobContext; +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat; +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * The specific InputFormat reads files in HDFS or local disk. It will be called by + * HadoopRDD to generate new WholeTextFileRecordReader. --- End diff -- Need definition of `key` and `value`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10829033 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.commons.io.IOUtils; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in filename, content format. + */ + +public class WholeTextFileRecordReader extends RecordReaderString, Text { + private Path path; + + private String key = null; + private Text value = null; + + private boolean processed = false; + + private FileSystem fs; + + public WholeTextFileRecordReader( + CombineFileSplit split, + TaskAttemptContext context, + Integer index) +throws IOException { +path = split.getPath(index); +fs = path.getFileSystem(context.getConfiguration()); + } + + @Override + public void initialize(InputSplit arg0, TaskAttemptContext arg1) +throws IOException, InterruptedException { + } + + @Override + public void close() throws IOException { + } + + @Override + public float getProgress() throws IOException { +return processed ? 1.0f : 0.0f; + } + + @Override + public String getCurrentKey() throws IOException, InterruptedException { +return key; --- End diff -- fix indentation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10829067 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/WholeTextFileRecordReader.java --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.commons.io.IOUtils; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in filename, content format. + */ + +public class WholeTextFileRecordReader extends RecordReaderString, Text { + private Path path; + + private String key = null; + private Text value = null; + + private boolean processed = false; + + private FileSystem fs; + + public WholeTextFileRecordReader( --- End diff -- Need doc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10829200 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala --- @@ -120,4 +122,22 @@ object MLUtils { } sum } + + /** + * Reads a bunch of whole files from HDFS, or a local file system (available on all nodes), or any + * Hadoop-supported file system URI, and return an RDD[(String, String)]. + * + * @param path The directory you should specified, such as + * hdfs://[address]:[port]/[dir] + * + * @return RDD[(fileName: String, content: String)] + * i.e. the first is the file name of a file, the second one is its content. --- End diff -- do not need to repeat `RDD[(fileName: String, content: String)]` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10829741 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileSuite.scala --- @@ -0,0 +1,218 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + + +import java.io.{BufferedReader, DataOutputStream, FileOutputStream, InputStreamReader} +import java.nio.file.Files +import java.nio.file.{Path = JPath} +import java.nio.file.{Paths = JPaths} + +import scala.collection.immutable.IndexedSeq + +import org.scalatest.BeforeAndAfterAll +import org.scalatest.FunSuite + +import org.apache.hadoop.fs.FileSystem +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hdfs.MiniDFSCluster +import org.apache.hadoop.io.Text + +import org.apache.spark.mllib.util.MLUtils._ +import org.apache.spark.SparkContext + +/** + * Tests HDFS IO and local disk IO of [[wholeTextFile]] in MLutils. HDFS tests create a mock DFS in + * memory, while local disk test create a temp directory. All these temporal storages are deleted + * in the end. + */ + +class WholeTextFileSuite extends FunSuite with BeforeAndAfterAll { --- End diff -- Indeed, we only need to write a unit test for `WholeTextFileRecordReader`. It is not necessary to verify that `wholeTextFile` can load files from HDFS. We are actually testing `newAPIHadoopFile` here. Maybe we can get some suggestions from others. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10829857 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/WholeTextFileSuite.scala --- @@ -0,0 +1,218 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + + +import java.io.{BufferedReader, DataOutputStream, FileOutputStream, InputStreamReader} +import java.nio.file.Files +import java.nio.file.{Path = JPath} +import java.nio.file.{Paths = JPaths} + +import scala.collection.immutable.IndexedSeq + +import org.scalatest.BeforeAndAfterAll +import org.scalatest.FunSuite + +import org.apache.hadoop.fs.FileSystem +import org.apache.hadoop.fs.Path +import org.apache.hadoop.hdfs.MiniDFSCluster +import org.apache.hadoop.io.Text + +import org.apache.spark.mllib.util.MLUtils._ +import org.apache.spark.SparkContext + +/** + * Tests HDFS IO and local disk IO of [[wholeTextFile]] in MLutils. HDFS tests create a mock DFS in + * memory, while local disk test create a temp directory. All these temporal storages are deleted + * in the end. + */ + +class WholeTextFileSuite extends FunSuite with BeforeAndAfterAll { --- End diff -- Yep, it will be much easy if we just test `WholeTextFileRecordReader` alone. I can construct a `CombineFileSplit` and test it. The correctness of reading files from HDFS and native disk is guaranteed by other tests, I think. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38143992 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10785909 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.IOUtils; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in bytes format in filename, content format. + */ + +public class BatchFilesRecordReader extends RecordReaderString, Text { +private long startOffset; +private int length; +private Path path; + +private String key = null; +private Text value = null; + +private boolean processed = false; + +private FileSystem fs; + +public BatchFilesRecordReader( +CombineFileSplit split, +TaskAttemptContext context, +Integer index) +throws IOException { +path = split.getPath(index); +startOffset = split.getOffset(index); --- End diff -- You are reading the entire file, which is not splitable. Should offset always be 0? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10785949 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.IOUtils; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in bytes format in filename, content format. + */ + +public class BatchFilesRecordReader extends RecordReaderString, Text { +private long startOffset; +private int length; +private Path path; + +private String key = null; +private Text value = null; + +private boolean processed = false; + +private FileSystem fs; + +public BatchFilesRecordReader( +CombineFileSplit split, +TaskAttemptContext context, +Integer index) +throws IOException { +path = split.getPath(index); +startOffset = split.getOffset(index); +length = (int) split.getLength(index); +fs = path.getFileSystem(context.getConfiguration()); +} + +@Override +public void initialize(InputSplit arg0, TaskAttemptContext arg1) +throws IOException, InterruptedException { +} + +@Override +public void close() throws IOException { +} + +@Override +public float getProgress() throws IOException { +return processed ? 1.0f : 0.0f; --- End diff -- This is not correct because there might be multiple files. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10785965 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.IOUtils; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in bytes format in filename, content format. + */ + +public class BatchFilesRecordReader extends RecordReaderString, Text { +private long startOffset; +private int length; +private Path path; + +private String key = null; +private Text value = null; + +private boolean processed = false; + +private FileSystem fs; + +public BatchFilesRecordReader( +CombineFileSplit split, +TaskAttemptContext context, +Integer index) +throws IOException { +path = split.getPath(index); +startOffset = split.getOffset(index); +length = (int) split.getLength(index); +fs = path.getFileSystem(context.getConfiguration()); +} + +@Override +public void initialize(InputSplit arg0, TaskAttemptContext arg1) +throws IOException, InterruptedException { +} + +@Override +public void close() throws IOException { +} + +@Override +public float getProgress() throws IOException { +return processed ? 1.0f : 0.0f; +} + +@Override +public String getCurrentKey() throws IOException, InterruptedException { +return key; +} + +@Override +public Text getCurrentValue() throws IOException, InterruptedException{ +return value; +} + +@Override +public boolean nextKeyValue() throws IOException { +if (!processed) { --- End diff -- So it only reads the first file and throws away the rest? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10786005 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.IOUtils; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in bytes format in filename, content format. + */ + +public class BatchFilesRecordReader extends RecordReaderString, Text { +private long startOffset; +private int length; +private Path path; + +private String key = null; +private Text value = null; + +private boolean processed = false; + +private FileSystem fs; + +public BatchFilesRecordReader( +CombineFileSplit split, +TaskAttemptContext context, +Integer index) +throws IOException { +path = split.getPath(index); +startOffset = split.getOffset(index); +length = (int) split.getLength(index); +fs = path.getFileSystem(context.getConfiguration()); +} + +@Override +public void initialize(InputSplit arg0, TaskAttemptContext arg1) +throws IOException, InterruptedException { +} + +@Override +public void close() throws IOException { +} + +@Override +public float getProgress() throws IOException { +return processed ? 1.0f : 0.0f; +} + +@Override +public String getCurrentKey() throws IOException, InterruptedException { +return key; +} + +@Override +public Text getCurrentValue() throws IOException, InterruptedException{ +return value; +} + +@Override +public boolean nextKeyValue() throws IOException { +if (!processed) { +if (key == null) { +key = path.getName(); +} +if (value == null) { +value = new Text(); +} + +FSDataInputStream fileIn = null; +try { +fileIn = fs.open(path); +fileIn.seek(startOffset); +byte[] innerBuffer = new byte[length]; +IOUtils.readFully(fileIn, innerBuffer, 0, length); --- End diff -- use `IOUtils.toByteArray(fileIn)` for simplicity. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10786181 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala --- @@ -120,4 +122,22 @@ object MLUtils { } sum } + + /** + * Reads a bunch of small files from HDFS, or a local file system (available on all nodes), or any + * Hadoop-supported file system URI, and return an RDD[(String, String)]. + * + * @param path The directory you should specified, such as + * hdfs://[address]:[port]/[dir] + * + * @return RDD[(fileName: String, content: String)] + * i.e. the first is the file name of a file, the second one is its content. + */ + def smallTextFiles(sc: SparkContext, path: String): RDD[(String, String)] = { --- End diff -- The current API uses `textFile` to load text files. I suggest using `wholeTextFile` to load the entire content of each file. `small` is not a good name here because it is not defined. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10786231 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.IOUtils; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in bytes format in filename, content format. + */ + +public class BatchFilesRecordReader extends RecordReaderString, Text { +private long startOffset; +private int length; +private Path path; + +private String key = null; +private Text value = null; + +private boolean processed = false; + +private FileSystem fs; + +public BatchFilesRecordReader( +CombineFileSplit split, +TaskAttemptContext context, +Integer index) +throws IOException { +path = split.getPath(index); +startOffset = split.getOffset(index); +length = (int) split.getLength(index); +fs = path.getFileSystem(context.getConfiguration()); +} + +@Override +public void initialize(InputSplit arg0, TaskAttemptContext arg1) +throws IOException, InterruptedException { +} + +@Override +public void close() throws IOException { +} + +@Override +public float getProgress() throws IOException { +return processed ? 1.0f : 0.0f; +} + +@Override +public String getCurrentKey() throws IOException, InterruptedException { +return key; +} + +@Override +public Text getCurrentValue() throws IOException, InterruptedException{ +return value; +} + +@Override +public boolean nextKeyValue() throws IOException { +if (!processed) { --- End diff -- No, it doesn't. One instance of `BatchFilesRecordReader` just processes only one file. In the constructor of this class, it passes an `Integer index` in, which represents the current file in the file sets. Same reason for the above question, i.e. the `getProcess()` function.. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10786403 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/SmallTextFilesSuite.scala --- @@ -0,0 +1,218 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + + +import java.io.{InputStreamReader, BufferedReader, DataOutputStream, FileOutputStream} +import java.nio.file.Files +import java.nio.file.{Path = JPath} +import java.nio.file.{Paths = JPaths} + +import org.scalatest.BeforeAndAfterAll +import org.scalatest.FunSuite + +import org.apache.hadoop.hdfs.MiniDFSCluster +import org.apache.hadoop.fs.FileSystem +import org.apache.hadoop.fs.Path +import org.apache.hadoop.io.Text + +import org.apache.spark.mllib.util.MLUtils._ +import org.apache.spark.SparkContext + +/** + * Tests HDFS IO and local disk IO of [[smallTextFiles]] in MLutils. HDFS tests create a mock DFS in + * memory, while local disk test create a temp directory. All these temporal storages are deleted + * in the end. + */ + +class SmallTextFilesSuite extends FunSuite with BeforeAndAfterAll { + private var sc: SparkContext = _ + private var dfs: MiniDFSCluster = _ + + override def beforeAll() { +sc = new SparkContext(local, test) +sc.hadoopConfiguration.set(dfs.datanode.data.dir.perm, SmallTextFilesSuite.dirPermission()) +dfs = new MiniDFSCluster(sc.hadoopConfiguration, 4, true, + Array(/rack0, /rack0, /rack1, /rack1), + Array(host0, host1, host2, host3)) + } + + override def afterAll() { +if (dfs != null) dfs.shutdown() +sc.stop() +System.clearProperty(spark.driver.port) + } + + + private def createHDFSFile( + fs: FileSystem, + inputDir: Path, + fileName: String, + contents: Array[Byte]) = { +val out: DataOutputStream = fs.create(new Path(inputDir, fileName), true, 4096, 2, 512, null) +out.write(contents, 0, contents.length) +out.close() +System.out.println(Wrote HDFS file) + } + + /** + * This code will test the behaviors on HDFS. There are three aspects to test: + *1) is all files are read. + *2) is the fileNames are read correctly. + *3) is the contents must be the same. + */ + test(Small file input || HDFS IO) { +val fs: FileSystem = dfs.getFileSystem +val dir = /foo/ +val inputDir: Path = new Path(dir) + + SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach { + case (fname, contents) = +createHDFSFile(fs, inputDir, fname, contents) +} + +println(sname node is ${dfs.getNameNode.getNameNodeAddress.getHostName}) +println(sname node port is ${dfs.getNameNodePort}) + +val hdfsAddressDir = + shdfs://${dfs.getNameNode.getNameNodeAddress.getHostName}:${dfs.getNameNodePort}$dir +println(sHDFS address dir is $hdfsAddressDir) + +val res = smallTextFiles(sc, hdfsAddressDir).collect() + +assert(res.size === SmallTextFilesSuite.fileNames.size, + Number of files read out do not fit with the actual value) + +for ((fname, contents) - res) { + assert(SmallTextFilesSuite.fileNames.contains(fname), +sMissing file name $fname.) + assert(contents.hashCode === SmallTextFilesSuite.hashCodeOfContents(fname), +sfile $fname contents can not match) +} + } + + private def createNativeFile(inputDir: JPath, fileName: String, contents: Array[Byte]) = { +val out = new DataOutputStream(new FileOutputStream(s${inputDir.toString}/$fileName)) +out.write(contents, 0, contents.length) +out.close() +println(Wrote native file) + } + + /** + * This code will test the behaviors on native file
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10786514 --- Diff: mllib/src/main/java/org/apache/spark/mllib/input/BatchFilesRecordReader.java --- @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.input; + +import java.io.IOException; + +import com.google.common.io.Closeables; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.IOUtils; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in bytes format in filename, content format. + */ + +public class BatchFilesRecordReader extends RecordReaderString, Text { +private long startOffset; +private int length; +private Path path; + +private String key = null; +private Text value = null; + +private boolean processed = false; + +private FileSystem fs; + +public BatchFilesRecordReader( +CombineFileSplit split, +TaskAttemptContext context, +Integer index) +throws IOException { +path = split.getPath(index); +startOffset = split.getOffset(index); +length = (int) split.getLength(index); +fs = path.getFileSystem(context.getConfiguration()); +} + +@Override +public void initialize(InputSplit arg0, TaskAttemptContext arg1) +throws IOException, InterruptedException { +} + +@Override +public void close() throws IOException { +} + +@Override +public float getProgress() throws IOException { +return processed ? 1.0f : 0.0f; +} + +@Override +public String getCurrentKey() throws IOException, InterruptedException { +return key; +} + +@Override +public Text getCurrentValue() throws IOException, InterruptedException{ +return value; +} + +@Override +public boolean nextKeyValue() throws IOException { +if (!processed) { --- End diff -- Okay. I misread the type of BatchFilesRecordReader. Its name is a little confusing because it only reads one file per instance. I still recommend using `WholeTextFileRecordReader`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38146292 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38146291 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38146374 Btw, please fix style issues. We use 2-space indentation in Spark, and there are more at the Spark Code Style page. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10786751 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/SmallTextFilesSuite.scala --- @@ -0,0 +1,218 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util + + +import java.io.{InputStreamReader, BufferedReader, DataOutputStream, FileOutputStream} +import java.nio.file.Files +import java.nio.file.{Path = JPath} +import java.nio.file.{Paths = JPaths} + +import org.scalatest.BeforeAndAfterAll +import org.scalatest.FunSuite + +import org.apache.hadoop.hdfs.MiniDFSCluster +import org.apache.hadoop.fs.FileSystem +import org.apache.hadoop.fs.Path +import org.apache.hadoop.io.Text + +import org.apache.spark.mllib.util.MLUtils._ +import org.apache.spark.SparkContext + +/** + * Tests HDFS IO and local disk IO of [[smallTextFiles]] in MLutils. HDFS tests create a mock DFS in + * memory, while local disk test create a temp directory. All these temporal storages are deleted + * in the end. + */ + +class SmallTextFilesSuite extends FunSuite with BeforeAndAfterAll { + private var sc: SparkContext = _ + private var dfs: MiniDFSCluster = _ + + override def beforeAll() { +sc = new SparkContext(local, test) +sc.hadoopConfiguration.set(dfs.datanode.data.dir.perm, SmallTextFilesSuite.dirPermission()) +dfs = new MiniDFSCluster(sc.hadoopConfiguration, 4, true, + Array(/rack0, /rack0, /rack1, /rack1), + Array(host0, host1, host2, host3)) + } + + override def afterAll() { +if (dfs != null) dfs.shutdown() +sc.stop() +System.clearProperty(spark.driver.port) + } + + + private def createHDFSFile( + fs: FileSystem, + inputDir: Path, + fileName: String, + contents: Array[Byte]) = { +val out: DataOutputStream = fs.create(new Path(inputDir, fileName), true, 4096, 2, 512, null) +out.write(contents, 0, contents.length) +out.close() +System.out.println(Wrote HDFS file) + } + + /** + * This code will test the behaviors on HDFS. There are three aspects to test: + *1) is all files are read. + *2) is the fileNames are read correctly. + *3) is the contents must be the same. + */ + test(Small file input || HDFS IO) { +val fs: FileSystem = dfs.getFileSystem +val dir = /foo/ +val inputDir: Path = new Path(dir) + + SmallTextFilesSuite.fileNames.zip(SmallTextFilesSuite.filesContents).foreach { + case (fname, contents) = +createHDFSFile(fs, inputDir, fname, contents) +} + +println(sname node is ${dfs.getNameNode.getNameNodeAddress.getHostName}) +println(sname node port is ${dfs.getNameNodePort}) + +val hdfsAddressDir = + shdfs://${dfs.getNameNode.getNameNodeAddress.getHostName}:${dfs.getNameNodePort}$dir +println(sHDFS address dir is $hdfsAddressDir) + +val res = smallTextFiles(sc, hdfsAddressDir).collect() + +assert(res.size === SmallTextFilesSuite.fileNames.size, + Number of files read out do not fit with the actual value) + +for ((fname, contents) - res) { + assert(SmallTextFilesSuite.fileNames.contains(fname), +sMissing file name $fname.) + assert(contents.hashCode === SmallTextFilesSuite.hashCodeOfContents(fname), +sfile $fname contents can not match) +} + } + + private def createNativeFile(inputDir: JPath, fileName: String, contents: Array[Byte]) = { +val out = new DataOutputStream(new FileOutputStream(s${inputDir.toString}/$fileName)) +out.write(contents, 0, contents.length) +out.close() +println(Wrote native file) + } + + /** + * This code will test the behaviors on native file
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38245425 @mengxr There are 2 java files in my PR, and another 2 scala files - the MLUtils.scala and the test suite. I just find the scala code style in the [style page](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide). Does java code also use 2 space indentation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38248858 Yes. If you're not sure about the right style for something, try to follow the style of the existing codebase. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38025512 If I understand the purpose correctly, this PR is for reading small text files. But most of the code is to handle the corner case when a file's size is greater than 2GB. You mentioned Mahout hit this problem. What was the use case there? If someone needs to concat several 2GB bytes buffers to create a single Text record, very likely he/she are not doing the right thing, IMHO. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38128543 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38128624 @mengxr Your advise makes sense. I remove the merge process away from `smallTextFiles()`, and rewrite the reading logic in `RecoderReader`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38128995 Merged build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38128996 One or more automated tests failed Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13287/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-38129529 Ah... It seems that Jenkins causes problem. The last two commits test failed due to this error: Fetching upstream changes from https://github.com/apache/spark.git ERROR: Timeout after 10 minutes FATAL: Failed to fetch from https://github.com/apache/spark.git Who can help to have a look? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10692101 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java --- @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util; + +import java.io.IOException; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in bytes format in filename, content format. --- End diff -- `CombineFileInputFormat` is to combine splits but not records. If the file is long and you cut it at `MAX_BYTES_ALLOCATION`, how do you guarantee this would be a valid cut for UTF-8 text? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10692570 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java --- @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util; + +import java.io.IOException; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in bytes format in filename, content format. --- End diff -- Here is really a mistake that I ignore. I use `MAX_BYTES_ALLOCATION` here to avoid the scenario that a file length is bigger than what an `Int.maxValue` can represent. Mahout implementation just ignore the case, they convert `long` directly into `int`. Though the semantic of the interface is small file, but we should not limit the input file length for end-user. I just treat it as bytes array which encapsulated in a `Text`, I will join the slices of a single file together in `smallTextFiles()` interface. Due to the locality of split, I can merge them together without shuffle. But I forget it in this version, I'll fix it now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-37910847 One or more automated tests failed Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13234/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-37910845 Merged build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-37970513 @yinxusen I'm still confused by the implementation. You want to make the entire content of a text file as the value returned by smallTextFiles. In the BatchFileRecordReader, the buffer size is limited by `MAX_BYTES_ALLOCATION`, and you merge the buffers from the same file in smallTextFiles to assemble the entire content. Is it necessary to take this approach? How does it differ from a simple `CombineWholeTextFileInputFormat`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-37899280 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10690775 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util; --- End diff -- Instead of `util`, how about `io` or `input`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10690798 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util; + +import java.io.IOException; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.JobContext; +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat; +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * The specific InputFormat reads files in HDFS or local disk. It will be called by + * HadoopRDD to generate new BatchFileRecordReader. + */ +public class BatchFileInputFormat --- End diff -- This is basically the `WholeFileInputFormat` for `Text`: https://github.com/tomwhite/hadoop-book/blob/master/ch07/src/main/java/WholeFileInputFormat.java Shall we call it `WholeTextFileInputFormat`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10690809 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util; + +import java.io.IOException; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.JobContext; +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat; +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * The specific InputFormat reads files in HDFS or local disk. It will be called by + * HadoopRDD to generate new BatchFileRecordReader. + */ +public class BatchFileInputFormat +extends CombineFileInputFormatString, Text { --- End diff -- The indentation you used is not consistent with Spark code style. This line may fit the line above. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10690831 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util; + +import java.io.IOException; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.JobContext; +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat; +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * The specific InputFormat reads files in HDFS or local disk. It will be called by + * HadoopRDD to generate new BatchFileRecordReader. + */ +public class BatchFileInputFormat +extends CombineFileInputFormatString, Text { + +@Override +protected boolean isSplitable(JobContext context, Path file) { +return false; +} +@Override +public RecordReaderString, Text createRecordReader( +InputSplit split, +TaskAttemptContext context) throws IOException { +return new CombineFileRecordReaderString, Text( +(CombineFileSplit)split, --- End diff -- put a space after `)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10690953 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java --- @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util; + +import java.io.IOException; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in bytes format in filename, content format. --- End diff -- Are you reading the entire file in bytes? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10691012 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util; --- End diff -- I think `input` is better, because input of machine learning algorithm is usually datasets, batch/streaming files, in different format. Meantime, the `output` is usually models or prediction results, which has a different semantic meaning. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10691105 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileInputFormat.java --- @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util; + +import java.io.IOException; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.JobContext; +import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat; +import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * The specific InputFormat reads files in HDFS or local disk. It will be called by + * HadoopRDD to generate new BatchFileRecordReader. + */ +public class BatchFileInputFormat +extends CombineFileInputFormatString, Text { --- End diff -- Sorry for my carelessness. I'll fix it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/164#discussion_r10691283 --- Diff: mllib/src/main/java/org/apache/spark/mllib/util/BatchFileRecordReader.java --- @@ -0,0 +1,117 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.util; + +import java.io.IOException; + +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +/** + * Reads an entire file out in bytes format in filename, content format. --- End diff -- In the life cycle of an instance of this class, the answer is yes. You know, the `split(index)` here indicates a single file. Because I set `isSplitable()` in `BatchFileInputFormat` to false. The `nextKeyValue()` could be called several times, until not more bytes left in the `split(index)`, by `nextKeyValue()` function in `CombineFileInputFormat`, and eventually the `compute()` in `HadoopRDD`. It will read an entire file in bytes, and I encapsulate these bytes into `Text`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1133] add small files input in MLlib
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/164#issuecomment-37901339 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13224/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---