Hi,
Internally, Spark uses the HDFS API to handle file data. Have a look at HAR
and the SequenceFile input format. There is more information in this Cloudera
blog post:
<http://blog.cloudera.com/blog/2009/02/the-small-files-problem/>.
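
For example, a rough sketch (placeholder paths, not tested) of packing many
small text files into one SequenceFile and reading it back:

    // Pack small files into a single SequenceFile keyed by original path,
    // so later jobs open one large file instead of thousands of small ones.
    // (On older Spark versions you may also need:
    //  import org.apache.spark.SparkContext._)
    val packed = sc.wholeTextFiles("hdfs:///data/small-files")
    packed.saveAsSequenceFile("hdfs:///data/packed")

    // Read the packed data back; values are the original file contents.
    val lines = sc.sequenceFile[String, String]("hdfs:///data/packed")
      .flatMap { case (_, content) => content.split("\n") }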

Regards,
Madhukara Phatak
http://datamantra.io/

On Sun, Mar 15, 2015 at 9:59 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Ah most interesting—thanks.
>
> So it seems sc.textFile(longFileList) has to read all the metadata before
> starting the read, for partitioning purposes, so your workaround is simply
> not to use it?
>
> You create a task per file, so each task reads one file (in parallel)
> without scanning for _all_ the metadata first. Can’t argue with the logic,
> but perhaps Spark should incorporate something like this in sc.textFile? My
> case can’t be that unusual, especially since I am periodically processing
> micro-batches from Spark Streaming. Actually, I have to scan HDFS to create
> the longFileList to begin with, so I already get the file status and
> therefore probably all the metadata needed by sc.textFile. Your method
> would save one scan, which is good.
>
> Might a better sc.textFile take a beginning URI, a file-pattern regex, and
> a recursive flag? Then one scan could gather all the metadata automatically
> for a large subset of people using the function, something like:
>
>     sc.textFile(beginDir: String, filePattern: String = "^part.*",
>                 recursive: Boolean = false)
>
> In fact it should be easy to create a BetterSC that overrides the textFile
> method with a re-implementation that only requires one scan to get the
> metadata.
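>
> A rough, untested sketch of what that helper could look like (the
> listFiles-based scan and the names here are just one possible wiring):
>
>     import org.apache.hadoop.fs.{FileSystem, Path}
>
>     // Sketch only: one recursive scan of HDFS, filter file names by a
>     // regex, then hand the whole comma-separated list to textFile.
>     def textFileRecursive(sc: org.apache.spark.SparkContext,
>                           beginDir: String,
>                           filePattern: String = "^part.*",
>                           recursive: Boolean = false) = {
>       val fs = FileSystem.get(sc.hadoopConfiguration)
>       val files = fs.listFiles(new Path(beginDir), recursive)
>       val paths = scala.collection.mutable.ArrayBuffer[String]()
>       while (files.hasNext) {
>         val status = files.next()
>         if (status.getPath.getName.matches(filePattern))
>           paths += status.getPath.toString
>       }
>       sc.textFile(paths.mkString(","))
>     }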
>
> Just thinking on email…
>
> On Mar 14, 2015, at 11:11 AM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>
> Here is how I have dealt with many small text files (on S3, though this
> should generalize) in the past:
>
> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E
>
>
>
>
>> From: Michael Armbrust <mich...@databricks.com>
>> Subject: Re: S3NativeFileSystem inefficient implementation when calling
>> sc.textFile
>> Date: Thu, 27 Nov 2014 03:20:14 GMT
>>
>> In the past I have worked around this problem by avoiding sc.textFile().
>> Instead I read the data directly inside of a Spark job.  Basically, you
>> start with an RDD where each entry is a file in S3 and then flatMap that
>> with something that reads the files and returns the lines.
>>
>> Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe
>>
>> Using this class you can do something like:
>>
>> sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file2" :: ... ::
>> Nil).flatMap(new ReadLinesSafe(_))
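>>
>> A rough sketch of that pattern (not the actual ReadLinesSafe from the
>> gist, just the general idea: read each file inside the task via the
>> Hadoop FileSystem API and swallow per-file failures):
>>
>>     import org.apache.hadoop.conf.Configuration
>>     import org.apache.hadoop.fs.{FileSystem, Path}
>>     import scala.io.Source
>>
>>     // Sketch only: runs on the executors, so any S3/HDFS credentials
>>     // must be available in the Hadoop configuration there.
>>     def readLinesSafe(path: String): Seq[String] = {
>>       try {
>>         val fs = FileSystem.get(new java.net.URI(path), new Configuration())
>>         val in = fs.open(new Path(path))
>>         try Source.fromInputStream(in).getLines().toList
>>         finally in.close()
>>       } catch { case _: Exception => Seq.empty }
>>     }
>>
>>     val files = Seq("s3n://mybucket/file1", "s3n://mybucket/file2")
>>     val lines = sc.parallelize(files).flatMap(readLinesSafe)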
>>
>> You can also build up the list of files by running a Spark job:
>> https://gist.github.com/marmbrus/15e72f7bc22337cf6653
>>
>> Michael
>>
>>
>> On Sat, Mar 14, 2015 at 10:38 AM, Pat Ferrel <p...@occamsmachete.com>
>> wrote:
>>
>>> It’s a long story, but there are many dirs with smallish part-xxxx files
>>> in them, so we create a list of the individual files as input
>>> to sparkContext.textFile(fileList). I suppose we could move them and rename
>>> them to be contiguous part-xxxx files in one dir. Would that be better than
>>> passing in a long list of individual filenames? We could also make the part
>>> files much larger by collecting the smaller ones. But would any of this
>>> make a difference in I/O speed?
>>>
>>> I ask because using the long file list seems to read what amounts to a
>>> not-very-large data set rather slowly. If it were all in large part files
>>> in one dir I’d expect it to go much faster, but this is just intuition.
>>>
>>>
>>> On Mar 14, 2015, at 9:58 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>> Why can you not put them in a directory and read them as one input? You
>>> will get a task per file, but Spark is very fast at executing many tasks
>>> (it’s not a JVM per task).
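>>>
>>> For example (sketch, placeholder path):
>>>
>>>     sc.textFile("hdfs:///data/parts")  // one task per file/split in the dir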
>>>
>>> On Sat, Mar 14, 2015 at 12:51 PM, Pat Ferrel <p...@occamsmachete.com>
>>> wrote:
>>>
>>>> Any advice on dealing with a large number of separate input files?
>>>>
>>>>
>>>> On Mar 13, 2015, at 4:06 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>
>>>> We have many text files that we need to read in parallel. We can create
>>>> a comma-delimited list of files to pass in to
>>>> sparkContext.textFile(fileList). The list can get very large (maybe 10,000)
>>>> and is all on HDFS.
>>>>
>>>> The question is: what is the most performant way to read them? Should
>>>> they be broken up and read in groups, appending the resulting RDDs, or
>>>> should we just pass in the entire list at once? In effect I’m asking
>>>> whether Spark does some optimization or whether we should do it
>>>> explicitly. If the latter, what rule might we use depending on our
>>>> cluster setup?
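>>>>
>>>> For reference, the two options in code (sketch only; fileList is a
>>>> Seq[String] of HDFS paths and the group size is arbitrary):
>>>>
>>>>     // 1) hand the whole comma-separated list to textFile in one call
>>>>     val all = sc.textFile(fileList.mkString(","))
>>>>
>>>>     // 2) read in groups and append (union) the resulting RDDs
>>>>     val grouped = fileList.grouped(500).map(g => sc.textFile(g.mkString(",")))
>>>>     val combined = sc.union(grouped.toSeq)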
>>>>
>>>>
>>>
>>>
>>
>
>
