Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI? Since 
Spark supports several filesystem schemes, I’m unclear about how much to assume 
when using the Hadoop FileSystem APIs and conventions. Concretely, if I pass a 
pattern in with an HTTPS filesystem, will the pattern still work? 

How does Spark implement its storage layer? It seems to be an abstraction 
level beyond what is available in HDFS. In order to preserve that flexibility, 
what APIs should I be using? It would be easy to say HDFS only and use the HDFS 
APIs, but that would seem to limit things, especially where you would like to 
read from one cluster and write to another. That is not so easy to do with the 
HDFS APIs alone, or is advanced beyond my knowledge.

If I can stick to passing URIs to sc.textFile() I’m OK, but if I need to examine 
the structure of the file system, I’m unclear how to do it without sacrificing 
Spark’s flexibility.
 
On Apr 29, 2014, at 12:55 AM, Christophe Préaud <christophe.pre...@kelkoo.com> 
wrote:

Hi,

You can also use any path pattern as defined here: 
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

e.g.:
sc.textFile('{/path/to/file1,/path/to/file2}')
Christophe.
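To preview locally which literal paths a brace pattern like the one above would cover, a simplified sketch of the {a,b} alternation that Hadoop's globStatus supports can be written in plain Python. This is only an illustration of the alternation rule, not Hadoop's implementation, and it ignores the other glob features (*, ?, [abc]):

```python
import re

def expand_braces(pattern):
    """Expand {a,b,...} alternations, as in Hadoop glob patterns.

    A simplified sketch for previewing patterns locally; Hadoop's
    globStatus also supports *, ?, and [abc], which this helper ignores.
    """
    m = re.search(r"\{([^{}]*)\}", pattern)
    if m is None:
        return [pattern]  # no alternation left: the pattern is a literal path
    head, tail = pattern[:m.start()], pattern[m.end():]
    results = []
    for alt in m.group(1).split(","):
        # recurse so that any remaining braces in the tail also expand
        results.extend(expand_braces(head + alt + tail))
    return results

print(expand_braces("{/path/to/file1,/path/to/file2}"))
# → ['/path/to/file1', '/path/to/file2']
```

Each expanded path is then matched by the filesystem the URI scheme resolves to, so the same pattern syntax works across schemes.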

On 29/04/2014 05:07, Nicholas Chammas wrote:
> Not that I know of. We were discussing it on another thread and it came up. 
> 
> I think if you look up the Hadoop FileInputFormat API (which Spark uses) 
> you'll see it mentioned there in the docs. 
> 
> http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
> 
> But that's not obvious.
> 
> Nick
> 
> On Monday, April 28, 2014, Pat Ferrel <pat.fer...@gmail.com> wrote:
> Perfect. 
> 
> BTW just so I know where to look next time, was that in some docs?
> 
> On Apr 28, 2014, at 7:04 PM, Nicholas Chammas <nicholas.cham...@gmail.com> 
> wrote:
> 
> Yep, as I just found out, you can also provide 
> sc.textFile() with a comma-delimited string of all the files you want to load.
> 
> For example:
> 
> sc.textFile('/path/to/file1,/path/to/file2')
> So once you have your list of files, concatenate their paths like that and 
> pass the single string to 
> textFile().
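The steps Nick describes — walk a directory tree, keep the files matching a regex, then join their paths into one comma-delimited string — can be sketched in plain Python; the directory layout and regex below are hypothetical, and the Spark call is shown only as a comment:

```python
import os
import re
import tempfile

def matching_paths(root, pattern):
    """Walk a directory tree and return sorted paths whose names match pattern."""
    rx = re.compile(pattern)
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if rx.search(name):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)

# Hypothetical layout: two data files and one file to be skipped.
root = tempfile.mkdtemp()
for name in ("part-0001.log", "part-0002.log", "skip.txt"):
    open(os.path.join(root, name), "w").close()

arg = ",".join(matching_paths(root, r"part-\d+\.log$"))
# rdd = sc.textFile(arg)  # one RDD over every matched file (Spark call, for context)
print(len(arg.split(",")))
# → 2
```

The joined string is what gets handed to textFile() as a single argument.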
> 
> Nick
> 
> 
> 
> On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> sc.textFile(URI) supports reading multiple files in parallel but only with a 
> wildcard. I need to walk a dir tree, match a regex to create a list of files, 
> then I’d like to read them into a single RDD in parallel. I understand these 
> could go into separate RDDs then a union RDD can be created. Is there a way 
> to create a single RDD from a URI list?
> 
> 


