Thanks, this really helps.
As long as I stick to HDFS paths and files, I'm good. I do know that code a
bit, but I have never used it to, say, take input from one cluster via
"hdfs://server:port/path" and output to another via
"hdfs://another-server:another-port/path". This seems to be supported by
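If it is supported, I'd expect a minimal sketch to look something like the
following (the hostnames and ports are placeholders from the example above,
not real clusters; both NameNodes would need to be reachable from the
executors):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cross-cluster-copy"))

// Read from one cluster using a fully qualified HDFS URI...
val lines = sc.textFile("hdfs://server:port/path")

// ...and write the result to a different cluster the same way.
lines.saveAsTextFile("hdfs://another-server:another-port/path")
```

Because both URIs are fully qualified, neither side depends on the
fs.defaultFS setting of the cluster the job runs on.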
Doesn't using an HDFS path pattern then restrict the URI to an HDFS URI?
Since Spark supports several filesystem schemes, I'm unclear about how much
to assume about using the Hadoop filesystem APIs and conventions. Concretely,
if I pass a pattern in with an HTTPS filesystem, will the pattern work?
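My understanding (worth verifying against your Hadoop version) is that glob
expansion is not HDFS-specific: Hadoop resolves patterns through the
scheme's FileSystem implementation (FileSystem.globStatus), so a pattern
should work for any scheme whose FileSystem supports directory listing. For
example, the same pattern syntax works against the local filesystem (path is
hypothetical):

```scala
// Glob characters (*, ?, [], {a,b}) are expanded by the FileSystem
// registered for the URI's scheme, here the local file:// filesystem.
val parts = sc.textFile("file:///var/logs/2014-*/part-*.txt")
```

Whether that holds for an HTTPS-backed filesystem would depend on whether
its FileSystem implementation can list directories at all.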
Spark's sc.textFile() method
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456)
delegates to sc.hadoopFile(), which uses Hadoop's
Not that I know of. We were discussing it on another thread and it came up.
I think if you look up the Hadoop FileInputFormat API (which Spark uses)
you'll see it mentioned there in the docs.
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
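For what it's worth, the FileInputFormat docs linked above describe the input
as a comma-separated list of paths, so I'd expect a sketch like this to work
with sc.textFile() as well (paths are hypothetical, and each element may
itself be a glob):

```scala
// FileInputFormat.setInputPaths splits on commas, so several inputs
// can be passed in one string.
val rdd = sc.textFile(
  "hdfs://server:port/logs/2014/*,hdfs://server:port/logs/2015/*")
```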
But that's not