In PySpark, I think you could enumerate all the valid files yourself, create
an RDD from each with newAPIHadoopFile(), and then union them together.
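
An untested sketch of that approach; the directory name and the Parquet/Avro
class names are my assumptions (adjust them for your build), and depending on
your Spark version you may also need key/value converters to get the Avro
records back into Python:

from pyspark import SparkContext

sc = SparkContext(appName="filtered-parquet")

# List the directory with the Hadoop FileSystem API (reached via Py4J)
# and skip the administrative files (.foo / _foo).
hadoop = sc._jvm.org.apache.hadoop
hconf = hadoop.conf.Configuration()
top = hadoop.fs.Path("/data/topdir")  # hypothetical top-level directory
fs = top.getFileSystem(hconf)
valid = [str(s.getPath()) for s in fs.listStatus(top)
         if not s.getPath().getName().startswith((".", "_"))]

# One RDD per valid file via the new Hadoop API, then union them all.
rdds = [sc.newAPIHadoopFile(
            p,
            "parquet.avro.AvroParquetInputFormat",
            "java.lang.Void",
            "org.apache.avro.generic.GenericRecord")
        for p in valid]
records = sc.union(rdds)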

On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman
<eric.d.fried...@gmail.com> wrote:
> I neglected to specify that I'm using PySpark. Doesn't look like these APIs
> have been bridged.
>
> ----
> Eric Friedman
>
>> On Sep 14, 2014, at 11:02 PM, Nat Padmanabhan <reachn...@gmail.com> wrote:
>>
>> Hi Eric,
>>
>> Something along the lines of the following should work
>>
>> val fs = getFileSystem(...) // standard Hadoop FileSystem API call
>> // pathFilter is an instance of org.apache.hadoop.fs.PathFilter
>> val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter)
>>   .map(_.getPath.toString).mkString(",")
>> // ParquetInputFormat is a mapreduce (new-API) input format, so use
>> // newAPIHadoopFile rather than hadoopFile
>> val parquetRdd = sc.newAPIHadoopFile(filteredConcatenatedPaths,
>>   classOf[ParquetInputFormat[SomeAvroType]], classOf[Void],
>>   classOf[SomeAvroType], getConfiguration(...))
>>
>> You have to do some initialization on ParquetInputFormat, such as setting
>> up AvroReadSupport/AvroWriteSupport etc., but I am guessing you are
>> already doing that.
>>
>> Cheers,
>> Nat
>>
>>
>> On Sun, Sep 14, 2014 at 7:37 PM, Eric Friedman
>> <eric.d.fried...@gmail.com> wrote:
>>> Hi,
>>>
>>> I have a directory structure with parquet+avro data in it. There are a
>>> couple of administrative files (.foo and/or _foo) that I need to ignore
>>> when processing this data; otherwise Spark tries to read them as
>>> containing Parquet content, which they do not.
>>>
>>> How can I set a PathFilter on the FileInputFormat used to construct an RDD?
