I don't think you can use textFile(), binaryFiles(), or pickleFile()
here; a .wav file is a binary format that none of them understand.

You could get a list of the paths of all the files, then
sc.parallelize() them and run a function over each with foreach():

import subprocess

def process(path):
    # use subprocess to launch a process to do the job on each
    # executor, and read its stdout as the result
    ...

files = []  # a list of paths of wav files
sc.parallelize(files, len(files)).foreach(process)
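Concretely, since sox cannot read HDFS directly, each task can shell out
to the same `hadoop fs -cat ... | sox` pipeline you already verified at
the command line. A rough sketch (untested; it assumes the `hadoop` and
`sox` binaries are on the PATH of every worker, and uses map()/collect()
instead of foreach() so the stats reports come back to the driver):

```python
import subprocess

def process(path):
    """Stream a wav file out of HDFS and pipe it into sox.

    sox cannot open hdfs:// URLs itself, so this runs
    `hadoop fs -cat <path> | sox -t wav - -n stats` as two
    subprocesses; sox's "stats" effect prints its report to stderr.
    """
    cat = subprocess.Popen(['hadoop', 'fs', '-cat', path],
                           stdout=subprocess.PIPE)
    sox = subprocess.Popen(['sox', '-t', 'wav', '-', '-n', 'stats'],
                           stdin=cat.stdout, stderr=subprocess.PIPE)
    cat.stdout.close()            # let cat see SIGPIPE if sox exits early
    stats = sox.communicate()[1]  # the stats report, as bytes
    cat.wait()
    return stats

# On the driver (requires a running SparkContext `sc`):
# files = ['hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav']
# reports = sc.parallelize(files, len(files)).map(process).collect()
```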

On Fri, Jan 16, 2015 at 2:11 PM, Venkat, Ankam
<ankam.ven...@centurylink.com> wrote:
> I need to process .wav files in Pyspark.  If the files are in the local file
> system, I am able to process them.  Once I store them on HDFS, I am facing
> issues.  For example,
>
> I run a sox program on a wav file like this:
>
> sox ext2187854_03_27_2014.wav -n stats  <-- works fine
>
> sox hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav -n stats
> <-- Does not work, as sox cannot read an HDFS file.
>
> So, I do it like this:
>
> hadoop fs -cat hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav |
> sox -t wav - -n stats  <-- This works fine
>
> But, I am not able to do this in PySpark:
>
> wavfile =
> sc.textFile('hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav')
>
> wavfile.pipe(subprocess.call(['sox', '-t', 'wav', '-', '-n', 'stats']))
>
> I tried different options like sc.binaryFiles and sc.pickleFile.
>
> Any thoughts?
>
> Regards,
>
> Venkat Ankam

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
