I need to process .wav files in PySpark.  When the files are on the local 
file system, I can process them.  Once I store them on HDFS, I run into 
problems.  For example:

I run sox on a local wav file like this:

sox ext2187854_03_27_2014.wav -n stats  <-- works fine

sox hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav -n stats   <-- does not work, as sox cannot read HDFS files

So I pipe the file into sox instead:

hadoop fs -cat hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav | sox -t wav - -n stats  <-- this works fine

However, I am not able to do the same thing in PySpark:

wavfile = sc.textFile('hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav')
wavfile.pipe(subprocess.call(['sox', '-t' 'wav', '-', '-n', 'stats']))

I also tried other options such as sc.binaryFiles and sc.pickleFile, without 
success.
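
For reference, here is a minimal sketch of the kind of thing I am trying to 
achieve, not a tested solution.  It assumes that sc.binaryFiles hands each 
file over as (path, bytes) and that sox is installed on every worker node.  
The names pipe_bytes and sox_stats_for are hypothetical helpers I made up for 
illustration.  As far as I understand, sc.textFile splits its input into 
lines of text, so it is probably not suitable for binary wav data.

```python
import subprocess

# Same sox invocation as in the working shell pipeline above.
SOX_STATS = ['sox', '-t', 'wav', '-', '-n', 'stats']


def pipe_bytes(content, cmd=SOX_STATS):
    """Feed raw bytes to cmd on stdin; return (exit code, stdout, stderr)."""
    proc = subprocess.run(cmd, input=content,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # sox writes the output of its stats effect to stderr, so keep both streams.
    return (proc.returncode,
            proc.stdout.decode(errors='replace'),
            proc.stderr.decode(errors='replace'))


def sox_stats_for(sc, path):
    """Hypothetical helper: run sox stats on every file found under path."""
    # binaryFiles yields (file path, whole file content as bytes) pairs,
    # so each wav file reaches sox unmodified.
    return sc.binaryFiles(path).mapValues(pipe_bytes).collect()
```

With an existing SparkContext sc, this would be called as 
sox_stats_for(sc, 'hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav').  
Each file is loaded whole into memory on one executor, so this only suits 
files that fit comfortably in memory.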

Any thoughts?

Regards,
Venkat Ankam
