Re: Save RDD to HDFS using Spark Python API

2016-04-26 Thread Prashant Sharma

What Davies said is correct: the second argument is Hadoop's output format. Hadoop supports many types of output formats, each with its own advantages. For formats other than the one specified above, see https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html
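
To illustrate the point about the second argument, here is a minimal sketch. It assumes the call in question is PySpark's `RDD.saveAsNewAPIHadoopFile` and that the RDD holds (int, str) pairs; the thread does not show the original call, so treat both as assumptions. The helper name `save_as_sequence_file` and the constant `SEQUENCE_FILE_FORMAT` are made up for this example.

```python
# Sketch only: assumes a pair RDD of (int, str) and the new-API save method.
# The second positional argument names the Hadoop output format class.
SEQUENCE_FILE_FORMAT = (
    "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat"
)


def save_as_sequence_file(pair_rdd, output_path):
    """Write pair_rdd to output_path using the format named above.

    Hypothetical helper; pair_rdd is expected to be a PySpark RDD of pairs.
    """
    pair_rdd.saveAsNewAPIHadoopFile(
        output_path,
        SEQUENCE_FILE_FORMAT,  # second argument: Hadoop's output format
        keyClass="org.apache.hadoop.io.IntWritable",
        valueClass="org.apache.hadoop.io.Text",
    )
```

Swapping in another subclass of FileOutputFormat should only require changing the class name string, provided the key and value classes suit that format.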

Re: Save RDD to HDFS using Spark Python API

2016-04-26 Thread Davies Liu

hdfs://192.168.10.130:9000/dev/output/test already exists, so you need to remove it first.

On Tue, Apr 26, 2016 at 5:28 AM, Luke Adolph wrote:
> Hi, all:
> Below is my code:
>
> from pyspark import *
> import re
>
> def getDateByLine(input_str):
>     str_pattern =
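
One way to remove the existing output directory before saving is to shell out to the Hadoop command-line client from the driver. This is a rough sketch, assuming the `hdfs` command is on the driver's PATH; the helper names `build_rm_command` and `remove_output_dir` are made up for this example and are not part of Spark.

```python
import subprocess


def build_rm_command(hdfs_path):
    """Return the argument list for a recursive, forced HDFS delete."""
    # -r removes directories recursively; -f suppresses the error when the
    # path does not exist, so this is safe to run on a first job submission.
    return ["hdfs", "dfs", "-rm", "-r", "-f", hdfs_path]


def remove_output_dir(hdfs_path, runner=subprocess.call):
    """Run the delete and return the exit status (0 means success).

    runner is injectable so the command can be checked without a cluster.
    """
    return runner(build_rm_command(hdfs_path))
```

Calling `remove_output_dir("hdfs://192.168.10.130:9000/dev/output/test")` just before the save step should clear the path that the error complains about. Note that this deletes whatever is already stored there.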

Save RDD to HDFS using Spark Python API

2016-04-26 Thread Luke Adolph

Hi, all:

Below is my code:

from pyspark import *
import re

def getDateByLine(input_str):
    str_pattern = '^\d{4}-\d{2}-\d{2}'
    pattern = re.compile(str_pattern)
    match = pattern.match(input_str)
    if match:
        return match.group()
    else:
        return None

file_url =