Keep in mind that if you repartition to 1 partition, you are only
using 1 task to write the output, and potentially only 1 task to
compute some parent RDDs, so you lose parallelism. The
files-in-a-directory output scheme is standard for Hadoop, and for
good reason.

Therefore I would consider separating this concern and merging the
files afterwards if you need to.
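
For example, a rough, untested sketch of such a post-job merge step,
looping over the per-batch output directories (the output path and
".csv" suffix here are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Merge each per-batch directory of part files into one file,
// deleting the source directory afterwards.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
for (FileStatus dir : fs.listStatus(new Path("/user/ec2-user/output"))) {
  if (!dir.isDirectory()) continue;
  Path merged = new Path(dir.getPath() + ".csv");
  FileUtil.copyMerge(fs, dir.getPath(), fs, merged, true, conf, null);
}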

On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> The simplest way would be to merge the output files at the end of your job, like:
>
> hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>
> If you want to do it programmatically, then you can use the
> FileUtil.copyMerge API, like:
>
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);  // source and destination FileSystem (HDFS)
> FileUtil.copyMerge(fs, new Path("/output-location"), fs,
>     new Path("/merged-output"), true /* delete the original dir */, conf, null);
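>
> If you want to run that from inside your Spark job, one sketch
> (assuming you stop the streaming context yourself; "ssc" stands in
> for your JavaStreamingContext) is to merge on the driver after the
> streaming job has finished:
>
> ssc.start();
> ssc.awaitTermination();
> // All batch directories are final at this point, so it is safe to
> // merge on the driver.
> FileUtil.copyMerge(fs, new Path("/output-location"), fs,
>     new Path("/merged-output"), true, conf, null);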
>
>
>
> Thanks
> Best Regards
>
> On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhsheka...@gmail.com> wrote:
>>
>> Thanks Akhil for the suggestion, it is now only giving me one part-xxxx.
>> Is there any way I can just create a file rather than a directory? It doesn't
>> seem like there is a plain saveAsTextFile option for JavaPairReceiverDStream.
>>
>> Also, for the copy/merge api, how would I add that to my spark job?
>>
>> Thanks Akhil!
>>
>> Best,
>>
>> Su
>>
>> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>>
>>> For a streaming application, it will create a new directory for every batch
>>> and put the data in it. If you don't want to have multiple part-xxxx files
>>> inside the directory, then you can do a repartition before the saveAs*
>>> call.
>>>
>>>
>>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
>>>     String.class, String.class, (Class) TextOutputFormat.class);
>>>
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com> wrote:
>>>>
>>>> Hello Everyone,
>>>>
>>>> I am writing simple word counts to HDFS using
>>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
>>>>     String.class, String.class, (Class) TextOutputFormat.class);
>>>>
>>>> 1) However, every 2 seconds I am getting a new directory that is titled as
>>>> a csv. So I'll have test.csv, which will be a directory containing two files
>>>> called part-00000 and part-00001 (something like that). This obviously makes
>>>> it very hard for me to read the data stored in the csv files. I am wondering
>>>> if there is a better way to store the JavaPairReceiverDStream and
>>>> JavaPairDStream?
>>>>
>>>> 2) I know there is a copy/merge Hadoop API for merging files... can this
>>>> be done inside Java? I am not sure how this API would work if I am using
>>>> Spark Streaming, which is constantly creating new files.
>>>>
>>>> Thanks a lot for the help!
>>>
>>>
>>
>
