Thanks Sean and Akhil! I will take out the repartition(1).

Please let me know if I understood this correctly. Spark Streaming writes data like this:
foo-10000001.csv/part-xxxxx, part-xxxxx
foo-10000002.csv/part-xxxxx, part-xxxxx

When I see this in Hue, the csv's appear to me as *directories*, but if I understand correctly, they will appear as csv *files* to other Hadoop ecosystem tools? And, if I understand Tathagata's answer correctly, other Hadoop-based tools, such as Hive, will be able to create a table based on the multiple foo-100000x.csv "directories"?

Thank you, I really appreciate the help!

On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen <so...@cloudera.com> wrote:
> Keep in mind that if you repartition to 1 partition, you are only
> using 1 task to write the output, and potentially only 1 task to
> compute some parent RDDs. You lose parallelism. The
> files-in-a-directory output scheme is standard for Hadoop, and for a
> reason.
>
> Therefore I would consider separating this concern and merging the
> files afterwards if you need to.
>
> On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> > Simplest way would be to merge the output files at the end of your job, like:
> >
> > hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
> >
> > If you want to do it programmatically, then you can use the
> > FileUtil.copyMerge API, like:
> >
> > FileUtil.copyMerge(FileSystem of source (hdfs), /output-location, FileSystem
> > of destination (hdfs), Path to the merged file /merged-output, true (to delete
> > the original dir), null)
> >
> > Thanks
> > Best Regards
> >
> > On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhsheka...@gmail.com> wrote:
> >> Thanks Akhil for the suggestion, it is now only giving me one part-xxxx.
> >> Is there any way I can just create a file rather than a directory? It doesn't
> >> seem like there is a saveAsTextFile option for JavaPairRecieverDstream.
> >>
> >> Also, for the copy/merge API, how would I add that to my Spark job?
> >>
> >> Thanks Akhil!
> >>
> >> Best,
> >>
> >> Su
> >>
> >> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> >>> For a streaming application, for every batch it will create a new directory
> >>> and put the data in it. If you don't want to have multiple files inside the
> >>> directory as part-xxxx, then you can do a repartition before the saveAs*
> >>> call:
> >>>
> >>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
> >>> String.class, (Class) TextOutputFormat.class);
> >>>
> >>> Thanks
> >>> Best Regards
> >>>
> >>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com> wrote:
> >>>> Hello Everyone,
> >>>>
> >>>> I am writing simple word counts to HDFS using
> >>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
> >>>> String.class, (Class) TextOutputFormat.class);
> >>>>
> >>>> 1) However, every 2 seconds I am getting a new directory that is titled as a
> >>>> csv. So I'll have test.csv, which will be a directory that has two files
> >>>> inside of it called part-00000 and part-00001 (something like that). This
> >>>> obviously makes it very hard for me to read the data stored in the csv files. I am
> >>>> wondering if there is a better way to store the JavaPairRecieverDStream and
> >>>> JavaPairDStream?
> >>>>
> >>>> 2) I know there is a copy/merge Hadoop API for merging files... can this
> >>>> be done inside Java? I am not sure of the logic behind this API if I am using
> >>>> Spark Streaming, which is constantly making new files.
> >>>>
> >>>> Thanks a lot for the help!
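[Editor's note for readers following this thread: Sean's "merge the files afterwards" suggestion can also be done without the Hadoop API at all, once the part files have been copied to a local filesystem (e.g. via hadoop fs -get). The sketch below is a hypothetical plain-Java illustration of what `hadoop fs -getmerge` and FileUtil.copyMerge do: concatenate the part-NNNNN files of one batch directory, in name order, into a single file. Class and path names are invented for the example; nothing here is Spark or Hadoop API.]

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class MergeParts {

    // Concatenate every part-* file in a directory, in lexicographic
    // order (part-00000, part-00001, ...), into a single output file.
    // This is the local-filesystem analogue of `hadoop fs -getmerge`.
    public static void merge(Path partsDir, Path outputFile) throws IOException {
        List<Path> parts;
        try (var stream = Files.list(partsDir)) {
            parts = stream
                    .filter(p -> p.getFileName().toString().startsWith("part-"))
                    .sorted()
                    .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(outputFile)) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part's bytes to the output
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo: fake one streaming batch directory with two parts.
        Path dir = Files.createTempDirectory("foo-10000001.csv");
        Files.writeString(dir.resolve("part-00000"), "hello,1\n");
        Files.writeString(dir.resolve("part-00001"), "world,2\n");

        Path merged = dir.resolve("merged.csv");
        merge(dir, merged); // merged.csv is skipped by the part- filter
        System.out.print(Files.readString(merged));
    }
}
```

For the streaming case, the usual approach is to run a merge like this (or FileUtil.copyMerge on HDFS) as a periodic cleanup step over completed batch directories, rather than forcing repartition(1) inside the job and losing the write parallelism Sean describes.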