Okay, got it, thanks for the help Sean!
On Sat, Feb 14, 2015 at 1:08 PM, Sean Owen <so...@cloudera.com> wrote:
> No, they appear as directories + files to everything. Lots of tools
> are used to taking an input that is a directory of part files, though.
> You can certainly point MR, Hive, etc. at a directory of these files.
>
> On Sat, Feb 14, 2015 at 9:05 PM, Su She <suhsheka...@gmail.com> wrote:
> > Thanks Sean and Akhil! I will take out the repartition(1). Please let me
> > know if I understood this correctly. Spark Streaming writes data like this:
> >
> > foo-10000001.csv/part-xxxxx, part-xxxxx
> > foo-10000002.csv/part-xxxxx, part-xxxxx
> >
> > When I see this on Hue, the csv's appear to me as directories, but if I
> > understand correctly, they will appear as csv files to other Hadoop
> > ecosystem tools? And, if I understand Tathagata's answer correctly, other
> > Hadoop-based tools, such as Hive, will be able to create a table based
> > on the multiple foo-100000x.csv "directories"?
> >
> > Thank you, I really appreciate the help!
> >
> > On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> Keep in mind that if you repartition to 1 partition, you are only
> >> using 1 task to write the output, and potentially only 1 task to
> >> compute some parent RDDs. You lose parallelism. The
> >> files-in-a-directory output scheme is standard for Hadoop, and for a
> >> reason.
> >>
> >> Therefore I would consider separating this concern and merging the
> >> files afterwards if you need to.
> >>
> >> On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das <ak...@sigmoidanalytics.com>
> >> wrote:
> >> > Simplest way would be to merge the output files at the end of your
> >> > job, like:
> >> >
> >> > hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
> >> >
> >> > If you want to do it programmatically, then you can use the
> >> > FileUtil.copyMerge API, like:
> >> >
> >> > FileUtil.copyMerge(FileSystem of source (hdfs), /output-location,
> >> > FileSystem of destination (hdfs), Path to the merged file
> >> > /merged-output, true (to delete the original dir), null)
> >> >
> >> > Thanks
> >> > Best Regards
> >> >
> >> > On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhsheka...@gmail.com> wrote:
> >> >>
> >> >> Thanks Akhil for the suggestion, it is now only giving me one
> >> >> part-xxxx. Is there any way I can just create a file rather than a
> >> >> directory? It doesn't seem like there is a saveAsTextFile option for
> >> >> JavaPairReceiverDStream.
> >> >>
> >> >> Also, for the copy/merge API, how would I add that to my Spark job?
> >> >>
> >> >> Thanks Akhil!
> >> >>
> >> >> Best,
> >> >>
> >> >> Su
> >> >>
> >> >> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das
> >> >> <ak...@sigmoidanalytics.com> wrote:
> >> >>>
> >> >>> For a streaming application, it will create a new directory for
> >> >>> every batch and put the data in it. If you don't want to have
> >> >>> multiple files inside the directory as part-xxxx, then you can do a
> >> >>> repartition before the saveAs* call:
> >> >>>
> >> >>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/",
> >> >>>     "csv", String.class, String.class, (Class) TextOutputFormat.class);
> >> >>>
> >> >>> Thanks
> >> >>> Best Regards
> >> >>>
> >> >>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhsheka...@gmail.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> Hello Everyone,
> >> >>>>
> >> >>>> I am writing simple word counts to HDFS using
> >> >>>>
> >> >>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv",
> >> >>>>     String.class, String.class, (Class) TextOutputFormat.class);
> >> >>>>
> >> >>>> 1) However, every 2 seconds I am getting a new directory that is
> >> >>>> titled as a csv. So I'll have test.csv, which will be a directory
> >> >>>> that has two files inside of it called part-00000 and part-00001
> >> >>>> (something like that). This obviously makes it very hard for me to
> >> >>>> read the data stored in the csv files. I am wondering if there is a
> >> >>>> better way to store the JavaPairReceiverDStream and JavaPairDStream?
> >> >>>>
> >> >>>> 2) I know there is a copy/merge Hadoop API for merging files...can
> >> >>>> this be done inside Java? I am not sure of the logic behind this
> >> >>>> API if I am using Spark Streaming, which is constantly making new
> >> >>>> files.
> >> >>>>
> >> >>>> Thanks a lot for the help!