Re: Why are there different "parts" in my CSV?

2015-02-14 Thread Su She
Okay, got it, thanks for the help Sean!


On Sat, Feb 14, 2015 at 1:08 PM, Sean Owen  wrote:

> No, they appear as directories + files to everything. Lots of tools
> are used to taking an input that is a directory of part files though.
> You can certainly point MR, Hive, etc at a directory of these files.
>
> On Sat, Feb 14, 2015 at 9:05 PM, Su She  wrote:
> > Thanks Sean and Akhil! I will take out the repartition(1).  Please let me
> > know if I understood this correctly, Spark Streaming writes data like
> this:
> >
> > foo-1001.csv/part -x, part-x
> > foo-1002.csv/part -x, part-x
> >
> > When I see this on Hue, the csv's appear to me as directories, but if I
> > understand correctly, they will appear as csv files to other hadoop
> > ecosystem tools? And, if I understand Tathagata's answer correctly, other
> > hadoop based ecosystems, such as Hive, will be able to create a table
> based
> > of the multiple foo-10x.csv "directories"?
> >
> > Thank you, I really appreciate the help!
> >
> > On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen  wrote:
> >>
> >> Keep in mind that if you repartition to 1 partition, you are only
> >> using 1 task to write the output, and potentially only 1 task to
> >> compute some parent RDDs. You lose parallelism.  The
> >> files-in-a-directory output scheme is standard for Hadoop and for a
> >> reason.
> >>
> >> Therefore I would consider separating this concern and merging the
> >> files afterwards if you need to.
> >>
> >> On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das 
> >> wrote:
> >> > Simplest way would be to merge the output files at the end of your job
> >> > like:
> >> >
> >> > hadoop fs -getmerge /output/dir/on/hdfs/
> /desired/local/output/file.txt
> >> >
> >> > If you want to do it pro grammatically, then you can use the
> >> > FileUtil.copyMerge API
> >> > . like:
> >> >
> >> > FileUtil.copyMerge(FileSystem of source(hdfs), /output-location,
> >> > FileSystem
> >> > of destination(hdfs), Path to the merged files /merged-ouput, true(to
> >> > delete
> >> > the original dir),null)
> >> >
> >> >
> >> >
> >> > Thanks
> >> > Best Regards
> >> >
> >> > On Sat, Feb 14, 2015 at 2:18 AM, Su She 
> wrote:
> >> >>
> >> >> Thanks Akhil for the suggestion, it is now only giving me one part -
> >> >> .
> >> >> Is there anyway I can just create a file rather than a directory? It
> >> >> doesn't
> >> >> seem like there is just a saveAsTextFile option for
> >> >> JavaPairRecieverDstream.
> >> >>
> >> >> Also, for the copy/merge api, how would I add that to my spark job?
> >> >>
> >> >> Thanks Akhil!
> >> >>
> >> >> Best,
> >> >>
> >> >> Su
> >> >>
> >> >> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das
> >> >> 
> >> >> wrote:
> >> >>>
> >> >>> For streaming application, for every batch it will create a new
> >> >>> directory
> >> >>> and puts the data in it. If you don't want to have multiple files
> >> >>> inside the
> >> >>> directory as part- then you can do a repartition before the
> >> >>> saveAs*
> >> >>> call.
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
> >> >>> String.class, (Class) TextOutputFormat.class);
> >> >>>
> >> >>>
> >> >>> Thanks
> >> >>> Best Regards
> >> >>>
> >> >>> On Fri, Feb 13, 2015 at 11:59 AM, Su She 
> >> >>> wrote:
> >> 
> >>  Hello Everyone,
> >> 
> >>  I am writing simple word counts to hdfs using
> >> 
> >> 
> messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
> >>  String.class, (Class) TextOutputFormat.class);
> >> 
> >>  1) However, each 2 seconds I getting a new directory that is titled
> >>  as a
> >>  csv. So i'll have test.csv, which will be a directory that has two
> >>  files
> >>  inside of it called part-0 and part 1 (something like
> that).
> >>  This
> >>  obv makes it very hard for me to read the data stored in the csv
> >>  files. I am
> >>  wondering if there is a better way to store the
> >>  JavaPairRecieverDStream and
> >>  JavaPairDStream?
> >> 
> >>  2) I know there is a copy/merge hadoop api for merging files...can
> >>  this
> >>  be done inside java? I am not sure the logic behind this api if I
> am
> >>  using
> >>  spark streaming which is constantly making new files.
> >> 
> >>  Thanks a lot for the help!
> >> >>>
> >> >>>
> >> >>
> >> >
> >
> >
>


Re: Why are there different "parts" in my CSV?

2015-02-14 Thread Sean Owen
No, they appear as directories + files to everything. Lots of tools
are used to taking an input that is a directory of part files though.
You can certainly point MR, Hive, etc at a directory of these files.

On Sat, Feb 14, 2015 at 9:05 PM, Su She  wrote:
> Thanks Sean and Akhil! I will take out the repartition(1).  Please let me
> know if I understood this correctly, Spark Streaming writes data like this:
>
> foo-1001.csv/part -x, part-x
> foo-1002.csv/part -x, part-x
>
> When I see this on Hue, the csv's appear to me as directories, but if I
> understand correctly, they will appear as csv files to other hadoop
> ecosystem tools? And, if I understand Tathagata's answer correctly, other
> hadoop based ecosystems, such as Hive, will be able to create a table based
> of the multiple foo-10x.csv "directories"?
>
> Thank you, I really appreciate the help!
>
> On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen  wrote:
>>
>> Keep in mind that if you repartition to 1 partition, you are only
>> using 1 task to write the output, and potentially only 1 task to
>> compute some parent RDDs. You lose parallelism.  The
>> files-in-a-directory output scheme is standard for Hadoop and for a
>> reason.
>>
>> Therefore I would consider separating this concern and merging the
>> files afterwards if you need to.
>>
>> On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das 
>> wrote:
>> > Simplest way would be to merge the output files at the end of your job
>> > like:
>> >
>> > hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>> >
>> > If you want to do it pro grammatically, then you can use the
>> > FileUtil.copyMerge API
>> > . like:
>> >
>> > FileUtil.copyMerge(FileSystem of source(hdfs), /output-location,
>> > FileSystem
>> > of destination(hdfs), Path to the merged files /merged-ouput, true(to
>> > delete
>> > the original dir),null)
>> >
>> >
>> >
>> > Thanks
>> > Best Regards
>> >
>> > On Sat, Feb 14, 2015 at 2:18 AM, Su She  wrote:
>> >>
>> >> Thanks Akhil for the suggestion, it is now only giving me one part -
>> >> .
>> >> Is there anyway I can just create a file rather than a directory? It
>> >> doesn't
>> >> seem like there is just a saveAsTextFile option for
>> >> JavaPairRecieverDstream.
>> >>
>> >> Also, for the copy/merge api, how would I add that to my spark job?
>> >>
>> >> Thanks Akhil!
>> >>
>> >> Best,
>> >>
>> >> Su
>> >>
>> >> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das
>> >> 
>> >> wrote:
>> >>>
>> >>> For streaming application, for every batch it will create a new
>> >>> directory
>> >>> and puts the data in it. If you don't want to have multiple files
>> >>> inside the
>> >>> directory as part- then you can do a repartition before the
>> >>> saveAs*
>> >>> call.
>> >>>
>> >>>
>> >>>
>> >>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>> >>> String.class, (Class) TextOutputFormat.class);
>> >>>
>> >>>
>> >>> Thanks
>> >>> Best Regards
>> >>>
>> >>> On Fri, Feb 13, 2015 at 11:59 AM, Su She 
>> >>> wrote:
>> 
>>  Hello Everyone,
>> 
>>  I am writing simple word counts to hdfs using
>> 
>>  messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>>  String.class, (Class) TextOutputFormat.class);
>> 
>>  1) However, each 2 seconds I getting a new directory that is titled
>>  as a
>>  csv. So i'll have test.csv, which will be a directory that has two
>>  files
>>  inside of it called part-0 and part 1 (something like that).
>>  This
>>  obv makes it very hard for me to read the data stored in the csv
>>  files. I am
>>  wondering if there is a better way to store the
>>  JavaPairRecieverDStream and
>>  JavaPairDStream?
>> 
>>  2) I know there is a copy/merge hadoop api for merging files...can
>>  this
>>  be done inside java? I am not sure the logic behind this api if I am
>>  using
>>  spark streaming which is constantly making new files.
>> 
>>  Thanks a lot for the help!
>> >>>
>> >>>
>> >>
>> >
>
>




Re: Why are there different "parts" in my CSV?

2015-02-14 Thread Su She
Thanks Sean and Akhil! I will take out the repartition(1).  Please let me
know if I understood this correctly, Spark Streaming writes data like this:

foo-1001.csv/part -x, part-x
foo-1002.csv/part -x, part-x

When I see this on Hue, the csv's appear to me as *directories*, but if I
understand correctly, they will appear as csv *files* to other hadoop
ecosystem tools? And, if I understand Tathagata's answer correctly, other
hadoop-based ecosystems, such as Hive, will be able to create a table based
on the multiple foo-10x.csv "directories"?

Thank you, I really appreciate the help!

On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen  wrote:

> Keep in mind that if you repartition to 1 partition, you are only
> using 1 task to write the output, and potentially only 1 task to
> compute some parent RDDs. You lose parallelism.  The
> files-in-a-directory output scheme is standard for Hadoop and for a
> reason.
>
> Therefore I would consider separating this concern and merging the
> files afterwards if you need to.
>
> On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das 
> wrote:
> > Simplest way would be to merge the output files at the end of your job
> like:
> >
> > hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
> >
> > If you want to do it pro grammatically, then you can use the
> > FileUtil.copyMerge API
> > . like:
> >
> > FileUtil.copyMerge(FileSystem of source(hdfs), /output-location,
> FileSystem
> > of destination(hdfs), Path to the merged files /merged-ouput, true(to
> delete
> > the original dir),null)
> >
> >
> >
> > Thanks
> > Best Regards
> >
> > On Sat, Feb 14, 2015 at 2:18 AM, Su She  wrote:
> >>
> >> Thanks Akhil for the suggestion, it is now only giving me one part -
> .
> >> Is there anyway I can just create a file rather than a directory? It
> doesn't
> >> seem like there is just a saveAsTextFile option for
> JavaPairRecieverDstream.
> >>
> >> Also, for the copy/merge api, how would I add that to my spark job?
> >>
> >> Thanks Akhil!
> >>
> >> Best,
> >>
> >> Su
> >>
> >> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das  >
> >> wrote:
> >>>
> >>> For streaming application, for every batch it will create a new
> directory
> >>> and puts the data in it. If you don't want to have multiple files
> inside the
> >>> directory as part- then you can do a repartition before the saveAs*
> >>> call.
> >>>
> >>>
> >>>
> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
> >>> String.class, (Class) TextOutputFormat.class);
> >>>
> >>>
> >>> Thanks
> >>> Best Regards
> >>>
> >>> On Fri, Feb 13, 2015 at 11:59 AM, Su She 
> wrote:
> 
>  Hello Everyone,
> 
>  I am writing simple word counts to hdfs using
>  messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>  String.class, (Class) TextOutputFormat.class);
> 
>  1) However, each 2 seconds I getting a new directory that is titled
> as a
>  csv. So i'll have test.csv, which will be a directory that has two
> files
>  inside of it called part-0 and part 1 (something like that).
> This
>  obv makes it very hard for me to read the data stored in the csv
> files. I am
>  wondering if there is a better way to store the
> JavaPairRecieverDStream and
>  JavaPairDStream?
> 
>  2) I know there is a copy/merge hadoop api for merging files...can
> this
>  be done inside java? I am not sure the logic behind this api if I am
> using
>  spark streaming which is constantly making new files.
> 
>  Thanks a lot for the help!
> >>>
> >>>
> >>
> >
>


Re: Why are there different "parts" in my CSV?

2015-02-14 Thread Sean Owen
Keep in mind that if you repartition to 1 partition, you are only
using 1 task to write the output, and potentially only 1 task to
compute some parent RDDs. You lose parallelism.  The
files-in-a-directory output scheme is standard for Hadoop and for a
reason.

Therefore I would consider separating this concern and merging the
files afterwards if you need to.

On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das  wrote:
> Simplest way would be to merge the output files at the end of your job like:
>
> hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>
> If you want to do it pro grammatically, then you can use the
> FileUtil.copyMerge API
> . like:
>
> FileUtil.copyMerge(FileSystem of source(hdfs), /output-location, FileSystem
> of destination(hdfs), Path to the merged files /merged-ouput, true(to delete
> the original dir),null)
>
>
>
> Thanks
> Best Regards
>
> On Sat, Feb 14, 2015 at 2:18 AM, Su She  wrote:
>>
>> Thanks Akhil for the suggestion, it is now only giving me one part - .
>> Is there anyway I can just create a file rather than a directory? It doesn't
>> seem like there is just a saveAsTextFile option for JavaPairRecieverDstream.
>>
>> Also, for the copy/merge api, how would I add that to my spark job?
>>
>> Thanks Akhil!
>>
>> Best,
>>
>> Su
>>
>> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das 
>> wrote:
>>>
>>> For streaming application, for every batch it will create a new directory
>>> and puts the data in it. If you don't want to have multiple files inside the
>>> directory as part- then you can do a repartition before the saveAs*
>>> call.
>>>
>>>
>>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>>> String.class, (Class) TextOutputFormat.class);
>>>
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Fri, Feb 13, 2015 at 11:59 AM, Su She  wrote:

 Hello Everyone,

 I am writing simple word counts to hdfs using
 messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
 String.class, (Class) TextOutputFormat.class);

 1) However, each 2 seconds I getting a new directory that is titled as a
 csv. So i'll have test.csv, which will be a directory that has two files
 inside of it called part-0 and part 1 (something like that). This
 obv makes it very hard for me to read the data stored in the csv files. I 
 am
 wondering if there is a better way to store the JavaPairRecieverDStream and
 JavaPairDStream?

 2) I know there is a copy/merge hadoop api for merging files...can this
 be done inside java? I am not sure the logic behind this api if I am using
 spark streaming which is constantly making new files.

 Thanks a lot for the help!
>>>
>>>
>>
>




Re: Why are there different "parts" in my CSV?

2015-02-14 Thread Su She
http://stackoverflow.com/questions/23527941/how-to-write-to-csv-in-spark

Just read this...seems like it should be easily readable. Thanks!


On Sat, Feb 14, 2015 at 1:36 AM, Su She  wrote:

> Thanks Akhil for the link. Is there a reason why there is a new directory
> created for each batch? Is this a format that is easily readable by other
> applications such as hive/impala?
>
>
> On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das 
> wrote:
>
>> You can directly write to hbase with Spark. Here's and example for doing
>> that https://issues.apache.org/jira/browse/SPARK-944
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Feb 14, 2015 at 2:55 PM, Su She  wrote:
>>
>>> Hello Akhil, thank you for your continued help!
>>>
>>> 1) So, if I can write it in programitically after every batch, then
>>> technically I should be able to have just the csv files in one directory.
>>> However, can the /desired/output/file.txt be in hdfs? If it is only local,
>>> I am not sure if it will help me for my use case I describe in 2)
>>>
>>> so can i do something like this hadoop fs -getmerge /output/dir/on/hdfs
>>> desired/dir/in/hdfs ?
>>>
>>> 2) Just to make sure I am going on the right path...my end use case is
>>> to use hive or hbase to create a database off these csv files. Is there an
>>> easy way for hive to read /user/test/many sub directories/with one csv file
>>> in each into a table?
>>>
>>> Thank you!
>>>
>>>
>>> On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das 
>>> wrote:
>>>
 Simplest way would be to merge the output files at the end of your job
 like:

 hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

 If you want to do it programmatically, then you can use the
 FileUtil.copyMerge API, like:

 FileUtil.copyMerge(FileSystem of source(hdfs), /output-location,
 FileSystem of destination(hdfs), Path to the merged files /merged-ouput,
 true(to delete the original dir),null)



 Thanks
 Best Regards

 On Sat, Feb 14, 2015 at 2:18 AM, Su She  wrote:

> Thanks Akhil for the suggestion, it is now only giving me one part -
> . Is there anyway I can just create a file rather than a directory? It
> doesn't seem like there is just a saveAsTextFile option for
> JavaPairRecieverDstream.
>
> Also, for the copy/merge api, how would I add that to my spark job?
>
> Thanks Akhil!
>
> Best,
>
> Su
>
> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das <
> ak...@sigmoidanalytics.com> wrote:
>
>> For streaming application, for every batch it will create a new
>> directory and puts the data in it. If you don't want to have multiple 
>> files
>> inside the directory as part- then you can do a repartition before 
>> the
>> saveAs* call.
>>
>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>> String.class, (Class) TextOutputFormat.class);
>>
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Feb 13, 2015 at 11:59 AM, Su She 
>> wrote:
>>
>>> Hello Everyone,
>>>
>>> I am writing simple word counts to hdfs using
>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>>> String.class, (Class) TextOutputFormat.class);
>>>
>>> 1) However, each 2 seconds I getting a new *directory *that is
>>> titled as a csv. So i'll have test.csv, which will be a directory that 
>>> has
>>> two files inside of it called part-0 and part 1 (something like
>>> that). This obv makes it very hard for me to read the data stored in the
>>> csv files. I am wondering if there is a better way to store the
>>> JavaPairRecieverDStream and JavaPairDStream?
>>>
>>> 2) I know there is a copy/merge hadoop api for merging files...can
>>> this be done inside java? I am not sure the logic behind this api if I 
>>> am
>>> using spark streaming which is constantly making new files.
>>>
>>> Thanks a lot for the help!
>>>
>>
>>
>

>>>
>>
>


Re: Why are there different "parts" in my CSV?

2015-02-14 Thread Su She
Thanks Akhil for the link. Is there a reason why there is a new directory
created for each batch? Is this a format that is easily readable by other
applications such as hive/impala?


On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das 
wrote:

> You can directly write to hbase with Spark. Here's and example for doing
> that https://issues.apache.org/jira/browse/SPARK-944
>
> Thanks
> Best Regards
>
> On Sat, Feb 14, 2015 at 2:55 PM, Su She  wrote:
>
>> Hello Akhil, thank you for your continued help!
>>
>> 1) So, if I can write it in programitically after every batch, then
>> technically I should be able to have just the csv files in one directory.
>> However, can the /desired/output/file.txt be in hdfs? If it is only local,
>> I am not sure if it will help me for my use case I describe in 2)
>>
>> so can i do something like this hadoop fs -getmerge /output/dir/on/hdfs
>> desired/dir/in/hdfs ?
>>
>> 2) Just to make sure I am going on the right path...my end use case is to
>> use hive or hbase to create a database off these csv files. Is there an
>> easy way for hive to read /user/test/many sub directories/with one csv file
>> in each into a table?
>>
>> Thank you!
>>
>>
>> On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das 
>> wrote:
>>
>>> Simplest way would be to merge the output files at the end of your job
>>> like:
>>>
>>> hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>>>
>>> If you want to do it programmatically, then you can use the
>>> FileUtil.copyMerge API, like:
>>>
>>> FileUtil.copyMerge(FileSystem of source(hdfs), /output-location,
>>> FileSystem of destination(hdfs), Path to the merged files /merged-ouput,
>>> true(to delete the original dir),null)
>>>
>>>
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Sat, Feb 14, 2015 at 2:18 AM, Su She  wrote:
>>>
 Thanks Akhil for the suggestion, it is now only giving me one part -
 . Is there anyway I can just create a file rather than a directory? It
 doesn't seem like there is just a saveAsTextFile option for
 JavaPairRecieverDstream.

 Also, for the copy/merge api, how would I add that to my spark job?

 Thanks Akhil!

 Best,

 Su

 On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das >>> > wrote:

> For streaming application, for every batch it will create a new
> directory and puts the data in it. If you don't want to have multiple 
> files
> inside the directory as part- then you can do a repartition before the
> saveAs* call.
>
> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
> String.class, (Class) TextOutputFormat.class);
>
>
> Thanks
> Best Regards
>
> On Fri, Feb 13, 2015 at 11:59 AM, Su She 
> wrote:
>
>> Hello Everyone,
>>
>> I am writing simple word counts to hdfs using
>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>> String.class, (Class) TextOutputFormat.class);
>>
>> 1) However, each 2 seconds I getting a new *directory *that is
>> titled as a csv. So i'll have test.csv, which will be a directory that 
>> has
>> two files inside of it called part-0 and part 1 (something like
>> that). This obv makes it very hard for me to read the data stored in the
>> csv files. I am wondering if there is a better way to store the
>> JavaPairRecieverDStream and JavaPairDStream?
>>
>> 2) I know there is a copy/merge hadoop api for merging files...can
>> this be done inside java? I am not sure the logic behind this api if I am
>> using spark streaming which is constantly making new files.
>>
>> Thanks a lot for the help!
>>
>
>

>>>
>>
>


Re: Why are there different "parts" in my CSV?

2015-02-14 Thread Akhil Das
You can directly write to hbase with Spark. Here's an example for doing
that https://issues.apache.org/jira/browse/SPARK-944
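
For anyone looking for a concrete starting point, here is a rough sketch of one
common pattern for this (TableOutputFormat plus saveAsNewAPIHadoopDataset). It
assumes HBase 1.x client classes and a pre-created table named "wordcounts"
with a column family "cf" (both hypothetical names), and it takes a plain
(word, count) RDD; in a streaming job the same write would sit inside a
foreachRDD call. See the JIRA above for the original discussion.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class HBaseSink {
  // Writes an RDD of (word, count) pairs into an existing HBase table.
  public static void write(JavaPairRDD<String, Integer> counts) throws Exception {
    Configuration hbaseConf = HBaseConfiguration.create();
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "wordcounts"); // hypothetical table name
    Job job = Job.getInstance(hbaseConf);
    job.setOutputFormatClass(TableOutputFormat.class);

    counts
        .mapToPair(t -> {
          Put put = new Put(Bytes.toBytes(t._1())); // row key = the word
          // hypothetical column family "cf" and qualifier "count"
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"),
              Bytes.toBytes(t._2().toString()));
          return new Tuple2<>(new ImmutableBytesWritable(), put);
        })
        .saveAsNewAPIHadoopDataset(job.getConfiguration());
  }
}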

Thanks
Best Regards

On Sat, Feb 14, 2015 at 2:55 PM, Su She  wrote:

> Hello Akhil, thank you for your continued help!
>
> 1) So, if I can write it in programitically after every batch, then
> technically I should be able to have just the csv files in one directory.
> However, can the /desired/output/file.txt be in hdfs? If it is only local,
> I am not sure if it will help me for my use case I describe in 2)
>
> so can i do something like this hadoop fs -getmerge /output/dir/on/hdfs
> desired/dir/in/hdfs ?
>
> 2) Just to make sure I am going on the right path...my end use case is to
> use hive or hbase to create a database off these csv files. Is there an
> easy way for hive to read /user/test/many sub directories/with one csv file
> in each into a table?
>
> Thank you!
>
>
> On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das 
> wrote:
>
>> Simplest way would be to merge the output files at the end of your job
>> like:
>>
>> hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>>
>> If you want to do it programmatically, then you can use the
>> FileUtil.copyMerge API, like:
>>
>> FileUtil.copyMerge(FileSystem of source(hdfs), /output-location,
>> FileSystem of destination(hdfs), Path to the merged files /merged-ouput,
>> true(to delete the original dir),null)
>>
>>
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Feb 14, 2015 at 2:18 AM, Su She  wrote:
>>
>>> Thanks Akhil for the suggestion, it is now only giving me one part -
>>> . Is there anyway I can just create a file rather than a directory? It
>>> doesn't seem like there is just a saveAsTextFile option for
>>> JavaPairRecieverDstream.
>>>
>>> Also, for the copy/merge api, how would I add that to my spark job?
>>>
>>> Thanks Akhil!
>>>
>>> Best,
>>>
>>> Su
>>>
>>> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das 
>>> wrote:
>>>
 For streaming application, for every batch it will create a new
 directory and puts the data in it. If you don't want to have multiple files
 inside the directory as part- then you can do a repartition before the
 saveAs* call.

 messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
 String.class, (Class) TextOutputFormat.class);


 Thanks
 Best Regards

 On Fri, Feb 13, 2015 at 11:59 AM, Su She  wrote:

> Hello Everyone,
>
> I am writing simple word counts to hdfs using
> messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
> String.class, (Class) TextOutputFormat.class);
>
> 1) However, each 2 seconds I getting a new *directory *that is titled
> as a csv. So i'll have test.csv, which will be a directory that has two
> files inside of it called part-0 and part 1 (something like that).
> This obv makes it very hard for me to read the data stored in the csv
> files. I am wondering if there is a better way to store the
> JavaPairRecieverDStream and JavaPairDStream?
>
> 2) I know there is a copy/merge hadoop api for merging files...can
> this be done inside java? I am not sure the logic behind this api if I am
> using spark streaming which is constantly making new files.
>
> Thanks a lot for the help!
>


>>>
>>
>


Re: Why are there different "parts" in my CSV?

2015-02-14 Thread Su She
Hello Akhil, thank you for your continued help!

1) So, if I can write it programmatically after every batch, then
technically I should be able to have just the csv files in one directory.
However, can the /desired/output/file.txt be in hdfs? If it is only local,
I am not sure if it will help me with the use case I describe in 2)

so can i do something like this hadoop fs -getmerge /output/dir/on/hdfs
desired/dir/in/hdfs ?

2) Just to make sure I am going on the right path...my end use case is to
use hive or hbase to create a database off these csv files. Is there an
easy way for hive to read /user/test/many sub directories/with one csv file
in each into a table?

Thank you!


On Sat, Feb 14, 2015 at 12:39 AM, Akhil Das 
wrote:

> Simplest way would be to merge the output files at the end of your job
> like:
>
> hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>
> If you want to do it programmatically, then you can use the
> FileUtil.copyMerge API, like:
>
> FileUtil.copyMerge(FileSystem of source(hdfs), /output-location,
> FileSystem of destination(hdfs), Path to the merged files /merged-ouput,
> true(to delete the original dir),null)
>
>
>
> Thanks
> Best Regards
>
> On Sat, Feb 14, 2015 at 2:18 AM, Su She  wrote:
>
>> Thanks Akhil for the suggestion, it is now only giving me one part -
>> . Is there anyway I can just create a file rather than a directory? It
>> doesn't seem like there is just a saveAsTextFile option for
>> JavaPairRecieverDstream.
>>
>> Also, for the copy/merge api, how would I add that to my spark job?
>>
>> Thanks Akhil!
>>
>> Best,
>>
>> Su
>>
>> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das 
>> wrote:
>>
>>> For streaming application, for every batch it will create a new
>>> directory and puts the data in it. If you don't want to have multiple files
>>> inside the directory as part- then you can do a repartition before the
>>> saveAs* call.
>>>
>>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>>> String.class, (Class) TextOutputFormat.class);
>>>
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Fri, Feb 13, 2015 at 11:59 AM, Su She  wrote:
>>>
 Hello Everyone,

 I am writing simple word counts to hdfs using
 messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
 String.class, (Class) TextOutputFormat.class);

 1) However, each 2 seconds I getting a new *directory *that is titled
 as a csv. So i'll have test.csv, which will be a directory that has two
 files inside of it called part-0 and part 1 (something like that).
 This obv makes it very hard for me to read the data stored in the csv
 files. I am wondering if there is a better way to store the
 JavaPairRecieverDStream and JavaPairDStream?

 2) I know there is a copy/merge hadoop api for merging files...can this
 be done inside java? I am not sure the logic behind this api if I am using
 spark streaming which is constantly making new files.

 Thanks a lot for the help!

>>>
>>>
>>
>


Re: Why are there different "parts" in my CSV?

2015-02-14 Thread Akhil Das
Simplest way would be to merge the output files at the end of your job like:

hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

If you want to do it programmatically, then you can use the
FileUtil.copyMerge API, like:

FileUtil.copyMerge(FileSystem of source(hdfs), /output-location, FileSystem
of destination(hdfs), Path to the merged file /merged-output, true (to
delete the original dir), null)
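
For reference, here is a minimal, self-contained sketch of that call against the
Hadoop 2.x FileUtil API (copyMerge was dropped in Hadoop 3.x). The paths and the
single-HDFS setup are placeholder assumptions; in 2.x the signature also takes a
Configuration and an optional separator string as the last two arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Source and destination are the same HDFS instance here.
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical batch directory of part-* files and the single merged target file.
    Path srcDir  = new Path("/user/ec2-user/output-1423900000000.csv");
    Path dstFile = new Path("/user/ec2-user/merged/output.csv");

    // Signature: (srcFS, srcDir, dstFS, dstFile, deleteSource, conf, addString)
    // true -> delete the source directory once it has been merged
    // null -> no separator string inserted between the concatenated part files
    FileUtil.copyMerge(fs, srcDir, fs, dstFile, true, conf, null);
  }
}

Because both FileSystem arguments can point at HDFS, the merged destination can
itself live on HDFS, unlike hadoop fs -getmerge, which writes to a local path.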



Thanks
Best Regards

On Sat, Feb 14, 2015 at 2:18 AM, Su She  wrote:

> Thanks Akhil for the suggestion, it is now only giving me one part - .
> Is there anyway I can just create a file rather than a directory? It
> doesn't seem like there is just a saveAsTextFile option for
> JavaPairRecieverDstream.
>
> Also, for the copy/merge api, how would I add that to my spark job?
>
> Thanks Akhil!
>
> Best,
>
> Su
>
> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das 
> wrote:
>
>> For streaming application, for every batch it will create a new directory
>> and puts the data in it. If you don't want to have multiple files inside
>> the directory as part- then you can do a repartition before the saveAs*
>> call.
>>
>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>> String.class, (Class) TextOutputFormat.class);
>>
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Feb 13, 2015 at 11:59 AM, Su She  wrote:
>>
>>> Hello Everyone,
>>>
>>> I am writing simple word counts to hdfs using
>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>>> String.class, (Class) TextOutputFormat.class);
>>>
>>> 1) However, each 2 seconds I getting a new *directory *that is titled
>>> as a csv. So i'll have test.csv, which will be a directory that has two
>>> files inside of it called part-0 and part 1 (something like that).
>>> This obv makes it very hard for me to read the data stored in the csv
>>> files. I am wondering if there is a better way to store the
>>> JavaPairRecieverDStream and JavaPairDStream?
>>>
>>> 2) I know there is a copy/merge hadoop api for merging files...can this
>>> be done inside java? I am not sure the logic behind this api if I am using
>>> spark streaming which is constantly making new files.
>>>
>>> Thanks a lot for the help!
>>>
>>
>>
>


Re: Why are there different "parts" in my CSV?

2015-02-13 Thread Su She
Thanks Akhil for the suggestion, it is now only giving me one part - .
Is there any way I can just create a file rather than a directory? It
doesn't seem like there is just a saveAsTextFile option for
JavaPairRecieverDstream.

Also, for the copy/merge api, how would I add that to my spark job?

Thanks Akhil!

Best,

Su

On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das 
wrote:

> For streaming application, for every batch it will create a new directory
> and puts the data in it. If you don't want to have multiple files inside
> the directory as part- then you can do a repartition before the saveAs*
> call.
>
> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
> String.class, (Class) TextOutputFormat.class);
>
>
> Thanks
> Best Regards
>
> On Fri, Feb 13, 2015 at 11:59 AM, Su She  wrote:
>
>> Hello Everyone,
>>
>> I am writing simple word counts to hdfs using
>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>> String.class, (Class) TextOutputFormat.class);
>>
>> 1) However, each 2 seconds I getting a new *directory *that is titled as
>> a csv. So i'll have test.csv, which will be a directory that has two files
>> inside of it called part-0 and part 1 (something like that). This
>> obv makes it very hard for me to read the data stored in the csv files. I
>> am wondering if there is a better way to store the JavaPairRecieverDStream
>> and JavaPairDStream?
>>
>> 2) I know there is a copy/merge hadoop api for merging files...can this
>> be done inside java? I am not sure the logic behind this api if I am using
>> spark streaming which is constantly making new files.
>>
>> Thanks a lot for the help!
>>
>
>


Re: Why are there different "parts" in my CSV?

2015-02-12 Thread Akhil Das
For a streaming application, every batch will create a new directory
and put the data in it. If you don't want to have multiple part- files inside
the directory, then you can do a repartition before the saveAs*
call.

messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
String.class, (Class) TextOutputFormat.class);
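
To make the naming behaviour concrete, here is a minimal sketch of this kind of
job against the Spark 1.x Java API (Java 8 lambdas). The socket source, host,
port and HDFS prefix are placeholder assumptions standing in for the actual
receiver used in this thread; the point is that saveAsHadoopFiles writes one
directory per batch, named <prefix>-<batch time in ms>.<suffix>, containing one
part-NNNNN file per partition.

import java.util.Arrays;

import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("StreamingWordCount");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(2000));

    // Placeholder source: a socket stream standing in for the Kafka receiver.
    JavaPairDStream<String, Integer> counts = jssc
        .socketTextStream("localhost", 9999)
        .flatMap(line -> Arrays.asList(line.split(" ")))  // Spark 1.x flatMap expects an Iterable
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);

    // Every 2-second batch produces a directory such as
    //   hdfs://namenode:8020/user/ec2-user/wordcounts-1423900000000.csv/part-00000, ...
    // A repartition(1) before this call would collapse each batch to one part file,
    // at the cost of write parallelism.
    counts.saveAsHadoopFiles("hdfs://namenode:8020/user/ec2-user/wordcounts", "csv",
        String.class, Integer.class, (Class) TextOutputFormat.class);

    jssc.start();
    jssc.awaitTermination();
  }
}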


Thanks
Best Regards

On Fri, Feb 13, 2015 at 11:59 AM, Su She  wrote:

> Hello Everyone,
>
> I am writing simple word counts to hdfs using
> messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
> String.class, (Class) TextOutputFormat.class);
>
> 1) However, each 2 seconds I getting a new *directory *that is titled as
> a csv. So i'll have test.csv, which will be a directory that has two files
> inside of it called part-0 and part 1 (something like that). This
> obv makes it very hard for me to read the data stored in the csv files. I
> am wondering if there is a better way to store the JavaPairRecieverDStream
> and JavaPairDStream?
>
> 2) I know there is a copy/merge hadoop api for merging files...can this be
> done inside java? I am not sure the logic behind this api if I am using
> spark streaming which is constantly making new files.
>
> Thanks a lot for the help!
>


Why are there different "parts" in my CSV?

2015-02-12 Thread Su She
Hello Everyone,

I am writing simple word counts to hdfs using
messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
String.class, (Class) TextOutputFormat.class);

1) However, every 2 seconds I am getting a new *directory* that is titled as a
csv. So I'll have test.csv, which will be a directory that has two files
inside of it called part-0 and part 1 (something like that). This
obviously makes it very hard for me to read the data stored in the csv files. I
am wondering if there is a better way to store the JavaPairRecieverDStream
and JavaPairDStream?

2) I know there is a copy/merge hadoop api for merging files...can this be
done inside java? I am not sure of the logic behind this api if I am using
spark streaming, which is constantly making new files.

Thanks a lot for the help!