Re: Why are there different "parts" in my CSV?
Okay, got it, thanks for the help Sean!
Re: Why are there different "parts" in my CSV?
No, they appear as directories + files to everything. Lots of tools are used to taking an input that is a directory of part files though. You can certainly point MR, Hive, etc. at a directory of these files.
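A minimal sketch of what this means in practice, reading the output back with Spark itself (the paths are illustrative, following the foo-100x.csv naming used in this thread):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadPartFiles {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("read-part-files"));

        // Pointing textFile at the foo-1001.csv *directory* reads every
        // part file inside it as one logical dataset.
        JavaRDD<String> oneBatch =
            sc.textFile("hdfs:///user/ec2-user/foo-1001.csv");

        // A glob collapses all the batch directories into a single RDD.
        JavaRDD<String> allBatches =
            sc.textFile("hdfs:///user/ec2-user/foo-*.csv");

        System.out.println(oneBatch.count() + " / " + allBatches.count());
        sc.stop();
    }
}

Hive behaves the same way: an external table whose LOCATION is such a directory will read all the part files in it.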
Re: Why are there different "parts" in my CSV?
Thanks Sean and Akhil! I will take out the repartition(1). Please let me know if I understood this correctly, Spark Streaming writes data like this:

foo-1001.csv/part-x, part-x
foo-1002.csv/part-x, part-x

When I see this on Hue, the CSVs appear to me as *directories*, but if I understand correctly, they will appear as CSV *files* to other Hadoop ecosystem tools? And, if I understand Tathagata's answer correctly, other Hadoop-based ecosystem tools, such as Hive, will be able to create a table based off the multiple foo-10x.csv "directories"?

Thank you, I really appreciate the help!
Re: Why are there different "parts" in my CSV?
Keep in mind that if you repartition to 1 partition, you are only using 1 task to write the output, and potentially only 1 task to compute some parent RDDs. You lose parallelism. The files-in-a-directory output scheme is standard for Hadoop, and for a reason.

Therefore I would consider separating this concern and merging the files afterwards if you need to.
Re: Why are there different "parts" in my CSV?
http://stackoverflow.com/questions/23527941/how-to-write-to-csv-in-spark

Just read this... seems like it should be easily readable. Thanks!
Re: Why are there different "parts" in my CSV?
Thanks Akhil for the link. Is there a reason why there is a new directory created for each batch? Is this a format that is easily readable by other applications such as Hive/Impala?
Re: Why are there different "parts" in my CSV?
You can directly write to HBase with Spark. Here's an example for doing that: https://issues.apache.org/jira/browse/SPARK-944

Thanks
Best Regards
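A rough sketch of that pattern (not the exact code from the JIRA above): writing a batch of (word, count) pairs through HBase's TableOutputFormat. The table name "wordcounts" and column family "cf" are made up for illustration, and addColumn is the HBase 1.x method name (older releases call it add).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class HBaseSink {
    // Could be called from foreachRDD on each micro-batch.
    public static void save(JavaPairRDD<String, Integer> counts) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableOutputFormat.OUTPUT_TABLE, "wordcounts"); // hypothetical table

        Job job = Job.getInstance(conf);
        job.setOutputFormatClass(TableOutputFormat.class);

        // Turn each (word, count) into an HBase Put keyed by the word.
        JavaPairRDD<ImmutableBytesWritable, Put> puts = counts.mapToPair(t -> {
            Put put = new Put(Bytes.toBytes(t._1()));                  // row key = word
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"), // cf:count = count
                Bytes.toBytes(String.valueOf(t._2())));
            return new Tuple2<>(new ImmutableBytesWritable(), put);
        });

        puts.saveAsNewAPIHadoopDataset(job.getConfiguration());
    }
}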
Re: Why are there different "parts" in my CSV?
Hello Akhil, thank you for your continued help!

1) So, if I can write it programmatically after every batch, then technically I should be able to have just the csv files in one directory. However, can the /desired/output/file.txt be in hdfs? If it is only local, I am not sure if it will help me for my use case I describe in 2)

so can I do something like this: hadoop fs -getmerge /output/dir/on/hdfs /desired/dir/in/hdfs ?

2) Just to make sure I am going on the right path... my end use case is to use Hive or HBase to create a database off these csv files. Is there an easy way for Hive to read /user/test/many sub directories/with one csv file in each into a table?

Thank you!
Re: Why are there different "parts" in my CSV?
Simplest way would be to merge the output files at the end of your job, like:

hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

If you want to do it programmatically, then you can use the FileUtil.copyMerge API, like:

FileUtil.copyMerge(FileSystem of source (hdfs), /output-location, FileSystem of destination (hdfs), path to the merged file /merged-output, true (to delete the original dir), null)

Thanks
Best Regards
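Filled out as compilable Java, that copyMerge call looks roughly like this (paths are illustrative; copyMerge exists in the Hadoop 1.x/2.x FileUtil API and was later removed in Hadoop 3):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // source and destination both on HDFS here

        // Concatenates every part file under /output-location into the single
        // file /merged-output. The 'true' flag deletes the source directory
        // afterwards; the trailing null is an optional string to append
        // between files.
        FileUtil.copyMerge(fs, new Path("/output-location"),
                           fs, new Path("/merged-output"),
                           true, conf, null);
    }
}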
Re: Why are there different "parts" in my CSV?
Thanks Akhil for the suggestion, it is now only giving me one part file. Is there any way I can just create a file rather than a directory? It doesn't seem like there is a saveAsTextFile option for JavaPairReceiverDStream.

Also, for the copy/merge API, how would I add that to my Spark job?

Thanks Akhil!

Best,

Su
Re: Why are there different "parts" in my CSV?
For a streaming application, it will create a new directory for every batch and put the data in it. If you don't want to have multiple part files inside the directory, then you can do a repartition before the saveAs* call:

messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class, String.class, (Class) TextOutputFormat.class);

Thanks
Best Regards
Why are there different "parts" in my CSV?
Hello Everyone,

I am writing simple word counts to hdfs using

messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class, String.class, (Class) TextOutputFormat.class);

1) However, every 2 seconds I am getting a new *directory* that is titled as a csv. So I'll have test.csv, which will be a directory that has two files inside of it called part-0 and part-1 (something like that). This obviously makes it very hard for me to read the data stored in the csv files. I am wondering if there is a better way to store the JavaPairReceiverDStream and JavaPairDStream?

2) I know there is a copy/merge hadoop API for merging files... can this be done inside Java? I am not sure of the logic behind this API if I am using Spark Streaming, which is constantly making new files.

Thanks a lot for the help!