Re: What's the best way to do Outer join and Inner join of two SequentialTextFiles using Hadoop streaming and Python?

2016-01-23 Thread Rex X
Googled, but did not find any sample code. On Fri, Jan 22, 2016 at 9:50 AM, Rex X wrote: > The two SequentialTextFiles correspond to two Hive tables, say tableA and > tableB below on > > hdfs://hive/tableA//MM/DD/*/part-0 > and > hdfs://hive/tableB//

What's the best way to do Outer join and Inner join of two SequentialTextFiles using Hadoop streaming and Python?

2016-01-22 Thread Rex X
The two SequentialTextFiles correspond to two Hive tables, say tableA and tableB below on hdfs://hive/tableA//MM/DD/*/part-0 and hdfs://hive/tableB//MM/DD/*/part-0 Both of them are partitioned by date, for example, hdfs://hive/tableA/2016/01/01/*/part-0 Now we wa
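
For readers looking for sample code: one common way to approach this kind of join with Hadoop streaming is a reduce-side join. The sketch below is only an illustration under stated assumptions, not the method used in this thread. It assumes a mapper has already emitted tab-separated lines of the form `key<TAB>tag<TAB>value`, where the tag is `A` for tableA rows and `B` for tableB rows, and that streaming delivers those lines to the reducer sorted by key. The function name `join_sorted`, the tags, and the `\N` placeholder for missing sides are all hypothetical choices.

```python
from itertools import groupby, product


def parse(line):
    # Assumed record layout: key <TAB> tag <TAB> value
    key, tag, value = line.rstrip("\n").split("\t", 2)
    return key, tag, value


def join_sorted(lines, how="inner", missing="\\N"):
    """Join key-sorted, tagged lines; `how` is "inner" or "outer"."""
    records = (parse(line) for line in lines if line.strip())
    for key, group in groupby(records, key=lambda r: r[0]):
        left, right = [], []
        for _, tag, value in group:
            # Tag "A" marks tableA rows; anything else is treated as tableB.
            (left if tag == "A" else right).append(value)
        if how == "outer":
            # Full outer join: pad the side that has no rows for this key.
            left = left or [missing]
            right = right or [missing]
        # With an empty side, product() yields nothing (inner-join behaviour).
        for a, b in product(left, right):
            yield "\t".join((key, a, b))
```

Used as a streaming reducer, a small wrapper would iterate over `join_sorted(sys.stdin)` and print each row. Buffering both sides per key keeps the sketch short, but it holds all values for one key in memory, which may not suit heavily skewed keys.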

Re: What is the best way to locate the offset and length of all fields in a Hadoop sequential text file?

2016-01-22 Thread Rex X
have any information about your data. > > I don't think we can help you with this. Also, I cannot understand what > you are trying to achieve. Please also tell us why you are using hadoop > streaming instead of hive to do your operations. > > Regards, > LLoyd > > O

What is the best way to locate the offset and length of all fields in a Hadoop sequential text file?

2016-01-21 Thread Rex X
The given sequential files correspond to an external Hive table. They are stored in /tableName/part-0 /tableName/part-1 ... There are about 2000 attributes in the table. Now I want to process the data using Hadoop streaming and MapReduce. The first step is to find the offset and length fo

Re: Hadoop Streaming: How to partition output into subfolders?

2016-01-21 Thread Rex X
Hi Camusensei, Thank you. That's very helpful! Rex On Thu, Jan 21, 2016 at 1:41 AM, Namikaze Minato wrote: > Hi Rex X, > > We are using the -outputFormat option of hadoop-streaming. > Here is the detail: http://www.infoq.com/articles/HadoopOutputFormat > > Regards,

Re: Hadoop Streaming: How to partition output into subfolders?

2016-01-20 Thread Rex X
> . > > Regards > Rohit Sarewar > > > On Thu, Jan 21, 2016 at 5:13 AM, Rex X wrote: > >> Dear all, >> >> To be specific, for example, given >> >> hadoop jar hadoop-streaming.jar \ >> -input myInputDirs \ >> -output

Hadoop Streaming: How to partition output into subfolders?

2016-01-20 Thread Rex X
Dear all, To be specific, for example, given hadoop jar hadoop-streaming.jar \ -input myInputDirs \ -output myOutputDir \ -mapper /bin/cat \ -reducer /usr/bin/wc Where myInputDirs has a *dated* subfolder structure of /input_dir//mm/dd/part-* I want myOutp
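
As a rough illustration of the mapper side of this question, the sketch below pulls a date prefix out of an input file path and puts it in front of each line as the key. It is a sketch under assumptions, not the approach from the replies: the path layout (a four-digit year followed by two-digit month and day directories) and the helper names `date_prefix` and `tag_line` are hypothetical. In a streaming job the input path is typically exposed to the mapper through an environment variable such as `mapreduce_map_input_file` (the exact name depends on the Hadoop version, so treat it as an assumption too). Routing each key to its own subfolder would still need a custom output format passed via `-outputFormat`, as the replies in this thread suggest; that part is not shown.

```python
import re

# Assumed layout: .../<4-digit year>/<2-digit month>/<2-digit day>/part-*
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def date_prefix(path):
    """Return 'yyyy/mm/dd' found in the path, or None if there is no match."""
    match = DATE_IN_PATH.search(path)
    return "/".join(match.groups()) if match else None


def tag_line(line, path):
    """Prefix a line with the date taken from its input path as the key."""
    prefix = date_prefix(path) or "unknown"
    return prefix + "\t" + line.rstrip("\n")
```

A mapper script could read the path once with `os.environ.get("mapreduce_map_input_file", "")` and then print `tag_line(line, path)` for every line on standard input.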