effect on data after topology change

2012-01-16 Thread rk vishu
Hello All,

If I change the rack ID for some nodes and restart the namenode, will data be
rearranged accordingly? Do I need to run the rebalancer?

Any information on this would be appreciated.

Thanks and Regards
Ravi


small files problem in hdfs

2012-01-16 Thread rk vishu
Hello All,

Could anyone give me some information on how Flume handles small files? If
Flume agents are set up for text log files, how will Flume ensure that there
are not many small files? I believe waiting for a fixed time before pumping
to HDFS may not guarantee block-sized files.

I am trying to write a client app to collect data into HDFS directly using the
Java APIs. I am sure I will come across this issue. Are there any utilities
or tricks to combine files in HDFS into larger files (without an MR job)?
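
For illustration, a minimal sketch of one way to do this with the plain HDFS
Java API and no MR job: walk a directory of small files and append their bytes
into one larger file. The class name and paths are hypothetical placeholders,
and this simply concatenates bytes, which suits newline-delimited text logs.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SmallFileCombiner {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/logs/hourly");             // directory of small files (example)
        Path combined = new Path("/logs/combined/part-0000"); // single larger output file (example)

        FSDataOutputStream out = fs.create(combined, true);
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDir()) {
                    continue; // skip subdirectories
                }
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    // Copy the small file's bytes onto the end of the combined file.
                    IOUtils.copyBytes(in, out, conf, false);
                } finally {
                    in.close();
                }
            }
        } finally {
            out.close();
        }
    }
}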

Any help will be greatly appreciated

-R


Re: effect on data after topology change

2012-01-17 Thread rk vishu
Thank you very much Todd. I hope future versions of the Hadoop rebalancer will
include this check.

I have one more question.

If we are in the process of setting up additional nodes incrementally in a
different rack (say rack-2), and rack-2's capacity is only 25% of rack-1's, how
would data be balanced (with the default implementation)?
That is, will Hadoop prefer balancing across all nodes, or will it try to obey
the topology first, which could fill up rack-2 quickly? I am fairly sure it
will try to balance across all nodes, but I want to be certain.

Thanks and Regards
Ravi
On Tue, Jan 17, 2012 at 10:41 AM, Todd Lipcon  wrote:

> Hi Ravi,
>
> You'll probably need to up the replication level of the affected files
> and then drop it back down to the desired level. Current versions of
> HDFS do not automatically repair rack policy violations if they're
> introduced in this manner.
>
> -Todd
>
> On Mon, Jan 16, 2012 at 3:53 PM, rk vishu  wrote:
> > Hello All,
> >
> > If i change the rackid for some nodes and restart namenode, will data be
> > rearranged accordingly? Do i need to run rebalancer?
> >
> > Any information on this would be appreciated.
> >
> > Thanks and Regards
> > Ravi
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
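
As a rough illustration of the workaround Todd describes, a hedged sketch using
the FileSystem Java API is below; the directory path, the extra replication
factor, and the sleep-based wait are all assumptions, and in practice one could
equally use hadoop fs -setrep from the command line.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RackPolicyRepair {
    public static void main(String[] args) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path affectedDir = new Path("/data/affected"); // hypothetical directory

        // Remember each file's original replication factor, then bump it by one
        // so the NameNode schedules new replicas under the updated rack topology.
        Map<Path, Short> original = new HashMap<Path, Short>();
        for (FileStatus status : fs.listStatus(affectedDir)) {
            if (status.isDir()) {
                continue;
            }
            original.put(status.getPath(), status.getReplication());
            fs.setReplication(status.getPath(), (short) (status.getReplication() + 1));
        }

        // Give the cluster time to create the extra replicas. In practice you
        // would monitor fsck or the NameNode UI rather than sleep blindly.
        Thread.sleep(10 * 60 * 1000L);

        // Restore the original replication so the surplus replicas are removed.
        for (Map.Entry<Path, Short> entry : original.entrySet()) {
            fs.setReplication(entry.getKey(), entry.getValue());
        }
    }
}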


Map Red SequenceFile output to Hive table

2012-01-26 Thread rk vishu
Hello All,

I have a mapred job that does a transformation and outputs to a compressed
SequenceFile (by using org.apache.hadoop.mapred.SequenceFileOutputFormat).

I am able to attach the output to an external Hive table (stored as
sequencefile). When I query it, the first column value from the file is ignored.
Is there a way to generate the MapReduce output as expected by Hive?
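
A likely cause, to the best of my understanding: Hive's SequenceFile reader
deserializes only the value of each record and ignores the key, so if the MR
job emits the first column as the key it never shows up in queries. A hedged
sketch of a reducer (old mapred API; the class name and field handling are
illustrative) that emits a NullWritable key and packs every column,
tab-delimited, into the value:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Emits NullWritable keys so the tab-delimited value carries every column;
// Hive reads only the value of each SequenceFile record, so nothing is lost.
public class HiveFriendlyReducer extends MapReduceBase
        implements Reducer<Text, Text, NullWritable, Text> {

    private final Text row = new Text();

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<NullWritable, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            // Fold the would-be key in as the first field of the value.
            row.set(key.toString() + "\t" + values.next().toString());
            output.collect(NullWritable.get(), row);
        }
    }
}

The job configuration would then set the output key class to NullWritable and
keep SequenceFileOutputFormat as before.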

Any tips on this are highly appreciated.


-R


Re: Map Red SequenceFile output to Hive table

2012-01-26 Thread rk vishu
I did specify the first column in the table creation.

On Thu, Jan 26, 2012 at 2:15 PM, Mapred Learn wrote:

> In your external table creation, do you specify the first column ?
>
> Sent from my iPhone
>
> On Jan 26, 2012, at 2:09 PM, rk vishu  wrote:
>
> > Hello All,
> >
> > I have a mapred job that does transfermation and outputs to a compresses
> > SequenceFile (by using org.apache.hadoop.mapred.SequenceFileOutputFormat)
> >
> > I am able to attach the output to a external hive table (stored as
> > sequncefile). When i query it ignores the first column value from the
> file.
> > Is there a way to generate the MAP Red out put as expected by HIVE?
> >
> > Any tips on this are highly appriciated.
> >
> >
> > -R
>


Re: Map Red SequenceFile output to Hive table

2012-01-26 Thread rk vishu
Something like below.

CREATE TABLE stg.my_tab(
col1 String,
col2 String,
col3 String
) row format delimited fields terminated by '\t' lines terminated by '\n'
  stored as sequencefile
  location '/xyz/mytable/';
LOAD DATA INPATH '/tmp/mymapredout/part-*' INTO TABLE stg.my_tab
;


On Thu, Jan 26, 2012 at 3:49 PM, Mapred Learn wrote:

> Can u share your create table command ?
>
> Sent from my iPhone
>
> On Jan 26, 2012, at 2:21 PM, rk vishu  wrote:
>
> > I did specify the first column in the table creation.
> >
> > On Thu, Jan 26, 2012 at 2:15 PM, Mapred Learn  >wrote:
> >
> >> In your external table creation, do you specify the first column ?
> >>
> >> Sent from my iPhone
> >>
> >> On Jan 26, 2012, at 2:09 PM, rk vishu  wrote:
> >>
> >>> Hello All,
> >>>
> >>> I have a mapred job that does transfermation and outputs to a
> compresses
> >>> SequenceFile (by using
> org.apache.hadoop.mapred.SequenceFileOutputFormat)
> >>>
> >>> I am able to attach the output to a external hive table (stored as
> >>> sequncefile). When i query it ignores the first column value from the
> >> file.
> >>> Is there a way to generate the MAP Red out put as expected by HIVE?
> >>>
> >>> Any tips on this are highly appriciated.
> >>>
> >>>
> >>> -R
> >>
>


Brisk vs Cloudera Distribution

2012-02-07 Thread rk vishu
Hello All,

Could anyone help me understand the pros and cons of Brisk vs Cloudera Hadoop
(HDFS + HBase) in terms of functionality and performance?
I'd like to set aside the single point of failure (NN) issue while comparing.
Are there any big clusters in the petabyte range using Brisk in production? How
does CFS performance compare to HDFS? How is the Hive integration?

Thanks and Regards
RK


Re: Brisk vs Cloudera Distribution

2012-02-08 Thread rk vishu
Thank you for the information.

On Wed, Feb 8, 2012 at 8:57 PM, Edward Capriolo wrote:

> Hadoop can work on a number of filesystems: HDFS, S3, local files. Brisk's
> file system is known as CFS. CFS stores all block and metadata in
> Cassandra, thus it does not use a NameNode. Brisk fires up a JobTracker
> automatically as well. Brisk also has a Hive metastore backed by Cassandra,
> so that takes away that SPOF.
>
> Brisk snappy-compresses all data, so you may not need to use compression or
> sequence files. Performance-wise I have gotten comparable numbers with
> terasort and teragen. But the systems work vastly differently and likely
> scale differently.
>
> The Hive integration is solid. Not sure what the biggest cluster is, or
> about making other vague performance claims. Brisk is not active anymore;
> the commercial product is DSE. There is a GitHub fork of Brisk, however.
>
>
> On Wednesday, February 8, 2012, rk vishu  wrote:
> > Hello All,
> >
> > Could any one help me understand pros and cons of Brisk vs Cloudera
> Hadoop
> > (DHFS + HBASE) in terms of functionality and performance?
> > Wanted to keep aside the single point of failure (NN) issue while
> comparing?
> > Are there any big clusters in petabytes using brisk in production? How is
> > the performance comparision CFS vs HDFS? How is Hive integration?
> >
> > Thanks and Regrds
> > RK
> >
>


better partitioning strategy in hive

2012-02-18 Thread rk vishu
Hello All,

We have a Hive table partitioned by date and hour (330 columns). We have 5
years' worth of data in the table. Each hourly partition is around 800MB,
so 43,800 partitions in total, with one file per partition.

When we run select count(*) from the table, Hive takes forever to submit the
job. I waited for 20 minutes and killed it. If I run it over just one month it
takes a little while to submit the job, but at least Hive is able to get the
work done.

Questions:
1) First of all, why is Hive not able to even submit the job? Is it taking
forever to query the list of partitions from the metastore? Getting 43K
records should not be a big deal at all.
2) So in order to improve my situation, what are my options? I can think of
changing the partitioning strategy to daily partitions instead of hourly.
What should be the ideal partitioning strategy?
3) If we have one partition per day and 24 files under it (i.e. fewer
partitions but the same number of files), will it improve anything, or will I
have the same issue?
4) Are there any special input formats or tricks to handle this?
5) When I tried to insert into a different table by selecting a whole day's
data, Hive generated 164 mappers with map-only jobs, hence creating many
output files. How can I force Hive to create one output file instead of many?
Setting mapred.reduce.tasks=1 does not even generate reduce tasks. What can I
do to achieve this? (A hedged sketch of two options follows below.)
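
A hedged sketch of two common options for item 5; the merge properties are
standard Hive settings whose defaults vary by version, and the table names,
columns, and dates below are placeholders.

-- Option A: have Hive merge the many small files produced by map-only jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Rough target size of each merged file, in bytes.
SET hive.merge.size.per.task=256000000;

-- Option B: force a reduce stage so that a single reducer writes a single file.
-- DISTRIBUTE BY introduces a shuffle, which lets mapred.reduce.tasks=1 take effect.
SET mapred.reduce.tasks=1;
INSERT OVERWRITE TABLE daily_tab PARTITION (dt='2012-02-18')
SELECT col1, col2, col3
FROM hourly_tab
WHERE dt='2012-02-18'
DISTRIBUTE BY dt;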


-RK

