Re: Teradata into hadoop Migration

2016-08-05 Thread Arun Natva
Bhagaban,
The first step is to ingest the data into Hadoop using Sqoop.
Teradata has powerful connectors to Hadoop; the connectors are installed on all
data nodes, and imports can then run using FastExport and similar utilities.
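
For illustration, here is a minimal Java sketch of kicking off a plain JDBC-based
Sqoop import from Teradata (the host, database, table, paths, and credentials
below are made-up placeholders; the dedicated Teradata connectors are configured
similarly but push the extraction down to FastExport). It is equivalent to
passing the same arguments to the sqoop command-line client:

import org.apache.sqoop.Sqoop;

public class TeradataImport {
  public static void main(String[] args) {
    // Generic JDBC import via the Teradata JDBC driver.
    String[] sqoopArgs = new String[] {
        "import",
        "--connect", "jdbc:teradata://td-host/DATABASE=sales_db",   // placeholder host/database
        "--driver", "com.teradata.jdbc.TeraDriver",
        "--username", "etl_user",                                   // placeholder credentials
        "--password-file", "/user/etl/td.password",
        "--table", "ORDERS",                                        // placeholder table
        "--target-dir", "/data/raw/orders",                         // placeholder HDFS directory
        "--num-mappers", "8"
    };
    int exitCode = Sqoop.runTool(sqoopArgs);
    System.exit(exitCode);
  }
}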

The challenge will be recreating in Hadoop the same workflows you had in
Teradata.

Teradata is rich in features compared to Hive & Impala.

Data in Teradata is usually encrypted, so please make sure you have HDFS
encryption at rest enabled.

You can use Oozie to create a chain of SQL steps that mimic your ETL jobs written
in DataStage, Informatica, or TD itself.
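
For illustration, here is a minimal sketch of submitting such a workflow through
the Oozie Java client (the Oozie URL, HDFS paths, and property names are made-up
placeholders; the workflow.xml at that path would contain the chained Hive/SQL
actions):

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitEtlWorkflow {
  public static void main(String[] args) throws Exception {
    // Point the client at the Oozie server (placeholder URL).
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

    // Job properties; the workflow.xml at this HDFS path defines the chain of
    // Hive/SQL actions that mimic the old ETL flow.
    Properties props = oozie.createConfiguration();
    props.setProperty(OozieClient.APP_PATH, "hdfs://nameservice1/user/etl/workflows/td_migration");
    props.setProperty("nameNode", "hdfs://nameservice1");   // referenced by the workflow definition
    props.setProperty("queueName", "default");

    String jobId = oozie.run(props);
    System.out.println("Submitted workflow " + jobId);

    // Poll until the workflow leaves the RUNNING state.
    while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
      Thread.sleep(10 * 1000L);
    }
    System.out.println("Final status: " + oozie.getJobInfo(jobId).getStatus());
  }
}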

Please note that TD may still perform better than Hadoop, since its proprietary
hardware and software are highly optimized; Hadoop, however, can save you money.


Sent from my iPhone

> On Aug 5, 2016, at 12:02 PM, praveenesh kumar  wrote:
> 
> From a TD perspective, have a look at this: https://youtu.be/NTTQdAfZMJA They 
> are planning to open-source it. Perhaps you can get in touch with the team. 
> Let me know if you are interested. If you have TD contacts, ask them about this; 
> they should be able to point you to the right people.
> 
> Again, this is not a sales pitch. This tool looks like what you are looking for 
> and will be open source soon. Let me know if you want to get in touch with 
> the folks who are working on this. 
> 
> Regards
> Prav
> 
>> On Fri, Aug 5, 2016 at 4:29 PM, Wei-Chiu Chuang  wrote:
>> Hi,
>> 
>> I think Cloudera Navigator Optimizer is the tool you are looking for. It 
>> allows you to transform your TD SQL queries into Impala and Hive.
>> http://blog.cloudera.com/blog/2015/11/introducing-cloudera-navigator-optimizer-for-optimal-sql-workload-efficiency-on-apache-hadoop/
>> Hope this doesn’t sound like a sales pitch. If you’re a Cloudera paid 
>> customer you should reach out to the account/support team for more 
>> information.
>> 
>> *disclaimer: I work for Cloudera
>> 
>> Wei-Chiu Chuang
>> A very happy Clouderan
>> 
>>> On Aug 4, 2016, at 10:50 PM, Rakesh Radhakrishnan  
>>> wrote:
>>> 
>>> Sorry, I don't have much insight into this beyond basic Sqoop. AFAIK, 
>>> it is more vendor-specific; you may need to dig further in that direction.
>>> 
>>> Thanks,
>>> Rakesh
>>> 
 On Mon, Aug 1, 2016 at 11:38 PM, Bhagaban Khatai 
  wrote:
 Thanks Rakesh for the useful information. We are already using Sqoop for data 
 transfer, and we are implementing all the TD logic through Hive.
 But it's taking time, because we are re-implementing the same logic from the 
 mappings provided by the TD team.
 
 What I want is a tool or ready-made framework so that the development effort 
 would be less.
 
 Thanks in advance for your help.
 
 Bhagaban 
 
> On Mon, Aug 1, 2016 at 6:07 PM, Rakesh Radhakrishnan  
> wrote:
> Hi Bhagaban,
> 
> Perhaps you can try "Apache Sqoop" to transfer data to Hadoop from 
> Teradata. Apache Sqoop provides an efficient approach for transferring 
> large volumes of data between Hadoop-related systems and structured data 
> stores. It allows a data store to be added as a so-called connector, and it 
> can connect to various databases, including Oracle.
> 
> I hope the below links will be helpful to you,
> http://sqoop.apache.org/
> http://blog.cloudera.com/blog/2012/01/cloudera-connector-for-teradata-1-0-0/
> http://hortonworks.com/blog/round-trip-data-enrichment-teradata-hadoop/
> http://dataconomy.com/wp-content/uploads/2014/06/Syncsort-A-123ApproachtoTeradataOffloadwithHadoop.pdf
> 
> Below are a few data-ingestion tools you could dig into further:
> https://www.datatorrent.com/product/datatorrent-ingestion/
> https://www.datatorrent.com/dtingest-unified-streaming-batch-data-ingestion-hadoop/
> 
> Thanks,
> Rakesh
> 
>> On Mon, Aug 1, 2016 at 4:54 PM, Bhagaban Khatai 
>>  wrote:
>> Hi Guys-
>> 
>> I need quick help if anybody has done a migration project from TD into 
>> Hadoop.
>> We have a very tight deadline, and I am trying to find a tool (online or 
>> paid) to speed up development.
>> 
>> Please help us here, and guide me if any other way is available to speed up 
>> the development.
>> 
>> Bhagaban
> 


Re: CCA-500 exam tips

2016-06-23 Thread Arun Natva
I pressed the send button by mistake before.
I cleared CCDH and CCAH a couple of years ago.

I can suggest that you go ahead and write code to do the following:
Sqoop imports/exports
Hive queries that use partitioning, maps, and other complex data types
Pig scripts that do the same processing you did with Hive
Avro and Parquet file formats
Simple Spark code using Scala, Python, or Java

Sent from my iPhone

> On Jun 23, 2016, at 9:19 AM, Nagalingam, Karthikeyan 
>  wrote:
> 
> Hi,
>  
> I am planning to take the CCA-500 (Cloudera Certified Administrator for Apache 
> Hadoop) exam. Can you throw some light on exam tips and questions to help pass 
> the exam?
>  
> Regards,
> Karthikeyan Nagalingam,
> Technical Marketing Engineer ( Big Data Analytics)
> Mobile: 919-376-6422
>  


Re: CCA-500 exam tips

2016-06-23 Thread Arun Natva
I have passed Cc

Sent from my iPhone

> On Jun 23, 2016, at 9:19 AM, Nagalingam, Karthikeyan 
>  wrote:
> 
> Hi,
>  
> I am planning to take the CCA-500 (Cloudera Certified Administrator for Apache 
> Hadoop) exam. Can you throw some light on exam tips and questions to help pass 
> the exam?
>  
> Regards,
> Karthikeyan Nagalingam,
> Technical Marketing Engineer ( Big Data Analytics)
> Mobile: 919-376-6422
>  


Re: How to share files amongst multiple jobs using Distributed Cache in Hadoop 2.7.2

2016-06-07 Thread Arun Natva
If you use an instance of the Job class (the new MapReduce API), you can add 
files to the distributed cache like this:

Job job = Job.getInstance(conf);
job.addCacheFile(new URI(filepath));
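
In a later job, the cached file can then be read back in the mapper's setup()
method. A minimal sketch using the new mapreduce API (the class and variable
names are illustrative, and it assumes a tab-separated lookup file was cached):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // URIs that were registered earlier via job.addCacheFile(...)
    URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles != null && cacheFiles.length > 0) {
      FileSystem fs = FileSystem.get(context.getConfiguration());
      BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(new Path(cacheFiles[0]))));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split("\t", 2);   // assumes tab-separated key/value lines
          if (parts.length == 2) {
            lookup.put(parts[0], parts[1]);
          }
        }
      } finally {
        reader.close();
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Illustrative only: enrich each record with the cached lookup value.
    String enriched = lookup.get(value.toString());
    context.write(new Text(value), new Text(enriched == null ? "" : enriched));
  }
}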


Sent from my iPhone

> On Jun 7, 2016, at 5:17 AM, Siddharth Dawar  
> wrote:
> 
> Hi,
> 
> I wrote a program which creates Map-Reduce jobs in an iterative fashion as 
> follows:
> 
> 
> while (true) 
> {
> JobConf conf2  = new JobConf(getConf(),graphMining.class);
> 
> conf2.setJobName("sid");
> conf2.setMapperClass(mapperMiner.class);
> conf2.setReducerClass(reducerMiner.class);
> 
> conf2.setInputFormat(SequenceFileInputFormat.class);
> conf2.setOutputFormat(SequenceFileOutputFormat.class);
> conf2.setOutputValueClass(BytesWritable.class);
> 
> conf2.setMapOutputKeyClass(Text.class);
> conf2.setMapOutputValueClass(MapWritable.class);
> conf2.setOutputKeyClass(Text.class);
> 
> conf2.setNumMapTasks(Integer.parseInt(args[3]));
> conf2.setNumReduceTasks(Integer.parseInt(args[4]));
> FileInputFormat.addInputPath(conf2, new Path(input));
> FileOutputFormat.setOutputPath(conf2, new Path(output));
> RunningJob job = JobClient.runJob(conf2);
> }
> 
> Now, I want the first Job which gets created to write something in the 
> distributed cache and the jobs which get created after the first job to read 
> from the distributed cache. 
> 
> I came to know that the DistributedCache.addCacheFile() method is 
> deprecated, and the documentation suggests using the Job.addCacheFile() 
> method specific to each job.
> 
> But I am unable to get a handle on the currently running job, as 
> JobClient.runJob(conf2) submits the job internally.
> 
> 
> How can I make the content written by the first job in this while loop 
> available, via the distributed cache, to the jobs that get created in later 
> iterations of the loop? 
> 


Re: Performance Benchmarks on "Number of Machines"

2016-05-27 Thread Arun Natva
Deepak,
I believe Yahoo and Facebook have the largest clusters, over 4-5 thousand nodes 
in size.
If you add a new server to the cluster, you are simply adding to the CPU, 
memory, and disk space of the cluster, so capacity grows roughly linearly as you 
add nodes, except that network bandwidth is shared.

I didn't understand your last question on scaling... 


Sent from my iPhone

> On May 27, 2016, at 11:51 AM, Deepak Goel  wrote:
> 
> 
> Hey
> 
> Namaskara~Nalama~Guten Tag~Bonjour
> 
> Are there any performance benchmarks as to how many machines Hadoop can scale 
> up to? Is the growth linear (for 1 machine - x growth, for 2 machines - 2x 
> growth, for N machines - Nx growth)?
> 
> Also does the scaling depend on the type of jobs and amount of data? Or is it 
> independent?
> 
> Thank You
> Deepak
>-- 
> Keigu
> 
> Deepak
> 73500 12833
> www.simtree.net, dee...@simtree.net
> deic...@gmail.com
> 
> LinkedIn: www.linkedin.com/in/deicool
> Skype: thumsupdeicool
> Google talk: deicool
> Blog: http://loveandfearless.wordpress.com
> Facebook: http://www.facebook.com/deicool
> 
> "Contribute to the world, environment and more : http://www.gridrepublic.org
> "


Re: Reliability of Hadoop

2016-05-27 Thread Arun Natva
Deepak,
I have managed clusters where worker nodes crashed and disks failed.
HDFS takes care of re-replicating the data unless you lose so many nodes that 
there is not enough space left to fit the replicas.



Sent from my iPhone

> On May 27, 2016, at 11:54 AM, Deepak Goel  wrote:
> 
> 
> Hey
> 
> Namaskara~Nalama~Guten Tag~Bonjour
> 
> We are yet to see any server go down among our cluster nodes in the production 
> environment. Has anyone seen reliability problems in their production 
> environment? How often?
> 
> Thanks
> Deepak
>-- 
> Keigu
> 
> Deepak
> 73500 12833
> www.simtree.net, dee...@simtree.net
> deic...@gmail.com
> 
> LinkedIn: www.linkedin.com/in/deicool
> Skype: thumsupdeicool
> Google talk: deicool
> Blog: http://loveandfearless.wordpress.com
> Facebook: http://www.facebook.com/deicool
> 
> "Contribute to the world, environment and more : http://www.gridrepublic.org
> "


Re: Unable to start Daemons

2016-05-11 Thread Arun Natva
Add the JAVA_HOME export statement inside hadoop-env.sh and retry.

Sent from my iPhone

> On May 11, 2016, at 3:57 AM, Anand Murali  
> wrote:
> 
> Dear All:
> 
> Please advise after viewing carefully below
> 
> 1. Altered .profile after ssh to localhost and included 
> export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_91/
> 
> 2. cd'd into etc/hadoop and added the following entries in hadoop-env.sh
> 
>  export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_91
>   export HADOOP_INSTALL=/home/anand/hadoop-2.6.0
>   export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
> 
> 3. $. hadoop-env.sh
> 
> 4. hadoop version
> 
> anand@anand-Latitude-E5540:~/hadoop-2.6.0/etc/hadoop$ hadoop version
> Hadoop 2.6.0
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
> Compiled by jenkins on 2014-11-13T21:10Z
> Compiled with protoc 2.5.0
> From source with checksum 18e43357c8f927c0695f1e9522859d6a
> This command was run using 
> /home/anand/hadoop-2.6.0/share/hadoop/common/hadoop-common-2.6.0.jar
> 
> 5. anand@anand-Latitude-E5540:~/hadoop-2.6.0/sbin$ start-dfs.sh --config 
> /home/anand/hadoop-2.6.0/sbin
> Starting namenodes on [localhost]
> localhost: Error: JAVA_HOME is not set and could not be found.
> cat: /home/anand/hadoop-2.6.0/sbin/slaves: No such file or directory
> Starting secondary namenodes [0.0.0.0]
> 0.0.0.0: Error: JAVA_HOME is not set and could not be found.
> 
> 
> I have installed the latest JDK as above, removed all old versions of the JDK, 
> and gone through all the update-alternatives steps, but I still get this error. 
> I shall be thankful if somebody can help.
> 
> Regards
>  
> Anand Murali  
> 11/7, 'Anand Vihar', Kandasamy St, Mylapore
> Chennai - 600 004, India
> Ph: (044)- 28474593/ 43526162 (voicemail)


HDFS file format identification

2016-05-09 Thread Arun Natva
Hi,
Right now the HDFS Java API class FileStatus doesn't have the file format as 
one of its attributes.
Can this be added as an additional field in FileStatus?
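
In the meantime, the workaround I'm aware of is to sniff the leading magic bytes
of the file yourself. A rough sketch (it only covers a few common formats, and
the class name and returned strings are illustrative):

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FormatSniffer {
  // Guess a file's format from its first few bytes; FileStatus itself only
  // carries metadata (size, owner, permissions, times), not the content type.
  public static String guessFormat(FileSystem fs, Path path) throws IOException {
    byte[] header = new byte[4];
    if (fs.getFileStatus(path).getLen() < header.length) {
      return "unknown (file too small)";
    }
    try (FSDataInputStream in = fs.open(path)) {
      in.readFully(0, header);
    }
    // Parquet files start (and end) with the bytes "PAR1".
    if (Arrays.equals(header, new byte[] {'P', 'A', 'R', '1'})) return "parquet";
    // ORC files start with "ORC".
    if (header[0] == 'O' && header[1] == 'R' && header[2] == 'C') return "orc";
    // Avro object container files start with "Obj" followed by a 0x01 version byte.
    if (header[0] == 'O' && header[1] == 'b' && header[2] == 'j' && header[3] == 1) return "avro";
    // SequenceFiles start with "SEQ" followed by a version byte.
    if (header[0] == 'S' && header[1] == 'E' && header[2] == 'Q') return "sequencefile";
    return "unknown (possibly plain text)";
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    System.out.println(guessFormat(fs, new Path(args[0])));
  }
}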

Thanks,
Arun

Sent from my iPhone