Oozie, a product only a mad Russian would love. ;-)

Just say no to Hive. Go from flat files to Parquet. (This sounds easy, but there's some work that has to occur…)
Sorry for being cryptic: Mich's question is pretty much generic for anyone building a data lake, so it ends up overlapping with some work that I have to do…

-Mike

On Nov 9, 2016, at 4:16 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Thanks guys,

Sounds like: let Informatica get the data out of the RDBMS and create mappings to flat files, delivered to a directory visible to the HDFS host. Then push the CSV files into HDFS. From there, there are a number of options:

1. Run cron or Oozie to get the data out of HDFS (or build an external Hive table on that directory) and do an INSERT/SELECT into a Hive managed table.
2. Alternatively, use a Spark job to read the CSV data into an RDD, register a temp table, and do an INSERT/SELECT from the temp table into the Hive table. Bear in mind that we need a Spark job tailored to each table's schema.

I believe the above is feasible?

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 9 November 2016 at 21:26, Jörn Franke <jornfra...@gmail.com> wrote:

Basically you mention the options. However, there are several ways Informatica can extract from (or store to) an RDBMS. If the native option is not available, then you need to go via JDBC as you have described. Alternatively (but only if it is worth it) you can schedule fetching of the files via Oozie and use it to convert the CSV into ORC/Parquet etc.
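Option 1 above (external Hive table over the CSV landing directory, then INSERT/SELECT into a managed Parquet table) might look roughly like the following HiveQL; the table name, columns, and HDFS path are made up for illustration:

```sql
-- External table pointing at the CSV files landed in HDFS (illustrative schema).
CREATE EXTERNAL TABLE staging_customers (
  id      INT,
  name    STRING,
  updated TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/landing/customers';

-- Managed, Parquet-backed target table.
CREATE TABLE customers (
  id      INT,
  name    STRING,
  updated TIMESTAMP
)
STORED AS PARQUET;

-- Periodic load, run from cron or an Oozie coordinator.
INSERT OVERWRITE TABLE customers
SELECT * FROM staging_customers;
```

Dropping and re-pointing the external table (or adding partitions) per daily directory is a common variation; the external table itself holds no data, so this is cheap.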
If this is a common use case in the company, you can extend Informatica with Java classes that, for instance, convert the data directly into Parquet or ORC. However, that is some effort.

On 9 Nov 2016, at 14:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

I am exploring the idea of flexibly importing multiple RDBMS tables that a customer has into HDFS using Informatica. I don't want to use Informatica's connectivity tools to Hive etc. So this is what I have in mind:

1. If possible, get the table data out using Informatica and use the Informatica UI to convert the RDBMS data into some form of CSV or TSV file. (Can Informatica do it? I guess yes.)
2. Put the flat files on an edge node where an HDFS node can see them.
3. Assuming a directory can be created by Informatica daily, periodically run a cron job that ingests the data from those directories into equivalent daily directories in HDFS.
4. Once the data is in HDFS, one can use Spark CSV, Hive etc. to query the data.

The problem I have is to see if someone has done such a thing before. Specifically, can Informatica create target flat files in normal directories? Any other generic alternative?

Thanks

Dr Mich Talebzadeh
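Step 3 above (a cron job moving each day's Informatica drop into a daily-partitioned target directory) can be sketched as a small script. This is a minimal local-filesystem stand-in: in production the copy step would be `hdfs dfs -put` or a WebHDFS call rather than `shutil.copy2`, and the directory layout and date format here are assumptions, not anything Informatica mandates.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import List, Optional

def ingest_daily(landing_dir: str, target_root: str,
                 day: Optional[date] = None) -> List[str]:
    """Copy CSV files from the Informatica landing directory into a
    daily-partitioned target directory (stand-in for `hdfs dfs -put`).

    Returns the list of paths written, so a cron wrapper can log them.
    """
    day = day or date.today()
    target = Path(target_root) / day.strftime("%Y-%m-%d")
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for src in sorted(Path(landing_dir).glob("*.csv")):
        dst = target / src.name
        shutil.copy2(src, dst)  # swap for `hdfs dfs -put` against a real cluster
        copied.append(str(dst))
    return copied
```

A cron entry would then just invoke this once per day after the Informatica drop completes; idempotency comes from overwriting into the same dated directory on re-runs.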