Thank you all for your valuable inputs. Sorry for getting back late because
of personal issues.
Mich, to answer your earlier question: yes, it is a fact table. Thank you.
Ayan, I have tried ROWNUM as the split column with 100 partitions, but the
job was taking forever to complete. Thank you.
There is also another jdbc method in the DataFrameReader API to specify your
own predicates, one per partition. Using this you can control what is included
in each partition, for example:
val jdbcPartitionWhereClause = Array[String]("id < 100", "id >= 100 and id < 200")
val df = spark.read.jdbc(
  "jdbc:oracle:thin:@//dbhost:1521/orcl",   // placeholder connection URL
  "SCHEMA.BIG_TABLE",                       // placeholder table name
  jdbcPartitionWhereClause,                 // one partition per predicate element
  new java.util.Properties())               // add user/password as needed
Trust me, the only thing that can help you in your situation is the Sqoop
Oracle direct connector, known as OraOop. Spark cannot do everything; you
need an Oozie workflow that triggers a Sqoop job with the Oracle direct
connector to pull the data, then a Spark batch job to process it.
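For illustration, a direct-mode Sqoop import might look roughly like this
(hostname, credentials, paths and mapper count are all placeholders, not from
this thread):

sqoop import --direct \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username scott \
  --password-file /user/scott/.sqoop.pw \
  --table SCHEMA.BIG_TABLE \
  --target-dir /data/staging/big_table \
  --num-mappers 16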
Hope it helps !!
Hi,
One way can be to use the NTILE function to partition the data.
Example:
REM Creating test table
create table Test_part as select * from ( select rownum rn from all_tables
t1 ) where rn <= 1000;
REM Partition rows by Oracle block number, 11 partitions in this example.
select rn, ntile(11) over (order by dbms_rowid.rowid_block_number(rowid)) as part
  from Test_part;
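To tie this back to Spark, here is a rough sketch (my own, not from the
thread) that turns such bucket boundaries into the predicates array for
spark.read.jdbc. It assumes a spark-shell session, placeholder URL and
credentials, and for simplicity buckets on rn rather than block number:

import java.util.Properties

val url   = "jdbc:oracle:thin:@//dbhost:1521/orcl"  // placeholder URL
val props = new Properties()                        // set user/password here

// One (lo, hi) pair per NTILE bucket, computed inside Oracle.
val boundsSql =
  """(select min(rn) lo, max(rn) hi
    |   from (select rn, ntile(11) over (order by rn) part from Test_part)
    |  group by part) b""".stripMargin
val bounds = spark.read.jdbc(url, boundsSql, props).collect()

// Each pair becomes one predicate, i.e. one Spark partition / JDBC connection.
val predicates = bounds.map(r => s"rn between ${r.get(0)} and ${r.get(1)}")
val df = spark.read.jdbc(url, "Test_part", predicates, props)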
Good points
Unfortunately Data Pump and the older exp/imp utilities use a binary format
for export and import that cannot be used to load data into HDFS in a
suitable way. One can use what is known as a flat.sh script to get the data
out as tab- or comma-separated text, etc.
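For example, a minimal SQL*Plus sketch along those lines, reusing the
Test_part table from the NTILE example above (the spool path is a
placeholder):

set pagesize 0 linesize 32767 trimspool on feedback off heading off
set colsep ','
spool /tmp/test_part.csv
select rn from Test_part;
spool off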
ROWNUM is a pseudocolumn (not a real column) that is available in a query.
In Oracle, a pseudocolumn called ROWNUM is present in every row. You can
create an even distribution using that column. If it is one-time work, try
using Sqoop. Are you using Oracle's own appliance? Then you can use the Data
Pump format.
On 11 Sep 2016 01:59, "Mich Talebzadeh" wrote:
Creating an Oracle sequence for a table of 200 million rows is not going to
be that easy without changing the schema. It is possible to export that table
from prod, import it into DEV/TEST, and create the sequence there.
If it is a FACT table then the foreign keys from the dimension tables will
be natural candidates for a split column.
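A rough sketch of that last step on the DEV/TEST copy (object names are
placeholders, not from the thread):

REM Add a sequence-backed surrogate key to the imported copy.
create sequence big_table_seq cache 1000;
alter table big_table add (id number);
update big_table set id = big_table_seq.nextval;
commit;
create index big_table_id_ix on big_table (id);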
Hi,
Yeah, Spark does not have the same functionality as Sqoop.
I think one of the simpler solutions is to assign unique IDs to the rows of
the Oracle table yourself.
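Once such an id column exists (for instance via the sequence sketch above),
the column-partitioned variant of spark.read.jdbc can spread the read across
connections. A sketch with placeholder connection details, using the
200,800,000 row count from the original post as the upper bound:

val df = spark.read.jdbc(
  "jdbc:oracle:thin:@//dbhost:1521/orcl",  // placeholder URL
  "SCHEMA.BIG_TABLE", "id",                // table and the new unique id column
  1L, 200800000L,                          // lowerBound / upperBound of id
  100,                                     // numPartitions: parallel connections
  new java.util.Properties())              // add user/password as needed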
Thought?
// maropu
On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh wrote:
Strange that an Oracle table of 200 million plus rows has not been
partitioned. What matters here is to have parallel connections from JDBC to
Oracle, each reading a sub-set of the table. Any parallel fetch is going to
be better than reading with one connection from Oracle.
Surely among 404 columns there must be one that can be used to split the
reads.
Hello Everyone,
My goal is to use Spark SQL to load a huge amount of data from Oracle to HDFS.
*Table in Oracle:*
1) no primary key.
2) Has 404 columns.
3) Has 200,800,000 rows.
*Spark SQL:*
In my Spark SQL I want to read the data into n partitions in
parallel, for which I need to specify a suitable split column.
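For the HDFS side, once the partitioned read works, persisting it is a
one-liner; the path and Parquet format below are illustrative assumptions:

// Write the partitioned DataFrame to HDFS as Parquet.
df.write.mode("overwrite").parquet("hdfs:///data/staging/big_table")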