Re: Spark_JDBC_Partitions

2016-09-19 Thread Ajay Chander
Thank you all for your valuable inputs, and sorry for getting back late because of personal issues. Mich, to answer your earlier question: yes, it is a fact table. Thank you. Ayan, I tried ROWNUM as the split column with 100 partitions, but it was taking forever to complete the job. Thank you.

Re: Spark_JDBC_Partitions

2016-09-13 Thread Suresh Thalamati
There is also another jdbc method in the DataFrameReader API to specify your own predicates for each partition. Using this you can control what is included in each partition. val jdbcPartitionWhereClause = Array[String]("id < 100", "id >= 100 and id < 200") val df = spark.read.jdbc(
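
A minimal sketch of that predicate-based read, with placeholder connection details, table name, and id ranges (none of these are from the original thread):

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("OraclePredicateRead").getOrCreate()

    // One WHERE clause per partition; Spark opens one JDBC connection per predicate.
    val jdbcPartitionWhereClause = Array[String]("id < 100", "id >= 100 and id < 200")

    val connProps = new Properties()
    connProps.setProperty("user", "scott")                      // placeholder credentials
    connProps.setProperty("password", "tiger")
    connProps.setProperty("driver", "oracle.jdbc.OracleDriver")

    val df = spark.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1521/ORCL",                   // placeholder URL
      "MYSCHEMA.MYTABLE",                                       // placeholder table
      jdbcPartitionWhereClause,
      connProps
    )

Each element of the predicate array becomes the WHERE clause of one partition's query, so the ranges should together cover the whole table without overlapping.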

Re: Spark_JDBC_Partitions

2016-09-13 Thread Rabin Banerjee
Trust me, the only thing that can help in your situation is the Sqoop Oracle direct connector, known as OraOop. Spark cannot do everything; you need an Oozie workflow that triggers a Sqoop job with the Oracle direct connector to pull the data, then a Spark batch to process it. Hope it helps!
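
For reference, a typical invocation of such a Sqoop pull is sketched below; host, credentials, table, target directory, and mapper count are all placeholders (in recent Sqoop releases the OraOop code path is enabled via --direct):

    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username SCOTT \
      --password-file /user/me/oracle.password \
      --table MYSCHEMA.MYTABLE \
      --target-dir /data/mytable \
      --num-mappers 16 \
      --direct

The Oozie workflow would then chain this Sqoop action with the Spark action that processes the landed files.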

Re: Spark_JDBC_Partitions

2016-09-13 Thread Igor Racic
Hi, one way is to use the NTILE function to partition the data. Example:
REM Create a test table
create table Test_part as select * from ( select rownum rn from all_tables t1 ) where rn <= 1000;
REM Partition rows by Oracle block number, 11 partitions in this example.
select ntile(11) over( order
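
The query above is cut off; a hedged sketch of the full idea, bucketing rows by their Oracle block number into 11 buckets (using the test table created above), might look like:

    select t.rn,
           ntile(11) over (order by dbms_rowid.rowid_block_number(t.rowid)) as part_id
    from Test_part t;

The resulting part_id values (1 to 11) can then be used to build one WHERE predicate per partition (part_id = 1, part_id = 2, ...) for the predicate-based spark.read.jdbc call shown elsewhere in this thread.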

Re: Spark_JDBC_Partitions

2016-09-10 Thread Mich Talebzadeh
Good points. Unfortunately Data Pump and exp/imp use a binary format for export and import that cannot be used to load data into HDFS in a suitable way. One can use what is known as a flat-file shell script to get the data out tab- or comma-separated, etc. ROWNUM is a pseudocolumn (not a real column) that is

Re: Spark_JDBC_Partitions

2016-09-10 Thread ayan guha
In Oracle a pseudocolumn called ROWNUM is present in every row. You can create an even distribution using that column. If it is one-time work, try using Sqoop. Are you using Oracle's own appliance? Then you can use the Data Pump format.
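
A hedged sketch of using ROWNUM that way with Spark: since ROWNUM only exists inside a query, wrap it in a subquery passed as the table, then use it as the partition column (URL, schema, table, and credentials are placeholders; the upper bound is the row count quoted in the original post):

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")                  // placeholder URL
      .option("dbtable", "(select t.*, rownum as rno from MYSCHEMA.MYTABLE t) sub")
      .option("partitionColumn", "rno")
      .option("lowerBound", "1")
      .option("upperBound", "200800000")
      .option("numPartitions", "100")
      .option("user", "scott")                                                // placeholder credentials
      .option("password", "tiger")
      .load()

Note that each of the 100 partition queries re-evaluates the ROWNUM subquery on the Oracle side, which is consistent with the report elsewhere in this thread that this approach took forever on a 200-million-row table.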

Re: Spark_JDBC_Partitions

2016-09-10 Thread Mich Talebzadeh
Creating an Oracle sequence for a table of 200 million rows is not going to be that easy without changing the schema. It is possible to export that table from prod, import it into DEV/TEST, and create the sequence there. If it is a FACT table then the foreign keys from the dimension tables will be
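
For completeness, the kind of schema change being discussed would look roughly like this on the Oracle side (object names are placeholders, and on a 200-million-row table the UPDATE alone is a very heavy operation):

    -- illustrative sketch only; names are placeholders
    create sequence myschema.mytable_seq cache 10000;
    alter table myschema.mytable add (row_id number);
    update myschema.mytable set row_id = myschema.mytable_seq.nextval;
    commit;
    create index mytable_row_id_ix on myschema.mytable (row_id);

Once such a surrogate key exists, it can serve as the split/partition column for Sqoop or for Spark's JDBC reader.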

Re: Spark_JDBC_Partitions

2016-09-10 Thread Takeshi Yamamuro
Hi, yeah, Spark does not have the same functionality as Sqoop. I think one simple solution is to assign unique IDs to the Oracle table yourself. Thoughts? // maropu

Re: Spark_JDBC_Partitions

2016-09-10 Thread Mich Talebzadeh
Strange that an Oracle table of 200 million plus rows has not been partitioned. What matters here is to have parallel connections from JDBC to Oracle, each reading a sub-set of the table. Any parallel fetch is going to be better than reading with one connection from Oracle. Surely among 404 columns
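
Assuming some reasonably well-distributed numeric column can be found among those 404 (ACCOUNT_ID below is purely a hypothetical name), the standard column-based parallel read is sketched as:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "scott")                     // placeholder credentials
    props.setProperty("password", "tiger")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // 20 parallel connections, each scanning one range of the partition column
    val df = spark.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1521/ORCL",              // placeholder URL
      "MYSCHEMA.MYTABLE",                                  // placeholder table
      "ACCOUNT_ID",                                        // hypothetical numeric column
      1L,                                                  // lower bound
      200800000L,                                          // upper bound (row count from the original post)
      20,                                                  // number of partitions
      props
    )

Spark splits the lowerBound..upperBound range into equal strides, one WHERE clause per partition, so a badly skewed column produces badly skewed partitions.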

Spark_JDBC_Partitions

2016-09-10 Thread Ajay Chander
Hello everyone, my goal is to use Spark SQL to load a huge amount of data from Oracle to HDFS. *Table in Oracle:* 1) No primary key. 2) Has 404 columns. 3) Has 200,800,000 rows. *Spark SQL:* In my Spark SQL I want to read the data into n partitions in parallel, for which I need to