Hi,

One way is to use the NTILE window function to partition the data.
Example:

REM Create a test table with up to 1,000 rows
create table Test_part as
  select * from ( select rownum rn from all_tables t1 ) where rn <= 1000;

REM Bucket rows by Oracle block number, 11 partitions in this example.
select ntile(11) over( order by dbms_rowid.ROWID_BLOCK_NUMBER( rowid ) ) nt
from Test_part;


Let's check the distribution:

select nt, count(*) from ( select ntile(11) over( order by
dbms_rowid.ROWID_BLOCK_NUMBER( rowid ) ) nt from Test_part) group by nt;

        NT   COUNT(*)
---------- ----------
         1         10
         6         10
        11          9
         2         10
         4         10
         5         10
         8         10
         3         10
         7         10
         9          9
        10          9

11 rows selected.
That looks good. Feel free to choose any other ORDER BY condition that best
suits your case.

So you can either:
1) have one session read the table and decide where each row goes (one
reader), or
2) run multiple reads in parallel, each selecting one partition number (see
the sketch below). Note that in this case the whole table is read n times,
which is more intensive on the read side (multiple readers).
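
A minimal Spark sketch of option 2, assuming Spark 2.x with the Oracle JDBC
driver on the classpath (the URL, credentials and paths below are
placeholders):

// Each predicate selects one NTILE bucket; Spark runs one query per
// bucket in parallel, i.e. 11 readers and 11 scans of the table.
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ntile-jdbc-read").getOrCreate()

// Inline view that tags every row with its NTILE bucket number.
val dbtable =
  """(select t.*,
            ntile(11) over (order by dbms_rowid.rowid_block_number(rowid)) nt
       from Test_part t)"""
val predicates = (1 to 11).map(i => s"nt = $i").toArray

val props = new Properties()
props.setProperty("user", "scott")                       // placeholder
props.setProperty("password", "tiger")                   // placeholder
props.setProperty("driver", "oracle.jdbc.OracleDriver")

val df = spark.read.jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL",
                         dbtable, predicates, props)
df.write.parquet("hdfs:///tmp/test_part")                // land on HDFS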

Regards,
Igor



2016-09-11 0:46 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

> Good points
>
> Unfortunately datapump and the exp/imp utilities use a binary format for
> export and import that cannot be used to load data into HDFS in a suitable
> way.
>
> One can use what is known as a flat.sh script to get the data out as tab-
> or comma-separated files, etc.
>
> ROWNUM is a pseudocolumn (not a real column) that is available in a query.
> The issue is that in a table of 280 million rows, getting the position of
> a row requires a table scan, since no index can be built on ROWNUM
> (assuming there is no other suitable index). Not ideal, but it can be done.
>
> I think a better alternative is to use datapump to take that table to
> DEV/TEST, add a sequence (like an IDENTITY column in Sybase), build a
> unique index on the sequence column and do the partitioning there.
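>
> A minimal Spark sketch of reading through such a sequence column (URL,
> credentials, table name and bounds are placeholders):
>
> // Spark splits [lowerBound, upperBound] on the indexed sequence column
> // into numPartitions non-overlapping ranges, one JDBC query per range.
> val df = spark.read.format("jdbc")
>   .option("url", "jdbc:oracle:thin:@//devhost:1521/DEV")
>   .option("dbtable", "MY_TABLE_COPY")
>   .option("user", "scott")
>   .option("password", "tiger")
>   .option("partitionColumn", "ID")    // the added sequence column
>   .option("lowerBound", "1")
>   .option("upperBound", "200800000")  // max ID value
>   .option("numPartitions", "104")
>   .load()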
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 10 September 2016 at 22:37, ayan guha <guha.a...@gmail.com> wrote:
>
>> In Oracle, a pseudocolumn called ROWNUM is present in every row. You can
>> create an even distribution using that column (see the sketch below). If
>> it is one-time work, try using Sqoop. Are you using Oracle's own
>> appliance? Then you can use the Data Pump format.
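>>
>> A quick sketch of that idea (hypothetical names; jdbcUrl and props stand
>> for the JDBC URL and connection Properties). Note that ROWNUM must be
>> materialized in an inline view before it can be filtered on, and it is
>> only stable while the table is not changing:
>>
>> val dbtable = "(select t.*, rownum rn from MY_TABLE t)"
>> // 12 buckets; each predicate drives one parallel JDBC partition in Spark.
>> val predicates = (0 until 12).map(i => s"mod(rn, 12) = $i").toArray
>> val df = spark.read.jdbc(jdbcUrl, dbtable, predicates, props)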
>> On 11 Sep 2016 01:59, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Creating an Oracle sequence for a table of 200 million rows is not going
>>> to be that easy without changing the schema. It is possible to export that
>>> table from prod, import it into DEV/TEST and create the sequence there.
>>>
>>> If it is a FACT table then the foreign keys from the dimension tables
>>> will be bitmap-indexed on the FACT table, so they can potentially be used.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> On 10 September 2016 at 16:42, Takeshi Yamamuro <linguin....@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Yeah, Spark does not have the same functionality as Sqoop.
>>>> I think one simple solution is to assign unique IDs to the Oracle
>>>> table yourself.
>>>> Thoughts?
>>>>
>>>> // maropu
>>>>
>>>>
>>>> On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Strange that an Oracle table of 200 million plus rows has not been
>>>>> partitioned.
>>>>>
>>>>> What matters here is to have parallel connections from JDBC to Oracle,
>>>>> each reading a subset of the table. Any parallel fetch is going to be
>>>>> better than reading from Oracle over a single connection.
>>>>>
>>>>> Surely among 404 columns there must be one with high enough cardinality
>>>>> to satisfy this.
>>>>>
>>>>> Maybe you should just create table <small> as select * from
>>>>> Oracle_table where rownum <= 1000000; and use that for testing.
>>>>>
>>>>> Another alternative is to use the Oracle SQL Connector for HDFS
>>>>> <https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125>,
>>>>> which can do it for you. With 404 columns it is difficult to suggest any
>>>>> alternative. Is this a FACT table?
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> On 10 September 2016 at 16:20, Ajay Chander <itsche...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Everyone,
>>>>>>
>>>>>> My goal is to use Spark SQL to load a huge amount of data from Oracle
>>>>>> into HDFS.
>>>>>>
>>>>>> *Table in Oracle:*
>>>>>> 1) No primary key.
>>>>>> 2) Has 404 columns.
>>>>>> 3) Has 200,800,000 rows.
>>>>>>
>>>>>> *Spark SQL:*
>>>>>> In my Spark SQL job I want to read the data into n partitions in
>>>>>> parallel, for which I need to provide 'partitionColumn', 'lowerBound',
>>>>>> 'upperBound' and 'numPartitions' for the Oracle table. My table in
>>>>>> Oracle has no such column to satisfy this need (it is highly skewed);
>>>>>> because of that, if numPartitions is set to 104, 102 tasks finish in a
>>>>>> minute, 1 task finishes in 20 minutes and the last one takes forever.
>>>>>>
>>>>>> Is there anything I could do to distribute the data evenly across
>>>>>> partitions? Can we set up a fake query to orchestrate this pull, as we
>>>>>> do in Sqoop with '--boundary-query "SELECT CAST(0 AS NUMBER) AS
>>>>>> MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'?
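>>>>>>
>>>>>> (For illustration, something like the following is the kind of split I
>>>>>> am hoping for; it is only a sketch, with jdbcUrl, props and the table
>>>>>> name as placeholders, and it uses Oracle's ORA_HASH on the ROWID to
>>>>>> bucket rows evenly rather than a boundary query:)
>>>>>>
>>>>>> // ORA_HASH(rowid, 11) maps rows evenly into buckets 0..11 regardless
>>>>>> // of column skew; each predicate becomes one Spark partition, at the
>>>>>> // cost of scanning the table once per bucket.
>>>>>> val predicates = (0 to 11).map(i => s"ora_hash(rowid, 11) = $i").toArray
>>>>>> val df = spark.read.jdbc(jdbcUrl, "MY_BIG_TABLE", predicates, props)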
>>>>>>
>>>>>> Any pointers are appreciated.
>>>>>>
>>>>>> Thanks for your time.
>>>>>>
>>>>>> ~ Ajay
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>
