Trust me, the only thing that can help in your situation is the Sqoop Oracle
direct connector, known as OraOop. Spark cannot do everything; you need an
Oozie workflow that triggers a Sqoop job with the Oracle direct connector to
pull the data, and then a Spark batch job to process it.
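
For reference, a Sqoop invocation with the Oracle direct connector might look
roughly like this sketch (host, credentials, schema and paths are placeholders):

# all names below are placeholders, adjust for your environment
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table MYSCHEMA.BIG_TABLE \
  --direct \
  --target-dir /data/big_table \
  --num-mappers 12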

Hope it helps !!

On Tue, Sep 13, 2016 at 6:10 PM, Igor Racic <igor.ra...@gmail.com> wrote:

> Hi,
>
> One way is to use the NTILE function to partition the data.
> Example:
>
> REM Creating test table
> create table Test_part as select * from ( select rownum rn from all_tables
> t1 ) where rn <= 1000;
>
> REM Partition lines by Oracle block number, 11 partitions in this example.
> select ntile(11) over( order by dbms_rowid.ROWID_BLOCK_NUMBER( rowid ) )
> nt from Test_part;
>
>
> Let's see distribution:
>
> select nt, count(*) from ( select ntile(11) over( order by
> dbms_rowid.ROWID_BLOCK_NUMBER( rowid ) ) nt from Test_part) group by nt;
>
>         NT   COUNT(*)
> ---------- ----------
>          1         10
>          6         10
>         11          9
>          2         10
>          4         10
>          5         10
>          8         10
>          3         10
>          7         10
>          9          9
>         10          9
>
> 11 rows selected.
> ^^ Looks good. Feel free to choose any other ordering condition, whichever
> best suits your case.
>
> So you can:
> 1) have one session reading and then decide where each line goes (one reader), or
> 2) do multiple reads, each specifying a partition number (multiple readers).
> Note that in this case you read the whole table n times (in parallel), so it
> is more intensive on the read side; see the sketch below.
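>
> For reference, a minimal Spark-side sketch of option 2: one JDBC predicate per
> NTILE bucket, so one Oracle session per bucket. The connect string, credentials
> and table name are placeholders, and spark is the SparkSession you get in
> spark-shell:
>
> import java.util.Properties
>
> val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"   // placeholder connect string
> val props = new Properties()
> props.setProperty("user", "scott")                 // placeholder credentials
> props.setProperty("password", "tiger")
> props.setProperty("driver", "oracle.jdbc.OracleDriver")
>
> // Subquery that tags every row with its NTILE bucket (11 buckets here)
> val table = "(select t.*, ntile(11) over (order by dbms_rowid.ROWID_BLOCK_NUMBER(rowid)) nt from Test_part t) q"
>
> // One predicate per bucket -> one Spark partition reading that bucket
> val predicates = (1 to 11).map(i => s"nt = $i").toArray
>
> val df = spark.read.jdbc(url, table, predicates, props)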
>
> Regards,
> Igor
>
>
>
> 2016-09-11 0:46 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>
>> Good points
>>
>> Unfortunately datapump, exp and imp use binary formats for export and
>> import; those cannot be used to load data into HDFS in a suitable way.
>>
>> One can use what is known as a flat.sh script to get the data out as tab- or
>> comma-separated files, etc.
>>
>> ROWNUM is a pseudocolumn (not a real column) that is available in a query.
>> The issue is that, in a table of 280 million rows, getting the position of a
>> row requires a full table scan, since no index can be built on it (assuming
>> there is no other suitable index). Not ideal, but it can be done.
>>
>> I think a better alternative is to use datapump to take that table to
>> DEV/TEST, add a sequence (like an IDENTITY column in Sybase), build a
>> unique index on the sequence column and do the partitioning there.
>>
>> HTH
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 10 September 2016 at 22:37, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> In Oracle, something called ROWNUM is present in every row. You can
>>> create an even distribution using that column. If it is one-time work,
>>> try using Sqoop. Are you using Oracle's own appliance? Then you can use
>>> the data pump format.
>>> On 11 Sep 2016 01:59, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Creating an Oracle sequence for a table of 200 million rows is not going
>>>> to be that easy without changing the schema. It is possible to export that
>>>> table from prod, import it to DEV/TEST and create the sequence there.
>>>>
>>>> If it is a FACT table, then the foreign keys from the dimension tables
>>>> will be bitmap indexes on the FACT table, so they can potentially be used.
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>> On 10 September 2016 at 16:42, Takeshi Yamamuro <linguin....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Yea, Spark does not have the same functionality as Sqoop.
>>>>> I think one simple solution is to assign unique ids to the rows of the
>>>>> Oracle table yourself.
>>>>> Thoughts?
>>>>>
>>>>> // maropu
>>>>>
>>>>>
>>>>> On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Strange that an Oracle table of 200 million plus rows has not been
>>>>>> partitioned.
>>>>>>
>>>>>> What matters here is to have parallel connections from JDBC to Oracle,
>>>>>> each reading a sub-set of the table. Any parallel fetch is going to be
>>>>>> better than reading with one connection from Oracle.
>>>>>>
>>>>>> Surely among the 404 columns there must be one with high cardinality
>>>>>> that can satisfy this need.
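>>>>>>
>>>>>> For instance, assuming you do find such a numeric column (called ID here
>>>>>> purely for illustration), a partitioned JDBC read could look roughly like
>>>>>> this sketch (connect string, credentials, table name and bounds are all
>>>>>> placeholders; spark is the spark-shell SparkSession):
>>>>>>
>>>>>> import java.util.Properties
>>>>>>
>>>>>> val props = new Properties()
>>>>>> props.setProperty("user", "scott")        // placeholder credentials
>>>>>> props.setProperty("password", "tiger")
>>>>>> props.setProperty("driver", "oracle.jdbc.OracleDriver")
>>>>>>
>>>>>> val df = spark.read.jdbc(
>>>>>>   "jdbc:oracle:thin:@//dbhost:1521/ORCL",  // placeholder connect string
>>>>>>   "myschema.big_table",                    // placeholder table name
>>>>>>   "ID",                                    // high-cardinality numeric column
>>>>>>   0L,                                      // lowerBound (roughly min(ID))
>>>>>>   200000000L,                              // upperBound (roughly max(ID))
>>>>>>   104,                                     // numPartitions
>>>>>>   props)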
>>>>>>
>>>>>> Maybe you should just create table <small> as select * from
>>>>>> Oracle_table where rownum <= 1000000; and use that for testing.
>>>>>>
>>>>>> Another alternative is to use the Oracle SQL Connector for HDFS
>>>>>> <https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125>,
>>>>>> which can do it for you. With 404 columns it is difficult to suggest
>>>>>> anything else. Is this a FACT table?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn
>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>> may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10 September 2016 at 16:20, Ajay Chander <itsche...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Everyone,
>>>>>>>
>>>>>>> My goal is to use Spark SQL to load a huge amount of data from Oracle
>>>>>>> to HDFS.
>>>>>>>
>>>>>>> *Table in Oracle:*
>>>>>>> 1) no primary key.
>>>>>>> 2) Has 404 columns.
>>>>>>> 3) Has 200,800,000 rows.
>>>>>>>
>>>>>>> *Spark SQL:*
>>>>>>> In my Spark SQL job I want to read the data into n partitions in
>>>>>>> parallel, for which I need to provide 'partitionColumn', 'lowerBound',
>>>>>>> 'upperBound' and 'numPartitions' for the Oracle table. My table in
>>>>>>> Oracle has no such column to satisfy this need (the data is highly
>>>>>>> skewed). Because of that, if numPartitions is set to 104, 102 tasks
>>>>>>> finish in a minute, one task finishes in 20 minutes, and the last one
>>>>>>> takes forever.
>>>>>>>
>>>>>>> Is there anything I can do to distribute the data evenly across
>>>>>>> partitions? Can we set up some fake query to orchestrate this pull, as
>>>>>>> we do in Sqoop with '--boundary-query "SELECT CAST(0 AS NUMBER) AS
>>>>>>> MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'?
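>>>>>>>
>>>>>>> Roughly, what I have in mind on the Spark side is something like the
>>>>>>> sketch below, deriving an evenly distributed fake partition column from
>>>>>>> ROWID (connect string, credentials, table name and the 12-way split are
>>>>>>> made up; spark is the spark-shell SparkSession):
>>>>>>>
>>>>>>> import java.util.Properties
>>>>>>>
>>>>>>> val props = new Properties()
>>>>>>> props.setProperty("user", "scott")      // placeholder credentials
>>>>>>> props.setProperty("password", "tiger")
>>>>>>> props.setProperty("driver", "oracle.jdbc.OracleDriver")
>>>>>>>
>>>>>>> // Expose MOD(ORA_HASH(ROWID), 12) as an evenly distributed partition column
>>>>>>> val table = "(select t.*, mod(ora_hash(rowid), 12) part_num from myschema.my_table t) q"
>>>>>>>
>>>>>>> val df = spark.read.jdbc(
>>>>>>>   "jdbc:oracle:thin:@//dbhost:1521/ORCL", // placeholder connect string
>>>>>>>   table, "part_num", 0L, 12L, 12, props)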
>>>>>>>
>>>>>>> Any pointers are appreciated.
>>>>>>>
>>>>>>> Thanks for your time.
>>>>>>>
>>>>>>> ~ Ajay
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
>>>>>
>>>>
>>>>
>>
>
