Trust me, the only thing that can help you in your situation is the Sqoop Oracle direct connector, known as OraOop. Spark cannot do everything; you need an Oozie workflow that triggers a Sqoop job with the Oracle direct connector to pull the data, then a Spark batch job to process it.

Hope it helps!
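For reference, a minimal sketch of the kind of direct-connector import described above. The connect string, credentials, paths and table name are all hypothetical; the flags are standard Sqoop ones (the OraOop functionality was folded into Sqoop, roughly from 1.4.5 onward, and is enabled for Oracle with --direct):

    sqoop import \
      --direct \
      --connect jdbc:oracle:thin:@//dbhost:1521/SVC \
      --username app_user \
      --password-file /user/app_user/.oracle_pwd \
      --table BIG_SCHEMA.BIG_TABLE \
      --target-dir /data/big_table \
      --num-mappers 12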
On Tue, Sep 13, 2016 at 6:10 PM, Igor Racic <igor.ra...@gmail.com> wrote:

Hi,

One way can be to use the NTILE function to partition the data. Example:

    REM Create a test table
    create table Test_part as
      select * from (select rownum rn from all_tables t1) where rn <= 1000;

    REM Partition rows by Oracle block number, 11 partitions in this example
    select ntile(11) over (order by dbms_rowid.ROWID_BLOCK_NUMBER(rowid)) nt
      from Test_part;

Let's see the distribution:

    select nt, count(*)
      from (select ntile(11) over (order by dbms_rowid.ROWID_BLOCK_NUMBER(rowid)) nt
              from Test_part)
     group by nt;

            NT   COUNT(*)
    ---------- ----------
             1         10
             6         10
            11          9
             2         10
             4         10
             5         10
             8         10
             3         10
             7         10
             9          9
            10          9

    11 rows selected.

^^ It looks good. Feel free to choose any other ordering condition that best suits your case.

So you can:
1) have one session reading and then decide where each line goes (one reader), or
2) do multiple reads, each specifying a partition number. Note that in this case you read the whole table n times (in parallel), which is more intensive on the read side (multiple readers).

Regards,
Igor

2016-09-11 0:46 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:

Good points.

Unfortunately Data Pump (expdp/impdp) and the older exp/imp utilities use a binary format for export and import, which cannot be used to load data into HDFS in a suitable way. One can use a flat-file extract script to get the data out tab- or comma-separated.

ROWNUM is a pseudocolumn (not a real column) that is available in a query. The issue is that in a table of 280 million rows, getting the position of a row requires a full table scan, since no index can be built on a pseudocolumn (assuming there is no other suitable index). Not ideal, but it can be done.

I think a better alternative is to use Data Pump to take that table to DEV/TEST, add a sequence-populated column (like an IDENTITY column in Sybase), build a unique index on that column, and do the partitioning there.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On 10 September 2016 at 22:37, ayan guha <guha.a...@gmail.com> wrote:

In Oracle, a pseudocolumn called ROWNUM is present in every row. You can create an even distribution using that column. If it is one-time work, try using Sqoop. Are you using Oracle's own appliance? Then you can use the Data Pump format.

On 11 Sep 2016 01:59, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:

Creating an Oracle sequence for a table of 200 million rows is not going to be that easy without changing the schema. It is possible to export that table from prod, import it into DEV/TEST, and create the sequence there.

If it is a FACT table, then the foreign keys from the dimension tables will be bitmap indexes on the FACT table, so they can potentially be used.

HTH

Dr Mich Talebzadeh
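Once such a sequence column exists, the parallel read itself is straightforward. A minimal sketch, assuming a Spark 2.x SparkSession named spark and the Oracle JDBC driver on the classpath; the connect string, credentials, table and column names are hypothetical, and the upper bound should be the column's actual maximum:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "app_user")          // hypothetical credentials
    props.setProperty("password", "secret")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // Spark issues one range-predicate query per partition on the
    // surrogate key, so the unique index on it gets used.
    val df = spark.read.jdbc(
      url = "jdbc:oracle:thin:@//dbhost:1521/SVC", // hypothetical connect string
      table = "TEST_SCHEMA.BIG_TABLE_COPY",        // the DEV/TEST copy with SEQ_ID
      columnName = "SEQ_ID",                       // unique, indexed, evenly spread
      lowerBound = 1L,
      upperBound = 200800000L,                     // max(SEQ_ID) on the copy
      numPartitions = 104,
      connectionProperties = props)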
On 10 September 2016 at 16:42, Takeshi Yamamuro <linguin....@gmail.com> wrote:

Hi,

Yeah, Spark does not have the same functionality as Sqoop here. I think one simple solution is to assign unique IDs to the Oracle table yourself. Thoughts?

// maropu

On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Strange that an Oracle table of 200 million plus rows has not been partitioned.

What matters here is to have parallel connections from JDBC to Oracle, each reading a subset of the table. Any parallel fetch is going to be better than reading with one connection from Oracle.

Surely among 404 columns there must be one with high enough cardinality to satisfy this. Maybe you should just create table <small> as select * from Oracle_table where rownum <= 1000000 and use that for testing.

Another alternative is to use the Oracle SQL Connector for HDFS <https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125>, which can do it for you. With 404 columns it is difficult to suggest anything else. Is this a FACT table?

HTH

Dr Mich Talebzadeh
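A sketch of the test read suggested above, using the options form of the JDBC source: the one-million-row slice is pushed down as a subquery, and the read is split on a hypothetical high-cardinality numeric column (all names and the connect string are assumptions):

    // Assumes a Spark 2.x SparkSession named spark and the Oracle
    // JDBC driver on the classpath.
    val sample = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/SVC")
      .option("dbtable", "(select * from BIG_SCHEMA.BIG_TABLE where rownum <= 1000000) t")
      .option("user", "app_user")
      .option("password", "secret")
      .option("driver", "oracle.jdbc.OracleDriver")
      .option("partitionColumn", "HIGH_CARD_COL")  // numeric, roughly uniform
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load()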
On 10 September 2016 at 16:20, Ajay Chander <itsche...@gmail.com> wrote:

Hello Everyone,

My goal is to use Spark SQL to load a huge amount of data from Oracle into HDFS.

Table in Oracle:
1) No primary key.
2) Has 404 columns.
3) Has 200,800,000 rows.

Spark SQL:
In my Spark SQL job I want to read the data into n partitions in parallel, for which I need to provide 'partitionColumn', 'lowerBound', 'upperBound', and 'numPartitions' for the Oracle table. My table in Oracle has no column that satisfies this need (the candidates are highly skewed); because of that, if numPartitions is set to 104, 102 tasks finish in a minute, one task finishes in 20 minutes, and the last one takes forever.

Is there anything I can do to distribute the data evenly into partitions? Can we use a fake query to orchestrate this pull process, as we do in Sqoop with '--boundary-query "SELECT CAST(0 AS NUMBER) AS MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'?

Any pointers are appreciated.

Thanks for your time.

~ Ajay
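On the boundary-query question: Spark's JDBC reader also accepts explicit per-partition predicates, so the MOD-bucket trick can be emulated without any suitable column by bucketing on the Oracle block number, as in Igor's reply further up. A sketch (connect string, credentials and table name are hypothetical); note Igor's caveat that each predicate runs as its own scan of the table:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "app_user")              // hypothetical credentials
    props.setProperty("password", "secret")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    // One predicate per partition: roughly even, skew-free buckets at
    // the cost of n parallel scans (Igor's "multiple readers" option).
    val n = 12
    val predicates = (0 until n).map { i =>
      s"MOD(dbms_rowid.ROWID_BLOCK_NUMBER(rowid), $n) = $i"
    }.toArray

    val df = spark.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1521/SVC",           // hypothetical connect string
      "BIG_SCHEMA.BIG_TABLE",
      predicates,
      props)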