Hi Deenar,
Thanks for your valuable inputs

Here is a scenario: a source table does not have any column (with unique,
numeric, sequential values) suitable as the partition column for the
JdbcRDD constructor or the DataSource API. How should I proceed in this
case? Also, please let me know whether Spark applies any default approach
when no "lowerBound" and "upperBound" inputs are given to the JdbcRDD
constructor or DataSource API.

Thanks in advance for your inputs

Regards,
Satish Chandra J


On Thu, Sep 24, 2015 at 10:18 PM, Deenar Toraskar <
deenar.toras...@thinkreactive.co.uk> wrote:

> you are interpreting the JdbcRDD API incorrectly. If you want to use
> partitions, then the column used to partition (and present in the where
> clause) must be numeric, and the lower bound and upper bound must be the
> min and max values of that column. Spark will distribute the range equally
> over the number of partitions selected. So in your case OFFSET takes the
> first placeholder value and LIMIT the second.
>
> numPartitions 1 - Your query will be called once, with the first
> placeholder value 0 and the second 100; this explains how you get 100 rows:
> select *  from schema.Table OFFSET 0 LIMIT 100
>
> numPartitions 2 - Your query will be called twice: first with placeholder
> values 0 and 50, and the second time with 51 and 100. Again this explains
> (up to an off-by-one at the partition boundary) the 151 records you see:
>
> select *  from schema.Table OFFSET 0 LIMIT 50
> select *  from schema.Table OFFSET 51 LIMIT 100
>
> numPartitions 3 - Your query will be called thrice
>
> select *  from schema.Table OFFSET 0 LIMIT 34
> select *  from schema.Table OFFSET 35 LIMIT 67
> select *  from schema.Table OFFSET 68 LIMIT 100
>
> That explains why you get 201 records. You need to amend the query and
> provide lower and upper bounds aligned with the column used in the where
> clause.
>
> See
> http://www.sparkexpert.com/2015/01/02/load-database-data-into-spark-using-jdbcrdd-in-java/
>
> Deenar
>
>
>
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
> 07714140812
>
>
>
> On 24 September 2015 at 11:55, satish chandra j <jsatishchan...@gmail.com>
> wrote:
>
>> HI Deenar,
>>
>> Please find the SQL query below:
>>
>> var SQL_RDD = new JdbcRDD(sc,
>>   () => DriverManager.getConnection(url, user, pass),
>>   "select col1, col2, col3 ... col37 from schema.Table LIMIT ? OFFSET ?",
>>   100, 0, 1,
>>   (r: ResultSet) => (r.getInt("col1"), r.getInt("col2"), ..., r.getInt("col37")))
>>
>>
>> With the above 100, 0, 1 I am getting SQL_RDD.count as 100.
>> When set to 100, 0, 2 I am getting SQL_RDD.count as 151.
>> When set to 100, 0, 3 I am getting SQL_RDD.count as 201.
>>
>> I expect the count to be 100 on every execution; let me know if I am
>> missing anything here.
>>
>> Regards,
>> Satish Chandra
>>
>>
>> On Thu, Sep 24, 2015 at 12:48 AM, Deenar Toraskar <
>> deenar.toras...@thinkreactive.co.uk> wrote:
>>
>>> Satish
>>>
>>> Can you post the SQL query you are using?
>>>
>>> The SQL query must have 2 placeholders, and together they should form an
>>> inclusive range (<= and >=).
>>>
>>> e.g. select title, author from books where ? <= id and id <= ?
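>>>
>>> For illustration, a minimal JdbcRDD call using that query might look
>>> like the sketch below (hypothetical: it assumes a numeric `id` column
>>> spanning 1..1000 and existing `sc`, `url`, `user`, `pass` values):

```scala
// Hedged sketch of the inclusive-range pattern; bounds, table name and
// connection details are placeholders, not values from this thread.
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val booksRdd = new JdbcRDD(
  sc,                                               // existing SparkContext
  () => DriverManager.getConnection(url, user, pass),
  "select title, author from books where ? <= id and id <= ?",
  1,     // lowerBound: min(id) in the table
  1000,  // upperBound: max(id) in the table
  3,     // numPartitions: range split into 1..333, 334..666, 667..1000
  (r: ResultSet) => (r.getString("title"), r.getString("author")))
```

>>> Because each partition's range is disjoint and inclusive, the union of
>>> all partitions covers every id exactly once.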
>>>
>>> Are you doing this?
>>>
>>> Deenar
>>>
>>>
>>>
>>>
>>> *Think Reactive Ltd*
>>> deenar.toras...@thinkreactive.co.uk
>>> 07714140812
>>>
>>>
>>>
>>> On 23 September 2015 at 13:47, satish chandra j <
>>> jsatishchan...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> Could anybody provide inputs if they have come across a similar issue?
>>>>
>>>> @Rishitesh
>>>> Could you provide any sample code for using JdbcRDDSuite?
>>>>
>>>>
>>>> Regards,
>>>> Satish Chandra
>>>>
>>>> On Wed, Sep 23, 2015 at 5:14 PM, Rishitesh Mishra <
>>>> rishi80.mis...@gmail.com> wrote:
>>>>
>>>>> I am using Spark 1.5. I always get count = 100, irrespective of num
>>>>> partitions.
>>>>>
>>>>> On Wed, Sep 23, 2015 at 5:00 PM, satish chandra j <
>>>>> jsatishchan...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am currently using Spark 1.2.2. Could you please let me know the
>>>>>> output count you got by using JdbcRDDSuite?
>>>>>>
>>>>>> Regards,
>>>>>> Satish Chandra
>>>>>>
>>>>>> On Wed, Sep 23, 2015 at 4:02 PM, Rishitesh Mishra <
>>>>>> rishi80.mis...@gmail.com> wrote:
>>>>>>
>>>>>>> Which version of Spark are you using? I can get correct results
>>>>>>> using JdbcRDD. In fact, there is a test suite precisely for this
>>>>>>> (JdbcRDDSuite).
>>>>>>> I changed it according to your input and got correct results from
>>>>>>> this test suite.
>>>>>>>
>>>>>>> On Wed, Sep 23, 2015 at 11:00 AM, satish chandra j <
>>>>>>> jsatishchan...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> JdbcRDD constructor has following parameters,
>>>>>>>>
>>>>>>>> JdbcRDD(SparkContext sc,
>>>>>>>>         scala.Function0<java.sql.Connection> getConnection,
>>>>>>>>         String sql,
>>>>>>>>         long lowerBound, long upperBound, int numPartitions,
>>>>>>>>         scala.Function1<java.sql.ResultSet, T> mapRow,
>>>>>>>>         scala.reflect.ClassTag<T> evidence$1)
>>>>>>>>
>>>>>>>> (see
>>>>>>>> https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/rdd/JdbcRDD.html)
>>>>>>>>
>>>>>>>> where *lowerBound* refers to the lower boundary of the entire data,
>>>>>>>> *upperBound* refers to the upper boundary of the entire data, and
>>>>>>>> *numPartitions* refers to the number of partitions
>>>>>>>>
>>>>>>>> The source table from which JdbcRDD is fetching data in Oracle DB
>>>>>>>> has more than 500 records, but the results are confusing when I try
>>>>>>>> several executions changing the "numPartitions" parameter:
>>>>>>>>
>>>>>>>> LowerBound, UpperBound, numPartitions : Output Count
>>>>>>>>
>>>>>>>> 0, 100, 1 : 100
>>>>>>>> 0, 100, 2 : 151
>>>>>>>> 0, 100, 3 : 201
>>>>>>>>
>>>>>>>>
>>>>>>>> Please help me understand why the output count is 151 if
>>>>>>>> numPartitions is 2 and 201 if numPartitions is 3.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Satish Chandra
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
