Re: lower/upperBound not working/spark 1.3

2015-06-14 Thread Sujeevan
I also thought that it was an issue. After investigating it further, I found
this: https://issues.apache.org/jira/browse/SPARK-6800

Here is the updated documentation of the
*org.apache.spark.sql.jdbc.JDBCRelation#columnPartition* method:


Notice that lowerBound and upperBound are just used to decide the
partition stride, not for filtering the rows in table. So all rows in the
table will be partitioned and returned.
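
To see why every row still comes back, here is a rough sketch of the stride
logic, modeled on that method (this is not the actual Spark source; the helper
name and the sample values are just illustrative):

def strideClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  // stride between partition boundaries
  val stride = upper / numPartitions - lower / numPartitions
  (0 until numPartitions).map { i =>
    // the first partition has no lower bound and the last has no upper bound,
    // so together the predicates cover every row in the table
    val lo = if (i == 0) None else Some(s"$column >= ${lower + i * stride}")
    val hi = if (i == numPartitions - 1) None else Some(s"$column < ${lower + (i + 1) * stride}")
    Seq(lo, hi).flatten.mkString(" AND ")
  }
}

// strideClauses("cs_id", 1, 1000, 4) gives roughly:
//   cs_id < 251
//   cs_id >= 251 AND cs_id < 501
//   cs_id >= 501 AND cs_id < 751
//   cs_id >= 751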

So the filter has to be added manually in the query.

val jdbcDF = sqlContext.jdbc(url =
  "jdbc:postgresql://localhost:5430/dbname?user=user&password=111", table =
  "(select * from se_staging.exp_table3 where cs_id >= 1 and cs_id <= 1) as exp_table3new",
  columnName = "cs_id", lowerBound = 1, upperBound = 1, numPartitions = 12)




Best Regards,

Sujeevan. N

On Mon, Mar 23, 2015 at 4:02 AM, Ted Yu yuzhih...@gmail.com wrote:

 I went over JDBCRelation#columnPartition() but didn't find an obvious clue
 (you can add more logging to confirm that the partitions were generated
 correctly).

 Looks like the issue may be somewhere else.

 Cheers

 On Sun, Mar 22, 2015 at 12:47 PM, Marek Wiewiorka 
 marek.wiewio...@gmail.com wrote:

 ...I even tried setting upper/lower bounds to the same value like 1 or 10,
 with the same result.
 cs_id is a column with a cardinality of ~5*10^6,
 so this is not the case here.

 Regards,
 Marek

 2015-03-22 20:30 GMT+01:00 Ted Yu yuzhih...@gmail.com:

 From javadoc of JDBCRelation#columnPartition():

 * Given a partitioning schematic (a column of integral type, a number of
 * partitions, and upper and lower bounds on the column's value), generate

 In your example, 1 and 1 are the bounds for the value of the cs_id column.

 Looks like all the values in that column fall within the range of 1 and
 1000.

 Cheers

 On Sun, Mar 22, 2015 at 8:44 AM, Marek Wiewiorka 
 marek.wiewio...@gmail.com wrote:

 Hi All - I'm trying to use the new SQLContext API for populating a DataFrame
 from a JDBC data source,
 like this:

 val jdbcDF = sqlContext.jdbc(url =
 "jdbc:postgresql://localhost:5430/dbname?user=user&password=111", table =
 "se_staging.exp_table3", columnName = "cs_id", lowerBound = 1, upperBound = 1,
 numPartitions = 12)

 No matter how I set lower and upper bounds I always get all the rows
 from my table.
 The API is marked as experimental, so I assume there might be some bugs
 in it, but has anybody come across a similar issue?

 Thanks!







Re: what is the best way to transfer data from RDBMS to spark?

2015-04-25 Thread Sujeevan
If your use case is more about querying the RDBMS and bringing the results
into Spark to do some analysis, then the Spark SQL JDBC data source API
http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/
is the best option. If your use case is to bring the entire data set into Spark,
then you'll have to explore other options, which depend on the data source. For
example, the Spark Redshift integration
http://spark-packages.org/package/databricks/spark-redshift
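
As a minimal sketch of that first option (Spark 1.3 data sources API; the URL,
table, and column names below are just placeholders):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local[*]", "jdbc-example")  // placeholder app setup
val sqlContext = new SQLContext(sc)

// Load a table (or a pushed-down query) as a DataFrame over JDBC.
val df = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://localhost:5432/dbname?user=user&password=pass",  // placeholder
  "dbtable" -> "(select id, amount from some_schema.orders where amount > 100) as t",  // placeholder
  "partitionColumn" -> "id",  // integral column, used only to compute the partition stride
  "lowerBound" -> "1",
  "upperBound" -> "1000000",
  "numPartitions" -> "4"))

df.printSchema()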

Best Regards,

Sujeevan. N

On Sat, Apr 25, 2015 at 4:24 PM, ayan guha guha.a...@gmail.com wrote:

 Actually, Spark SQL provides a data source. Here is an excerpt from the documentation:

 JDBC To Other Databases

 Spark SQL also includes a data source that can read data from other
 databases using JDBC. This functionality should be preferred over using
 JdbcRDD
 https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.rdd.JdbcRDD.
 This is because the results are returned as a DataFrame and they can easily
 be processed in Spark SQL or joined with other data sources. The JDBC data
 source is also easier to use from Java or Python as it does not require the
 user to provide a ClassTag. (Note that this is different than the Spark SQL
 JDBC server, which allows other applications to run queries using Spark
 SQL).
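
 For instance, a quick sketch of what that buys you (the table and column names
 below are made up):

 // Read over JDBC, then work with the result in Spark SQL.
 val people = sqlContext.load("jdbc", Map(
   "url" -> "jdbc:postgresql://localhost:5432/db?user=user&password=pass",
   "dbtable" -> "public.people"))
 people.registerTempTable("people")

 // The DataFrame can now be queried or joined with any other data source.
 val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
 adults.show()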

 On Fri, Apr 24, 2015 at 6:27 PM, ayan guha guha.a...@gmail.com wrote:

 What is the specific use case? I can think of a couple of ways (write to
 HDFS and then read from Spark, or stream data to Spark). Also I have seen
 people using MySQL jars to bring data in. Essentially you want to simulate
 the creation of an RDD.
 On 24 Apr 2015 18:15, sequoiadb mailing-list-r...@sequoiadb.com
 wrote:

 If I run Spark in standalone mode (not YARN mode), is there any tool
 like Sqoop that is able to transfer data from an RDBMS to Spark storage?

 Thanks




 --
 Best Regards,
 Ayan Guha


