[ https://issues.apache.org/jira/browse/PHOENIX-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220004#comment-15220004 ]
Josh Mahonin commented on PHOENIX-2804:
---------------------------------------

The repartitioning is controlled entirely on the Spark side; phoenix-spark has no say in it at all. How are you checking the number of partitions? Keep in mind that DataFrames (and RDDs) are immutable, so you need to capture the result of the repartition() call. Here's an example I just ran:

{noformat}
val df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "TABLE1", "zkUrl" -> "localhost:2181"))

df.rdd.partitions.size
// Int = 1

df.repartition(64).rdd.partitions.size
// Int = 64
{noformat}

> Support partition parameter or repartition function for Spark plugin
> --------------------------------------------------------------------
>
>                 Key: PHOENIX-2804
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2804
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.7.0
>            Reporter: SonixLegend
>             Fix For: 4.8.0
>
>         Attachments: screenshot-1.png
>
>
> When I want to load some huge data from Phoenix into Spark DataFrames via the
> phoenix-spark plugin, I set the DataFrames' storage level to disk-only, but
> when I run a join query between the DataFrames, Spark throws
> "java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE",
> because Spark reads over 2G per partition. Can you add a partition
> parameter or override the repartition function for loading data via the
> plugin? Thanks a lot.
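For the join use case in the description (disk-only persistence plus a join), here is a minimal sketch of the workaround above: repartition after loading, capturing the result before persisting. The second table name, the zkUrl, the partition count, and the join column "ID" are illustrative assumptions, not details from the original report:

{noformat}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc` (Spark 1.x API, matching the example above).
val sqlContext = new SQLContext(sc)

// Load two Phoenix tables as DataFrames; "TABLE2" and the zkUrl are assumptions.
val left = sqlContext.load("org.apache.phoenix.spark",
  Map("table" -> "TABLE1", "zkUrl" -> "localhost:2181"))
val right = sqlContext.load("org.apache.phoenix.spark",
  Map("table" -> "TABLE2", "zkUrl" -> "localhost:2181"))

// Repartition BEFORE persisting, capturing the result (DataFrames are immutable),
// so that no single cached block exceeds the ~2GB Integer.MAX_VALUE limit.
val leftRepart = left.repartition(64).persist(StorageLevel.DISK_ONLY)
val rightRepart = right.repartition(64).persist(StorageLevel.DISK_ONLY)

// Join on an assumed shared key column "ID".
val joined = leftRepart.join(rightRepart, leftRepart("ID") === rightRepart("ID"))
joined.count()
{noformat}

The partition count (64 here) would need tuning so that each partition stays well under the 2GB limit for the data volume in question.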