[ https://issues.apache.org/jira/browse/PHOENIX-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350498#comment-16350498 ]
Josh Mahonin commented on PHOENIX-4490:
---------------------------------------

FWIW, I think there should be a more elegant solution here. It would be nice if these sorts of parameters could be passed in as options to the DataFrame / Dataset builder, and then carried forward as needed. As I recall, the Configuration object itself is _not_ Serializable, which is a big challenge for Spark, and why it gets re-created several times within the phoenix-spark module. Perhaps there's another solution to that problem we could leverage?

Glad there's a workaround, but if anyone has time for a patch to the underlying issue, that would be fantastic!

> Phoenix Spark Module doesn't pass in user properties to create connection
> -------------------------------------------------------------------------
>
>            Key: PHOENIX-4490
>            URL: https://issues.apache.org/jira/browse/PHOENIX-4490
>        Project: Phoenix
>     Issue Type: Bug
>       Reporter: Karan Mehta
>       Priority: Major
>
> The Phoenix Spark module doesn't work correctly in a Kerberos environment. This is because whenever new {{PhoenixRDD}}s are built, they are always built with new, default properties. The following piece of code in {{PhoenixRelation}} is an example. This is the class Spark uses to create a {{BaseRelation}} before executing a scan.
> {code}
> new PhoenixRDD(
>   sqlContext.sparkContext,
>   tableName,
>   requiredColumns,
>   Some(buildFilter(filters)),
>   Some(zkUrl),
>   new Configuration(),
>   dateAsTimestamp
> ).toDataFrame(sqlContext).rdd
> {code}
> This works fine in most cases when the Spark code runs on the same cluster as HBase, since the config object picks up properties from classpath XML files. However, in an external environment we should use the user-provided properties and merge them before creating any {{PhoenixRelation}} or {{PhoenixRDD}}. As per my understanding, we should ideally provide properties in the {{DefaultSource#createRelation()}} method.
> An example of when this fails: Spark tries to get the splits to optimize MR performance when loading data from the table, in the {{PhoenixInputFormat#generateSplits()}} method. Ideally it should get all the config parameters from the {{JobContext}} being passed in, but it is defaulted to {{new Configuration()}}, irrespective of what the user passes in. Thus it fails to create a connection.
> [~jmahonin] [~maghamraviki...@gmail.com]
> Any ideas or advice? Let me know if I am missing anything obvious here.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
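One common way around the serialization problem Josh mentions is a wrapper that snapshots the configuration's key/value entries into a serializable form on write, and rebuilds a fresh config object on read; Spark uses a similar trick internally for Hadoop's `Configuration`. A minimal, self-contained Scala sketch of that pattern — note that `FakeConfiguration` is only a stand-in for Hadoop's non-serializable `Configuration`, and all class and method names here are illustrative, not part of Phoenix or Spark:

```scala
import java.io._
import scala.collection.mutable

// Stand-in for org.apache.hadoop.conf.Configuration, which does NOT
// implement java.io.Serializable (hypothetical class for illustration).
class FakeConfiguration {
  private val props = mutable.Map[String, String]()
  def set(k: String, v: String): Unit = props(k) = v
  def get(k: String): String = props.getOrElse(k, null)
  def entries: Map[String, String] = props.toMap
}

// Serializable wrapper: on serialization, write only an immutable snapshot
// of the entries; on deserialization, rebuild a fresh configuration from
// that snapshot. The wrapped config itself is marked @transient so Java
// serialization never tries to serialize it directly.
class SerializableConfig(@transient var conf: FakeConfiguration)
    extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit =
    out.writeObject(conf.entries)

  private def readObject(in: ObjectInputStream): Unit = {
    val restored = in.readObject().asInstanceOf[Map[String, String]]
    conf = new FakeConfiguration
    restored.foreach { case (k, v) => conf.set(k, v) }
  }
}
```

With a wrapper like this, user-provided options merged into the config on the driver would survive the trip to the executors, rather than being lost when the config is re-created with `new Configuration()`.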