[
https://issues.apache.org/jira/browse/PHOENIX-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951578#comment-13951578
]
Prashant Kommireddi commented on PHOENIX-898:
---------------------------------------------
Thanks [~jviolettedsiq], this is a useful feature. I have a few comments:
Why are column family names needed? Can the "Phoenix table" column names not be
used directly? Do users care about the column family? This seems a bit different
from your proposal; am I missing something?
{code}
* The above reads a file 'testdata' and writes the elements ID, F.B, and F.E
to HBase.
* In this example, ID is the row key, and F is the column family for the data
elements.
{code}
This could fail if a user specified "hbase://tableName//". It would be cleaner
to parse the user-supplied location as a URI and fetch the "scheme" (hbase),
"authority" (tableName), and "path" (column names) from it.
{code}
+ else if (tokens.length==2) {
+ tableName = tokens[0];
+ columns = tokens[1];
+ config.configure(server, tableName, batchSize, columns);
+ } else {
{code}
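A minimal sketch of that URI-based parsing (the class and method names here are hypothetical, not from the patch; the patch throws IOException, while this standalone version uses IllegalArgumentException to stay self-contained):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class LocationParser {
    // Split "hbase://tableName/col1,col2" into { tableName, columns };
    // the columns element is null when no column list is given.
    static String[] parseLocation(String location) {
        URI uri;
        try {
            uri = new URI(location);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid location: " + location, e);
        }
        if (!"hbase".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Unrecognized scheme in location: " + location);
        }
        String tableName = uri.getAuthority();
        if (tableName == null || tableName.isEmpty()) {
            throw new IllegalArgumentException("Missing table name in location: " + location);
        }
        // Strip leading/trailing slashes so "hbase://tableName//" yields no columns.
        String columns = uri.getPath().replaceAll("^/+", "").replaceAll("/+$", "");
        return new String[] { tableName, columns.isEmpty() ? null : columns };
    }
}
```

java.net.URI cleanly separates scheme, authority, and path, so malformed inputs like "hbase://tableName//" no longer slip past a simple token count.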
Let's also have a cleaner error message for users in the following. We could
change it to say exactly what went wrong (unrecognized scheme, authority, or
path).
{code}
else {
+ throw new IOException(String.format("Invalid location: ",location));
+ }
{code}
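As a side note, the quoted format string is also missing its %s placeholder, so the offending location is silently dropped from the message. A more descriptive version might look like this (a sketch; the class name and wording are illustrative):

```java
import java.io.IOException;

public class LocationErrors {
    // Include the offending location and the expected shape in the message;
    // the quoted format string had no %s, so the argument was never printed.
    static IOException invalidLocation(String location) {
        return new IOException(String.format(
                "Invalid location '%s': expected hbase://tableName[/col1,col2,...]",
                location));
    }
}
```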
We can avoid setting all the properties here and instead delegate to the
existing configure method: have this overload call the existing one for the
shared properties, and set only 'columns' itself.
{code}
+ public void configure(String server, String tableName, long batchSize, String columns) {
+ conf.set(SERVER_NAME, server);
+ conf.set(TABLE_NAME, tableName);
+ conf.set(UPSERT_COLUMNS, columns);
+ conf.setLong(UPSERT_BATCH_SIZE, batchSize);
+ conf.setBoolean(MAP_SPECULATIVE_EXEC, false);
+ conf.setBoolean(REDUCE_SPECULATIVE_EXEC, false);
+
+ }
{code}
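The delegation could look roughly like this. It is a standalone sketch: a plain Map stands in for the Hadoop Configuration and the keys are literals, so only the shape of the two configure methods mirrors the patch:

```java
import java.util.HashMap;
import java.util.Map;

public class PhoenixPigConfigurationSketch {
    final Map<String, String> conf = new HashMap<>();

    // Existing method: sets the shared properties.
    public void configure(String server, String tableName, long batchSize) {
        conf.put("SERVER_NAME", server);
        conf.put("TABLE_NAME", tableName);
        conf.put("UPSERT_BATCH_SIZE", Long.toString(batchSize));
        conf.put("MAP_SPECULATIVE_EXEC", "false");
        conf.put("REDUCE_SPECULATIVE_EXEC", "false");
    }

    // New overload: delegate the shared properties, then set only the columns.
    public void configure(String server, String tableName, long batchSize, String columns) {
        configure(server, tableName, batchSize);
        conf.put("UPSERT_COLUMNS", columns);
    }
}
```

This keeps a single place that knows about the speculative-execution and batch-size settings.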
This one is minor, but you could use Guava's Sets.newHashSet instead:
{code}
Set<String> upsertColumnSet = new HashSet<String>();
{code}
You could use Commons Lang's StringUtils.isNotEmpty here:
http://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringUtils.html#isNotEmpty(java.lang.String)
{code}
if (upsertColumns!=null && !"".equals(upsertColumns))
{code}
Let's use a StringBuilder to do the concatenation:
{code}
+ String parsedColumns = "";
+ for (String key : upsertColumnSet) {
+ parsedColumns += key +",";
+ }
{code}
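A sketch of the StringBuilder version (the helper name is made up; note it also avoids the trailing comma the quoted loop leaves behind):

```java
import java.util.Set;

public class ColumnJoiner {
    // Build the comma-separated column list with a StringBuilder instead of
    // repeated String concatenation, which copies the string on every append.
    static String joinColumns(Set<String> upsertColumnSet) {
        StringBuilder sb = new StringBuilder();
        for (String key : upsertColumnSet) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(key);
        }
        return sb.toString();
    }
}
```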
This needs formatting fixes: a space before and after "=", before and after "?",
and after each comma.
{code}
String fullColumn = (colFam==null?colName:String.format("%s.%s",colFam,colName));
{code}
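For reference, with that spacing applied the expression would read as below (wrapped in a hypothetical helper so the sketch compiles on its own):

```java
public class FullColumnName {
    // "family.column" when a column family is present, otherwise just the column name.
    static String fullColumn(String colFam, String colName) {
        return (colFam == null) ? colName : String.format("%s.%s", colFam, colName);
    }
}
```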
Indentation needs to be fixed at a few places.
Can you also please add a test case for this? PHOENIX-888 aims to add test
cases, and [~maghamravikiran] has graciously offered his help on that. We could
get started with a basic test for this use case.
Thanks again for the contribution.
> Extend PhoenixHBaseStorage to specify upsert columns
> ----------------------------------------------------
>
> Key: PHOENIX-898
> URL: https://issues.apache.org/jira/browse/PHOENIX-898
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 3.0.0
> Reporter: James Violette
> Fix For: 3.0.0
>
> Attachments: PHOENIX_898.patch
>
>
> We have a Phoenix table with data from multiple sources. We would like to
> write a pig script that upserts only data associated with a feed, leaving
> other data alone. The current PhoenixHBaseStorage automatically upserts all
> columns in a table.
> Given this table schema as an example,
> create TABLE IF NOT EXISTS MYSCHEMA.MYTABLE
> (NAME varchar not null
> ,D.INFO VARCHAR
> ,D.D1 DOUBLE
> ,D.I1 INTEGER
> ,D.C1 VARCHAR
> CONSTRAINT pk PRIMARY KEY (NAME));
> Assuming relation 'A' is loaded into Pig, the current syntax loads all
> columns into MYSCHEMA.MYTABLE:
> STORE A into 'hbase://MYSCHEMA.MYTABLE' using
> org.apache.phoenix.pig.PhoenixHBaseStorage('localhost','-batchSize 5000');
> We could specify upsert columns after the table in the hbase:// url.
> This column-based example is equivalent to the full table upsert.
> STORE A into 'hbase://MYSCHEMA.MYTABLE/NAME,D.INFO,D.D1,D.I1,D.C1' using
> org.apache.phoenix.pig.PhoenixHBaseStorage('localhost','-batchSize 5000');
> This column-based example chooses to load only three of the five columns.
> STORE A into 'hbase://MYSCHEMA.MYTABLE/NAME,D.INFO,D.I1' using
> org.apache.phoenix.pig.PhoenixHBaseStorage('localhost','-batchSize 5000');
> This change would touch
> PhoenixHBaseStorage.setStoreLocation - parse the columns
> PhoenixPigConfiguration.configure - add an optional column list parameter.
> PhoenixPigConfiguration.setup - create the upsert statement and create the
> column metadata list
> The rest of the code should work as-is.
--
This message was sent by Atlassian JIRA
(v6.2#6252)