[jira] [Commented] (SQOOP-331) Support boundary query on the command line

Arvind Prabhakar (JIRA) Tue, 06 Sep 2011 11:53:35 -0700

    [ 
https://issues.apache.org/jira/browse/SQOOP-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098262#comment-13098262
 ]


Arvind Prabhakar commented on SQOOP-331:
----------------------------------------

Thanks for the excellent patch, Jarek. It is almost ready for commit except for 
a few things:

* The case for free-form query needs to be handled. This would be in the 
{{DataDrivenImportJob}} class within the {{else}} block below your current 
modifications. Here is a sample diff that I cooked up quickly to convey this 
point:
{noformat}
--- src/java/com/cloudera/sqoop/mapreduce/DataDrivenImportJob.java      (revision 1165478)
+++ src/java/com/cloudera/sqoop/mapreduce/DataDrivenImportJob.java      (working copy)
@@ -153,15 +153,26 @@
         DataDrivenDBInputFormat.setInput(job, DBWritable.class,
             mgr.escapeTableName(tableName), whereClause,
             mgr.escapeColName(splitByCol), sqlColNames);
+
+        // If user specified boundary query on the command line propagate it to
+        // the job
+        if(options.getBoundaryQuery() != null) {
+          DataDrivenDBInputFormat.setBoundingQuery(job.getConfiguration(),
+                  options.getBoundaryQuery());
+        }
       } else {
         // Import a free-form query.
         String inputQuery = options.getSqlQuery();
         String sanitizedQuery = inputQuery.replace(
             DataDrivenDBInputFormat.SUBSTITUTE_TOKEN, " (1 = 1) ");
 
-        String inputBoundingQuery =
-            mgr.getInputBoundsQuery(splitByCol, sanitizedQuery);
+        String inputBoundingQuery = options.getBoundaryQuery();
+
         if (inputBoundingQuery == null) {
+          inputBoundingQuery = mgr.getInputBoundsQuery(splitByCol, sanitizedQuery);
+        }
+
+        if (inputBoundingQuery == null) {
             inputBoundingQuery = "SELECT MIN(" + splitByCol + "), MAX("
                     + splitByCol + ") FROM (" + sanitizedQuery + ") AS t1";
         }

{noformat}

* Second: since you have introduced a new command line option, the user guide 
and man pages need to be updated. These are located under the {{src/docs}} 
directory and can be built using the {{ant docs}} target. To build them, 
though, you need to have {{asciidoc}} installed on your machine.
* Bonus nit: there is a checkstyle violation in ImportTool:530 where the line 
is longer than 80 characters.
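
To spell out the precedence intended for the free-form query case in the first 
point, here is a minimal standalone sketch. It is not Sqoop code: the class 
{{BoundaryQueryResolver}} and its {{resolve}} method are hypothetical names, 
and the two nullable arguments merely stand in for the results of 
{{options.getBoundaryQuery()}} and {{mgr.getInputBoundsQuery(...)}}. It only 
illustrates the order in which the three sources should be consulted.

```java
// Illustrative sketch only: BoundaryQueryResolver and resolve() are
// hypothetical names and are not part of Sqoop.
final class BoundaryQueryResolver {

  private BoundaryQueryResolver() { }

  /**
   * Picks the bounding query for a free-form query import.
   *
   * @param userQuery      boundary query from the command line, or null
   * @param managerQuery   query suggested by the connection manager, or null
   * @param splitByCol     name of the split-by column
   * @param sanitizedQuery free-form query with the substitute token replaced
   */
  static String resolve(String userQuery, String managerQuery,
      String splitByCol, String sanitizedQuery) {
    // 1. A boundary query given on the command line always wins.
    if (userQuery != null) {
      return userQuery;
    }
    // 2. Otherwise use the connection manager's suggestion, if there is one.
    if (managerQuery != null) {
      return managerQuery;
    }
    // 3. Last resort: the generic MIN/MAX query over the sanitized query.
    return "SELECT MIN(" + splitByCol + "), MAX(" + splitByCol
        + ") FROM (" + sanitizedQuery + ") AS t1";
  }

  public static void main(String[] args) {
    String sanitized = "SELECT id, name FROM employees WHERE  (1 = 1) ";
    // No user or manager query: falls through to the generic MIN/MAX query.
    System.out.println(resolve(null, null, "id", sanitized));
    // A user-supplied query takes precedence over everything else.
    System.out.println(
        resolve("SELECT min_id, max_id FROM id_ranges", null, "id", sanitized));
  }
}
```

The point of the ordering is that the generated MIN/MAX query is only a 
fallback, so a user who already knows the bounds never pays for it.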

Apart from that everything looks great. Some suggestions going forward:
* Usually, when introducing new functionality, it is required to have at least 
one test that exercises that functionality. The test you have added is good but 
does not really exercise the functionality. 
* We use Apache Review Board (https://reviews.apache.org/) to post reviews for 
patches that are longer than a few lines. This helps the reviewers give 
contextual feedback where necessary. 

Please let me know if you have any questions for me on these suggestions.


> Support boundary query on the command line
> ------------------------------------------
>
>                 Key: SQOOP-331
>                 URL: https://issues.apache.org/jira/browse/SQOOP-331
>             Project: Sqoop
>          Issue Type: New Feature
>          Components: tools
>    Affects Versions: 1.4.0
>            Reporter: Jarek Jarcec Cecho
>            Assignee: Jarek Jarcec Cecho
>         Attachments: SQOOP-331.patch
>
>
> It would be nice if Sqoop allowed the query that fetches the minimum and 
> maximum values used for creating splits in DataDrivenDBInputFormat to be 
> specified on the command line.
> Normally Sqoop generates a query of the following form to get the minimum 
> and maximum values for creating splits: SELECT min($split_by_column), 
> max($split_by_column) FROM $table WHERE $cmd_where. In my use case, I needed 
> to import only a portion of the data, with ranges based on the 
> split_by_column that I had already preselected and that are available in a 
> special table holding data ranges and the corresponding primary key values. 
> So my auto-generated query looked like this: SELECT min(id), max(id) FROM 
> table WHERE id >= min_id and id <= max_id. That query is obviously useless 
> and just creates unnecessary load on the database server. It would be nice 
> to be able to supply my own boundary query that uses the extra table with 
> data ranges.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
