> On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java, > > line 63 > > <https://reviews.apache.org/r/16812/diff/3/?file=454179#file454179line63> > > > > Nit: Do you think that US_ASCII would serve our purpose better?
US_ASCII is a 7-bit encoding I think. We need something that uses all 8-bits, otherwise we'll see an exception. > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/AbstractIntermediateDataFormat.java, > > line 150 > > <https://reviews.apache.org/r/16812/diff/3/?file=454178#file454178line150> > > > > Do not seems to be used anywhere. If it's just plan for the future, I > > would suggest to do it in separate standalone JIRA. Makes sense: https://issues.apache.org/jira/browse/SQOOP-1280 > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java, > > lines 154-161 > > <https://reviews.apache.org/r/16812/diff/3/?file=454179#file454179line154> > > > > We are writing bytes, but reading UTF string - those two are not > > necessary the same, I think that we should read the data same way we have > > written them down. writeUTF and readUTF use modified UTF-8. Any string should be automatically converted. Writing and reading modified UTF-8 should work. > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/test/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormatTest.java, > > lines 24-25 > > <https://reviews.apache.org/r/16812/diff/3/?file=454180#file454180line24> > > > > Nit: Unused imports. They are used here. > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > spi/src/main/java/org/apache/sqoop/connector/spi/SqoopConnector.java, line > > 90 > > <https://reviews.apache.org/r/16812/diff/3/?file=454203#file454203line90> > > > > I would define this method as a concrete and return the CSVInputFormat > > by default. > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > common/src/main/java/org/apache/sqoop/schema/type/Column.java, lines 102-104 > > <https://reviews.apache.org/r/16812/diff/3/?file=454169#file454169line102> > > > > This currently do not seems to be used, so perhaps we can address it in > > separate JIRA? Indeed: https://issues.apache.org/jira/browse/SQOOP-1280 > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java, > > lines 127-132 > > <https://reviews.apache.org/r/16812/diff/3/?file=454179#file454179line127> > > > > Converting Decimal to Double is not a valid operation. Decimal is a > > fixed point data type whereas double is floating point. Hence they have > > different use case and behavior. Also, I think that some decimal columns can be huge. Will try to use the BigDecimal data type. > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/util/InitializationUtils.java, > > line 32 > > <https://reviews.apache.org/r/16812/diff/3/?file=454173#file454173line32> > > > > Might I suggest to get this refactoring done in separate JIRA? > > > > I would also suggest to move this code to the connector SDK (with it's > > dependencies if needed). Could we do minor refactoring (the separating out of the code makes sense since it's reused often). > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > submission/mapreduce/src/main/java/org/apache/sqoop/submission/mapreduce/MapreduceSubmissionEngine.java, > > lines 211-212 > > <https://reviews.apache.org/r/16812/diff/3/?file=454204#file454204line211> > > > > This seems to be a stand alone bug, would you mind addressing it in > > standalone JIRA? getHioSchema doesn't seem to be used, but getConnectorSchema is used all over the place. Is getHioSchema related to SQOOP-1072? > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > execution/mapreduce/src/main/java/org/apache/sqoop/job/etl/HdfsTextImportLoader.java, > > lines 86-89 > > <https://reviews.apache.org/r/16812/diff/3/?file=454188#file454188line86> > > > > Can we make a test for this class to verify that serialization of most > > troublesome combinations will be alright? I'm worried that serializing the > > data as UTF might not work in case that we want to store our binary > > format... Makes sense. > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java, > > lines 90-91 > > <https://reviews.apache.org/r/16812/diff/3/?file=454179#file454179line90> > > > > Why we are removing the first one? > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java, > > lines 79-80 > > <https://reviews.apache.org/r/16812/diff/3/?file=454179#file454179line79> > > > > I'm trying to see why we are setting the "\0" for the escape character > > - are we trying to disable the escape all together? Changing the getFields method to not use OpenCSV. > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java, > > line 117 > > <https://reviews.apache.org/r/16812/diff/3/?file=454179#file454179line117> > > > > Do you think that we can take advantage of the CSV reader to "unescape" > > the string and binary field for us? To skip some of the operations... Unfortunately not. It will strip all escape characters, which is not what is desired. http://sourceforge.net/p/opencsv/support-requests/5/ sums up how to approach parsing apparently. I've removed OpenCSV in favor of custom code. It turns out that major CSV parser implementations in Java don't support quote escaping or escape character escaping. I tried OpenCSV, SuperCSV, JCSV, and JavaCSV. JavaCSV seemed to work, unfortunately it is governed by a GPL license. I don't think we can use it? > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > execution/mapreduce/src/main/java/org/apache/sqoop/job/etl/HdfsExportExtractor.java, > > line 39 > > <https://reviews.apache.org/r/16812/diff/3/?file=454186#file454186line39> > > > > It seems that by this patch entire "Data" class becomes slightly > > obsolete, so I'm thinking if it would make sense to remove it altogether? Would you mind if we removed it in a separate Jira? As far as I can tell it is only being used for its constants after this patch, but it seems to be used every where. > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java, > > lines 93-94 > > <https://reviews.apache.org/r/16812/diff/3/?file=454179#file454179line93> > > > > It seems worth of throwing an exception. Changing the getFields method to not use OpenCSV. > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java, > > lines 134-135 > > <https://reviews.apache.org/r/16812/diff/3/?file=454179#file454179line134> > > > > Probably all right for the first implementation, but we should > > considering implementing most of the supported data types eventually. > > Probably a good candidate for follow up JIRA. +1 > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/test/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormatTest.java, > > line 34 > > <https://reviews.apache.org/r/16812/diff/3/?file=454180#file454180line34> > > > > I would suggest to add following tests: > > > > * Create byte array with all 256 possibilities and try the > > serialization/deserialization > > * Create String with all characters that we are usually replacing and > > try the serialization/deserialization Great idea! This helped isolate some problems with the CSV parser unfortunately. > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java, > > lines 86-87 > > <https://reviews.apache.org/r/16812/diff/3/?file=454179#file454179line86> > > > > It seems worth of throwing an exception. Changing the getFields method to not use OpenCSV. > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > pom.xml, lines 121-125 > > <https://reviews.apache.org/r/16812/diff/3/?file=454201#file454201line121> > > > > The dependency section on the root pom.xml file is only for items that > > are required by all modules. Hence the CSV dependency should rather go to > > dependencyManagement section and then to corresponding submodule. Removed OpenCSV. Maybe can follow up with a Jira to add a better CSV parser? Might have to make these CSV parsers better in general. Maybe when Apache Commons CSV parser is ready? > On Feb. 6, 2014, 12:25 a.m., Jarek Cecho wrote: > > execution/mapreduce/src/main/java/org/apache/sqoop/job/io/SqoopWritable.java, > > lines 28-31 > > <https://reviews.apache.org/r/16812/diff/3/?file=454189#file454189line28> > > > > This seems a bit troublesome to me to force the IDF extract the data > > into String where we do not necessary have to. What about using the IDF > > itself to call the write() and readFields()? > > > > I've looked into other Writables and it seems that if we will make the > > Writable also Configurable, Hadoop will pass the config object for us > > before any serialization/deserialization. We should try it out and see if > > that will help us avoid the unnecessary serialization. Check out > > GenericWritable as an example: > > > > > > http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/GenericWritable.html Having the IDF do the writing makes sense. I'll update it in my next review. I checked out GenericWritable and it seems it implements Configurable and which should receive a Configuration object before serialization/deserialization as you've mentioned. This makes sense. I'll test it out a bit. Thanks for finding this! - Abraham ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/16812/#review33500 ----------------------------------------------------------- On Jan. 30, 2014, 3:17 a.m., Abraham Elmahrek wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/16812/ > ----------------------------------------------------------- > > (Updated Jan. 30, 2014, 3:17 a.m.) > > > Review request for Sqoop. > > > Bugs: SQOOP-777 > https://issues.apache.org/jira/browse/SQOOP-777 > > > Repository: sqoop-sqoop2 > > > Description > ------- > > Previous review from hari is in first iteration of this review only. > > commit 1c5122611fbd4a46a629421f7c55746cfc14f136 > Author: Hari Shreedharan <[email protected]> > Date: Fri Jan 10 15:07:08 2014 -0800 > > SQOOP-777. Sqoop2: Pluggable Intermediate Data Format > > Implemented a pluggable intermediate data format that decouples the > internal representation of the data from the connector and the output > formats. Connectors can choose to implement and support a format that is more > efficient for them. Also separated the SqoopWritable so th > > I ran a full build - all tests including integration tests pass. I have > not added any new tests, yet. I will add unit tests for the new classes. > Also, I have not tried running this on an actual cluster - so things may be > broken. I'd like some initial feedback based on the c > > I also implemented escaping of characters. There is some work remaining > to support binary format, but it is mostly integration, the basic > implementation is in place. > > :100644 100644 db11b5b... 9bfa07d... M common/pom.xml > :100644 100644 3e1adc7... f971240... M > common/src/main/java/org/apache/sqoop/etl/io/DataReader.java > :100644 100644 d81364e... e547875... M > common/src/main/java/org/apache/sqoop/etl/io/DataWriter.java > :100644 100644 8b630b2... 30c26a3... M > common/src/main/java/org/apache/sqoop/schema/type/Column.java > :100644 100644 e0da80f... 9c70db9... M > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcConnector.java > :100644 100644 ef39cdc... 075890f... M > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcExportInitializer.java > :100644 100644 96818ba... 795ffdb... M > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportInitializer.java > :000000 100644 0000000... 6c33423... A > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/util/InitializationUtils.java > :100644 100644 aa1c4ff... 7220018... M > connector/connector-generic-jdbc/src/test/java/org/apache/sqoop/connector/jdbc/TestExportLoader.java > :100644 100644 a7ed6ba... 28399f2... M > connector/connector-generic-jdbc/src/test/java/org/apache/sqoop/connector/jdbc/TestImportExtractor.java > :100644 100644 4056e14... f54837d... M connector/connector-sdk/pom.xml > :000000 100644 0000000... d4874f2... A > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/CSVIntermediateDataFormat.java > :000000 100644 0000000... 63e14d2... A > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/IntermediateDataFormat.java > :000000 100644 0000000... 6e5479f... A > connector/connector-sdk/src/test/java/org/apache/sqoop/connector/CSVIntermediateDataFormatTest.java > :100644 100644 e052584... 3b2ef94... M > core/src/main/java/org/apache/sqoop/framework/JobManager.java > :100644 100644 53d0039... 9f5c47d... M > core/src/main/java/org/apache/sqoop/framework/SubmissionRequest.java > :100644 100644 f9a2a0e... 9754afd... M execution/mapreduce/pom.xml > :100644 100644 5c0a027... 7680e33... M > execution/mapreduce/src/main/java/org/apache/sqoop/execution/mapreduce/MapreduceExecutionEngine.java > :100644 100644 7fd9a01... 3604898... M > execution/mapreduce/src/main/java/org/apache/sqoop/job/JobConstants.java > :100644 100644 1978ec6... 099cdd3... M > execution/mapreduce/src/main/java/org/apache/sqoop/job/etl/HdfsExportExtractor.java > :100644 100644 a07c511... ee6bf39... M > execution/mapreduce/src/main/java/org/apache/sqoop/job/etl/HdfsSequenceImportLoader.java > :100644 100644 4621942... 7b799ca... M > execution/mapreduce/src/main/java/org/apache/sqoop/job/etl/HdfsTextImportLoader.java > :000000 100644 0000000... 71ae980... A > execution/mapreduce/src/main/java/org/apache/sqoop/job/io/SqoopWritable.java > :100644 100644 356ae8a... b495cc9... M > execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopFileOutputFormat.java > :100644 100644 92de37e... 8164ffe... M > execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopMapper.java > :100644 100644 90de6ef... b3503da... M > execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopNullOutputFormat.java > :100644 100644 7dedee9... 16e59d8... M > execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopOutputFormatLoadExecutor.java > :100644 100644 98a2c51... 5aaceb3... M > execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopReducer.java > :100644 100644 e21f15b... 4e9e6ea... M > execution/mapreduce/src/test/java/org/apache/sqoop/job/JobUtils.java > :100644 100644 b7079dd... ddda423... M > execution/mapreduce/src/test/java/org/apache/sqoop/job/TestHdfsExtract.java > :100644 100644 f849aae... 2968411... M > execution/mapreduce/src/test/java/org/apache/sqoop/job/TestHdfsLoad.java > :100644 100644 7b264c6... 954990f... M > execution/mapreduce/src/test/java/org/apache/sqoop/job/TestMapReduce.java > :000000 100644 0000000... aea7de3... A > execution/mapreduce/src/test/java/org/apache/sqoop/job/io/SqoopWritableTest.java > :100644 100644 bee8ab7... 663dfb5... M > execution/mapreduce/src/test/java/org/apache/sqoop/job/mr/TestSqoopOutputFormatLoadExecutor.java > :100644 100644 1e2f005... a722c74... M pom.xml > :100644 100644 0b240e8... 43f17d4... M spi/pom.xml > :100644 100644 2becc56... 298e8f5... M > spi/src/main/java/org/apache/sqoop/connector/spi/SqoopConnector.java > :100644 100644 6fc485b... 8d1a4e8... M > submission/mapreduce/src/main/java/org/apache/sqoop/submission/mapreduce/MapreduceSubmissionEngine.java > > > Diffs > ----- > > common/pom.xml db11b5ba70093704bc3f7a0a98cf7d44172e2b7d > common/src/main/java/org/apache/sqoop/etl/io/DataReader.java > 3e1adc7084cf9ef1b93c8146110c751c7de34376 > common/src/main/java/org/apache/sqoop/etl/io/DataWriter.java > d81364ef81bc747437f5b78520bacec8ee2613a3 > common/src/main/java/org/apache/sqoop/schema/type/Column.java > 8b630b28295700b204da9f5a948eaef106ca559b > > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcConnector.java > e0da80f77230fb836ef6ab36e3bd867d6cef553a > > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcExportInitializer.java > ef39cdcbc41280a0dec338c8c2ab27b7cf76385b > > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportInitializer.java > 96818ba27b25538643f9c5777ed8abebdf0eb1e9 > > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/util/InitializationUtils.java > PRE-CREATION > > connector/connector-generic-jdbc/src/test/java/org/apache/sqoop/connector/jdbc/TestExportLoader.java > aa1c4ff7ff138de5efc5bc1bca5b0d869d214a1c > > connector/connector-generic-jdbc/src/test/java/org/apache/sqoop/connector/jdbc/TestImportExtractor.java > a7ed6bab95c76c4928414e7850cc3e52a3ce4595 > > connector/connector-generic-jdbc/src/test/java/org/apache/sqoop/connector/jdbc/TestImportInitializer.java > a33fa3631400c3382a72e6286b1efb3fca0c9cb9 > connector/connector-sdk/pom.xml 4056e14978cb5e679233e9fdde11d53f1ca2756b > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/AbstractIntermediateDataFormat.java > PRE-CREATION > > connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormat.java > PRE-CREATION > > connector/connector-sdk/src/test/java/org/apache/sqoop/connector/idf/CSVIntermediateDataFormatTest.java > PRE-CREATION > core/src/main/java/org/apache/sqoop/framework/JobManager.java > e0525846029bb394b2614a33595ebe6fcdd4aaf2 > core/src/main/java/org/apache/sqoop/framework/SubmissionRequest.java > 53d003980a172e4e0acf18630a3496909c17cb5c > execution/mapreduce/pom.xml f9a2a0e57b499cc785bdb6548a2d6113e7b867c0 > > execution/mapreduce/src/main/java/org/apache/sqoop/execution/mapreduce/MapreduceExecutionEngine.java > 5c0a02739a5f7598dc83ab82631b80410ab39213 > execution/mapreduce/src/main/java/org/apache/sqoop/job/JobConstants.java > 7fd9a017140fb91292a88e5a944ae549915c86e4 > > execution/mapreduce/src/main/java/org/apache/sqoop/job/etl/HdfsExportExtractor.java > 1978ec61e07beb6b1e75bc4da8c698c738ac4e4c > > execution/mapreduce/src/main/java/org/apache/sqoop/job/etl/HdfsSequenceImportLoader.java > a07c5111e9e037dab4dacb128a53918f61599495 > > execution/mapreduce/src/main/java/org/apache/sqoop/job/etl/HdfsTextImportLoader.java > 462194238755e603dd60226a97cf357cf63e0f20 > > execution/mapreduce/src/main/java/org/apache/sqoop/job/io/SqoopWritable.java > PRE-CREATION > > execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopFileOutputFormat.java > 356ae8ab6f3cad13b644e1f033278d9012938956 > execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopMapper.java > 92de37e1b7ffd3c4ce584e4d10c4e1f9af03c369 > > execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopNullOutputFormat.java > 90de6efa5f3145d0480e375d82bd5c6a9760a799 > > execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopOutputFormatLoadExecutor.java > 7dedee9f12b11e3af5cf6ee5c81bbc377a04a4fd > execution/mapreduce/src/main/java/org/apache/sqoop/job/mr/SqoopReducer.java > 98a2c518b65130ebe272faf721b77099a1e85a63 > execution/mapreduce/src/test/java/org/apache/sqoop/job/JobUtils.java > e21f15b8be2aacb7463ab4c41cfce63cbb71cf57 > execution/mapreduce/src/test/java/org/apache/sqoop/job/TestHdfsExtract.java > b7079dd47c80c776a508b07f0ee5152c1d3fd20e > execution/mapreduce/src/test/java/org/apache/sqoop/job/TestHdfsLoad.java > f849aaef10c3b995afbe6326353256affbceee67 > execution/mapreduce/src/test/java/org/apache/sqoop/job/TestMapReduce.java > 7b264c68ae761fbaf91bc51ed312de70b9ba3fd3 > > execution/mapreduce/src/test/java/org/apache/sqoop/job/io/SqoopWritableTest.java > PRE-CREATION > > execution/mapreduce/src/test/java/org/apache/sqoop/job/mr/TestSqoopOutputFormatLoadExecutor.java > bee8ab75f1f8e6508fca905aafe7c67f6b5d8d9a > pom.xml 1e2f0057ca2e9cd635ccbce75e5d6e39eeabb163 > spi/pom.xml 0b240e8a1146cc7a06a2c50c181fb73b11e6dd8c > spi/src/main/java/org/apache/sqoop/connector/spi/SqoopConnector.java > 2becc56364343f1cffdf91d9b17144666acbdd2e > > submission/mapreduce/src/main/java/org/apache/sqoop/submission/mapreduce/MapreduceSubmissionEngine.java > 6fc485b66ea8c33d09d172675a104954d570f3a7 > > Diff: https://reviews.apache.org/r/16812/diff/ > > > Testing > ------- > > mvn test > > > Thanks, > > Abraham Elmahrek > >
