-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42327/#review115691
-----------------------------------------------------------
connector/connector-hdfs/pom.xml (line 88)
<https://reviews.apache.org/r/42327/#comment176705>

    Does the exclusion from dependencyManagement apply here? I don't think
    you need both.


connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java (line 64)
<https://reviews.apache.org/r/42327/#comment176706>

    Minor: when you relocate dependencies like this, you are more likely to
    get conflicts when backporting changes or cherry-picking across
    branches. I recommend reverting a move like this and setting your
    editor to avoid relocations.


connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java (line 249)
<https://reviews.apache.org/r/42327/#comment176713>

    Why not use the existing Hadoop class? That's what we did for Kite:

    https://github.com/kite-sdk/kite/blob/master/kite-data/kite-data-core/src/main/java/org/kitesdk/data/spi/filesystem/InputFormatReader.java#L69

    Then you don't need to include your own anonymous class. There's a bit
    of overhead to get the right constructor, but I think that's a better
    strategy.


connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java (line 503)
<https://reviews.apache.org/r/42327/#comment176708>

    I think a better and cheaper check is to read the first 4 bytes (and/or
    the last 4 bytes) and make sure they are "PAR1". Every valid Parquet
    file begins and ends with that 4-byte magic pattern.


connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java (line 53)
<https://reviews.apache.org/r/42327/#comment176753>

    Does this make the default UNCOMPRESSED? I think I'd rather see SNAPPY
    by default.


connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java (line 58)
<https://reviews.apache.org/r/42327/#comment176754>

    Why use the OutputFormat and RecordWriter here instead of an
    AvroParquetWriter? That is more straightforward.
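To illustrate the magic-number check suggested above for HdfsExtractor, here is a minimal sketch. The class and method names are hypothetical, and it reads from a local file via java.io for simplicity; the actual connector code would read from an HDFS FSDataInputStream instead.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ParquetMagicCheck {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the file begins and ends with the Parquet magic bytes. */
    public static boolean hasParquetMagic(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            // A valid Parquet file holds at least the header magic,
            // a 4-byte footer length, and the trailing magic.
            if (file.length() < MAGIC.length * 2 + 4) {
                return false;
            }
            byte[] head = new byte[MAGIC.length];
            byte[] tail = new byte[MAGIC.length];
            file.readFully(head);
            file.seek(file.length() - MAGIC.length);
            file.readFully(tail);
            return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
        }
    }
}
```

This avoids constructing a reader just to probe the format: two short reads decide whether the file can possibly be Parquet.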
connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java (line 156)
<https://reviews.apache.org/r/42327/#comment176755>

    Should this be assertTrue instead of assert?


connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java (line 179)
<https://reviews.apache.org/r/42327/#comment176758>

    Why is default compression not supported?


connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java (line 188)
<https://reviews.apache.org/r/42327/#comment176759>

    Why is this value changed?


connector/connector-sdk/src/main/java/org/apache/sqoop/connector/common/SqoopAvroUtils.java (line 46)
<https://reviews.apache.org/r/42327/#comment176762>

    I think this should be an error. Changing names without warning will
    confuse users. If the name is invalid, then it should be mapped
    somehow, or there should be an option (defaulted to false) to turn on
    this behavior.


pom.xml (line 716)
<https://reviews.apache.org/r/42327/#comment176703>

    You should only need to depend on parquet-avro and parquet-hadoop. The
    parquet-common module will be included transitively, and I don't think
    there is anything in there you would need to depend on directly.


pom.xml (line 720)
<https://reviews.apache.org/r/42327/#comment176704>

    Why exclude Avro? The dependency version should be managed by Maven to
    match the version depended on by Sqoop directly.


test/src/test/java/org/apache/sqoop/integration/connector/hdfs/ParquetTest.java (line 55)
<https://reviews.apache.org/r/42327/#comment176769>

    This only has positive test cases. I'd like to see at least a negative
    case for the null handling that I saw above. Any other cases where the
    CSV passed into the writer is unexpected would be good, to validate
    this isn't going to blow up at runtime when the data producer does
    something unexpected.
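To illustrate the "error instead of silent rename" suggestion for SqoopAvroUtils, here is a hypothetical validator (class and method names are mine, not from the patch) based on the name rule given in the Avro specification, which requires names to match [A-Za-z_][A-Za-z0-9_]*.

```java
import java.util.regex.Pattern;

public class AvroNameCheck {
    // Name rule from the Avro specification:
    // start with [A-Za-z_], followed by [A-Za-z0-9_]*.
    private static final Pattern AVRO_NAME =
            Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Throws instead of silently rewriting an invalid name. */
    public static String requireValidAvroName(String name) {
        if (name == null || !AVRO_NAME.matcher(name).matches()) {
            throw new IllegalArgumentException(
                    "Column name is not a valid Avro name: " + name);
        }
        return name;
    }
}
```

Failing loudly (or mapping names only behind an opt-in flag) keeps the schema the user sees consistent with the columns they asked for.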
test/src/test/java/org/apache/sqoop/integration/connector/hdfs/ParquetTest.java (line 96)
<https://reviews.apache.org/r/42327/#comment176763>

    Why is this a multiset?


- Ryan Blue


On Jan. 21, 2016, 11:06 a.m., Abraham Fine wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42327/
> -----------------------------------------------------------
> 
> (Updated Jan. 21, 2016, 11:06 a.m.)
> 
> 
> Review request for Sqoop.
> 
> 
> Bugs: SQOOP-2788
>     https://issues.apache.org/jira/browse/SQOOP-2788
> 
> 
> Repository: sqoop-sqoop2
> 
> 
> Description
> -------
> 
> read and write parquet in hdfsconnector
> 
> 
> Diffs
> -----
> 
>   connector/connector-hdfs/pom.xml 599631418ca63cc43d645c1ee1e7a73dc824b313 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsExtractor.java a813c479a07d68e14ed49936f642e762e5b37437 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsLoader.java 774221aaf5c8cdb8d26ca108fae239598b42229b 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/HdfsPartition.java 644de60581faf90ceb2fcef8d3e0544067791fcc 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/configuration/ToFormat.java 27d121f529ecb4d5bd79e2b8c74ab8f7cc15fb10 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/GenericHdfsWriter.java 2ccccc4a94a582c8b47ccdefa523d1fd1632e627 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsParquetWriter.java PRE-CREATION 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsSequenceWriter.java 75c2e7ef192d7d9628e622cc3c5ef176e33a73d0 
>   connector/connector-hdfs/src/main/java/org/apache/sqoop/connector/hdfs/hdfsWriter/HdfsTextWriter.java 78cf9732fdb89689b04d43e4af70ca5a43732dbf 
>   connector/connector-hdfs/src/test/java/org/apache/sqoop/connector/hdfs/TestLoader.java 11fcef2a38209c79928f582cf8aa03e889247f22 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/common/SqoopAvroUtils.java 985149cbb0d28b55a19d17076d996364d7f2ae90 
>   connector/connector-sdk/src/main/java/org/apache/sqoop/connector/idf/AVROIntermediateDataFormat.java d78fa8b72ecfe62eeec240e01597e7f2a7e4dd76 
>   pom.xml cb8a973abc96af1de905cebd80d30177cbaf1cb4 
>   test/pom.xml 644a9c7dbc746d4a3268532bdcf0babd4faaafba 
>   test/src/test/java/org/apache/sqoop/integration/connector/hdfs/ParquetTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/42327/diff/
> 
> 
> Testing
> -------
> 
> integration tests pass
> 
> 
> Thanks,
> 
> Abraham Fine
> 