-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/8104/#review13962
-----------------------------------------------------------
Overall looks great! I haven't gone through the test cases yet, but here are my
comments so far.
1) I noticed that I cannot load .avro files that are not record types. For
example, I tried to load a .avro file whose schema is "int" as follows:
[cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar getschema
foo2/test_int.avro
"int"
[cheolsoo@cheolsoo-mr1-0 pig-svn]$ java -jar avro-tools-1.5.4.jar tojson
foo2/test_int.avro
1
in = LOAD 'foo2/test_int.avro' USING AvroStorage('int');
DUMP in;
This gives me the following error:
Caused by: java.io.IOException: avroSchemaToResourceSchema only processes
records
Can only Avro record type be loaded? Or am I doing something wrong?
2) TestAvroStorage needs to be more automated. To run it, I had to run the
following commands:
ant clean compile-test
cd ./test/org/apache/pig/builtin/avro
python createests.py
cd -
ant clean test -Dtestcase=TestAvroStorage
Ideally, I should be able to run a single command: ant clean
-Dtestcase=TestAvroStorage. Please let me know if you need help for this.
3) python createests.py fails with the following errors. I suppose that some
files are missing:
creating data/avro/uncompressed/testDirectoryCounts.avro
Exception in thread "main" java.io.FileNotFoundException:
data/json/testDirectoryCounts.json (No such file or directory)
...
creating evenFileNameTestDirectoryCounts.avro
Exception in thread "main" java.io.FileNotFoundException:
data/json/evenFileNameTestDirectoryCounts.json (No such file or directory)
...
4) ant test -Dtestcase=TestAvroStorage fails with the following errors. I
suppose that this is due to the missing files:
Testcase: testLoadDirectory took 0.005 sec
FAILED
Testcase: testLoadGlob took 0.004 sec
FAILED
Testcase: testPartialLoadGlob took 0.005 sec
FAILED
5) Typo in the name of createests.py. It should be createtests.py.
6) Is createTests.bash needed at all? If not, can you remove it?
I have more comments inline:
ivy/libraries.properties
<https://reviews.apache.org/r/8104/#comment29844>
Can you update .eclipse.templates/.classpath too?
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29845>
Can you remove this comment?
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29850>
Can you change Exception to SchemaParseException?
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29907>
Wouldn't it be better to make this static?
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29847>
FileSystem.get(new Configuration()).open() will look for a file on the
default FS that is specified by Hadoop conf. But it won't work in the following
case:
1) The default FS is set by installed Hadoop conf as follows:
<name>fs.default.name</name>
<value>hdfs://cheolsoo-mr1-0.ent.cloudera.com:8020</value>
2) Run Pig in *local* mode and specify a schema file that is on local FS.
This gives me the following error:
java.io.FileNotFoundException: File does not exist:
/user/cheolsoo/test_record.avsc
3) Even if I specify the uri schema as file:/, this still gives me the
following error:
java.lang.IllegalArgumentException: Wrong FS:
file:/home/cheolsoo/pig-svn/test_record.avsc, expected:
hdfs://cheolsoo-mr1-0.ent.cloudera.com:8020
Can you instead do the following?
Path p = new Path(...);
(FileSystem.get(p.toUri(), new Configuration()).open(p);
Then, local FS will be used if file:/ is specified in path, and the default
FS is if no uri scheme is specified.
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29848>
Same problem as above.
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29849>
Can you split this into two catch blocks: ParseException and IOException?
It seems incorrect that the help message is printed when FileSystem throws
an IOExcetion.
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29851>
Can you filter out files that start with "." as well?
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29852>
Same as above.
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29853>
This won't work in the following case. Let's say p matches two dirs, and
one dir is empty.
p = foo*
foo1
foo2/bar.avro
I would expect the schema of bar.avro is returned, but I get an IOException
instead.
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29854>
Typo: not => No.
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29857>
Can you rename this class without Fast?
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29856>
Is this needed?
In the constructor, schema is supposed to be set. If not, there must be an
error. Shouldn't we throw an exception instead of re-trying to set schema?
Please correct me if I am wrong.
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29855>
Can you replace ".avro" with AvroOutputFormat.EXT?
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29870>
Can you move this comment to the checkSchema() method?
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29858>
Can you move this comment to the prepareToWrite() method?
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29859>
Can you instead use log.debug(..., e)?
src/org/apache/pig/builtin/AvroStorage.java
<https://reviews.apache.org/r/8104/#comment29865>
Can you instead use log.debug(..., e)?
src/org/apache/pig/builtin/TrevniStorage.java
<https://reviews.apache.org/r/8104/#comment29860>
Can you filter out files that start with "." as well?
src/org/apache/pig/builtin/TrevniStorage.java
<https://reviews.apache.org/r/8104/#comment29861>
AvroStorage accepts files that do not end ".avro". Shouldn't TrevniStorage
do the same?
src/org/apache/pig/builtin/TrevniStorage.java
<https://reviews.apache.org/r/8104/#comment29862>
Is this needed?
In the constructor, schema is supposed to be set. If not, there must be an
error. Shouldn't we throw an exception instead of re-trying to set schema?
src/org/apache/pig/builtin/TrevniStorage.java
<https://reviews.apache.org/r/8104/#comment29863>
AvroStorage accepts files that do not end ".avro". Shouldn't TrevniStorage
do the same?
src/org/apache/pig/impl/util/AvroRecordReader.java
<https://reviews.apache.org/r/8104/#comment29864>
I can't find where -ignoreErrors is used. I guess that error handling for
bad files is not implemented yet?
src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java
<https://reviews.apache.org/r/8104/#comment29866>
Can you instead use log.debug(..., e)?
src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java
<https://reviews.apache.org/r/8104/#comment29867>
Can you instead use log.debug(..., e)?
src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java
<https://reviews.apache.org/r/8104/#comment29871>
How about a union type that contains a single data type (e.g. ["string"])?
They're currently supported.
src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java
<https://reviews.apache.org/r/8104/#comment29868>
Can you instead use log.debug(..., e)?
src/org/apache/pig/impl/util/AvroTupleWrapper.java
<https://reviews.apache.org/r/8104/#comment29869>
Can you instead use log.debug(..., e)?
- Cheolsoo Park
On Nov. 17, 2012, 5:28 a.m., Joseph Adler wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/8104/
> -----------------------------------------------------------
>
> (Updated Nov. 17, 2012, 5:28 a.m.)
>
>
> Review request for pig and Cheolsoo Park.
>
>
> Description
> -------
>
> The current AvroStorage implementation has a lot of issues: it requires old
> versions of Avro, it copies data much more than needed, and it's verbose and
> complicated. (One pet peeve of mine is that old versions of Avro don't
> support Snappy compression.)
>
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the
> new implementation is significantly faster, and the code is a lot simpler.
> Rewriting AvroStorage also enabled me to implement support for Trevni.
>
> This is the latest version of the patch, complete with test cases and
> TrevniStorage. (Test cases for TrevniStorage are still missing).
>
>
> This addresses bug PIG-3015.
> https://issues.apache.org/jira/browse/PIG-3015
>
>
> Diffs
> -----
>
> build.xml 7d468a0
> ivy.xml 70e8d50
> ivy/libraries.properties 317564f
> src/org/apache/pig/builtin/AvroStorage.java PRE-CREATION
> src/org/apache/pig/builtin/TrevniStorage.java PRE-CREATION
> src/org/apache/pig/impl/util/AvroBagWrapper.java PRE-CREATION
> src/org/apache/pig/impl/util/AvroMapWrapper.java PRE-CREATION
> src/org/apache/pig/impl/util/AvroRecordReader.java PRE-CREATION
> src/org/apache/pig/impl/util/AvroRecordWriter.java PRE-CREATION
> src/org/apache/pig/impl/util/AvroStorageDataConversionUtilities.java
> PRE-CREATION
> src/org/apache/pig/impl/util/AvroStorageSchemaConversionUtilities.java
> PRE-CREATION
> src/org/apache/pig/impl/util/AvroTupleWrapper.java PRE-CREATION
> test/commit-tests 5081fbc
> test/org/apache/pig/builtin/TestAvroStorage.java PRE-CREATION
> test/org/apache/pig/builtin/avro/code/pig/directory_test.pig PRE-CREATION
> test/org/apache/pig/builtin/avro/code/pig/identity.pig PRE-CREATION
> test/org/apache/pig/builtin/avro/code/pig/identity_ai1_ao2.pig PRE-CREATION
> test/org/apache/pig/builtin/avro/code/pig/identity_ao2.pig PRE-CREATION
> test/org/apache/pig/builtin/avro/code/pig/identity_codec.pig PRE-CREATION
> test/org/apache/pig/builtin/avro/code/pig/identity_just_ao2.pig
> PRE-CREATION
> test/org/apache/pig/builtin/avro/code/pig/recursive_tests.pig PRE-CREATION
> test/org/apache/pig/builtin/avro/createTests.bash PRE-CREATION
> test/org/apache/pig/builtin/avro/createests.py PRE-CREATION
> test/org/apache/pig/builtin/avro/data/avro/deflate/records.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/deflate/recordsAsOutputByPig.avro
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/avro/snappy/records.avro PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordWithRepeatedSubRecords.avro
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/records.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsAsOutputByPig.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsOfArrays.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsOfArraysOfRecords.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsSubSchema.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsSubSchemaNullable.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithEnums.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithFixed.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithMaps.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithMapsOfRecords.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recordsWithNullableUnions.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/avro/uncompressed/recursiveRecord.avro
> PRE-CREATION
>
> test/org/apache/pig/builtin/avro/data/json/recordWithRepeatedSubRecords.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/records.json PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/recordsAsOutputByPig.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/recordsOfArrays.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/recordsOfArraysOfRecords.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/recordsSubSchema.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/recordsSubSchemaNullable.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/recordsWithEnums.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/recordsWithFixed.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/recordsWithMaps.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/recordsWithMapsOfRecords.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/recordsWithNullableUnions.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/data/json/recursiveRecord.json
> PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recordWithRepeatedSubRecords.avsc
> PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/records.avsc PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recordsAsOutputByPig.avsc
> PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recordsOfArrays.avsc PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recordsOfArraysOfRecords.avsc
> PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recordsSubSchema.avsc PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recordsSubSchemaNullable.avsc
> PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recordsWithEnums.avsc PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recordsWithFixed.avsc PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recordsWithMaps.avsc PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recordsWithMapsOfRecords.avsc
> PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recordsWithNullableUnions.avsc
> PRE-CREATION
> test/org/apache/pig/builtin/avro/schema/recursiveRecord.avsc PRE-CREATION
> test/unit-tests 0f18a0e
>
> Diff: https://reviews.apache.org/r/8104/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Joseph Adler
>
>