Hi Ryan,

When I set both the read schema and the requested projection to the schema
with only 4 fields (i.e., a subset of the file schema,
Profile.getClassSchema()), I got the following error:

14/12/04 11:48:01 INFO mapred.JobClient: Task Id :
attempt_201410141621_22583_m_000000_1, Status : FAILED
parquet.io.ParquetDecodingException: Can not read value at 0 in block 0 in
file hdfs://had.ca:9000/tmp/avro/2014_10_14/part-00000
        at
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
        at
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
        at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
        at
org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.lang.Cl
attempt_201410141621_22583_m_000000_2: Dec 4, 2014 11:47:56 AM INFO:
parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will
read a total of 1000001 records.
attempt_201410141621_22583_m_000000_2: Dec 4, 2014 11:47:56 AM INFO:
parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
attempt_201410141621_22583_m_000000_2: Dec 4, 2014 11:47:56 AM INFO:
parquet.hadoop.InternalParquetRecordReader: block read in memory in 338 ms.
row count = 603147
attempt_201410141621_22583_m_000000_2: SLF4J: Failed to load class
"org.slf4j.impl.StaticLoggerBinder".
attempt_201410141621_22583_m_000000_2: SLF4J: Defaulting to no-operation
(NOP) logger implementation
attempt_201410141621_22583_m_000000_2: SLF4J: See
http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
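
For reference, this is roughly how I'm configuring the two schemas
(simplified; the field names in the projection below are placeholders, not
the actual Profile fields):

```java
// Sketch of the setup described above, assuming the parquet.avro 1.x API.
// The 4-field projection here uses placeholder field names.
import org.apache.avro.Schema;
import org.apache.hadoop.mapreduce.Job;
import parquet.avro.AvroParquetInputFormat;

public class ProjectionSetup {
    public static void configure(Job job) {
        // A 4-field subset of the full (13-field) file schema.
        Schema projection = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Profile\"," +
            "\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"long\"}," +
            "{\"name\":\"name\",\"type\":\"string\"}," +
            "{\"name\":\"email\",\"type\":\"string\"}," +
            "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Both the requested projection (which columns are read from the
        // file) and the read schema (what records are deserialized into)
        // are set to the same 4-field subset.
        AvroParquetInputFormat.setRequestedProjection(job, projection);
        AvroParquetInputFormat.setAvroReadSchema(job, projection);
    }
}
```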

I am wondering whether I set the schemas correctly. Could you give me some
suggestions?

Thanks,
Yan



On Thu, Dec 4, 2014 at 10:56 AM, Ryan Blue <[email protected]> wrote:

> On 12/04/2014 10:43 AM, Yan Qi wrote:
>
>> Hi Ryan,
>>
>> Thanks for your quick reply!
>>
>> Probably you're right. The projected schema has 4 columns (out of 13 in
>> the
>> read schema). If that's the problem, how does the read schema get
>> FILTERED?
>> I thought the read schema should be always the same as the file schema
>> (i.e., Profile.getClassSchema()), right?
>>
>> Thanks,
>> Yan
>>
>
> The read schema is the schema that your application expects. If you rely
> on 4 data fields in your application, then your read schema should reflect
> that. The reason why the read and projection schemas are separate is that
> you might want to load 4 columns of data, but the object you're using has
> more fields that you'll just ignore. In that case, you don't mind that
> those are set to default values instead of data values.
>
> I actually think we need to fix how this works and derive the projection
> schema from the read schema and the file schema. That way, columns you do
> want wouldn't be filled in with defaults, and every column in your read
> schema would be read from the file whenever it exists there.
>
>
> rb
>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
