[
https://issues.apache.org/jira/browse/PIG-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Felix Gao updated PIG-1879:
---------------------------
Description:
I am using piggybank with avro_storage.patch from PIG-1748. When I load the
data using Load '/user/test/logs/avro/\*/\*' USING
org.apache.pig.piggybank.storage.avro.AvroStorage(); The sys log on the mapper
shows
2011-03-02 12:52:35,556 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2011-03-02 12:52:37,333 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
However, when I use load
'/user/test/logs/avro/{2011-03-01/*/*-23-00*,2011-03-01/*/*2011-03-01-00-00*,2011-03-01/*/*2011-03-01-01-00*,2011-03-01/*/*2011-03-01-02-00*,2011-03-01/*/*2011-03-01-03-00*,2011-03-01/*/*2011-03-01-04-00*,2011-02-28/*/*2011-02-28-05-00*,2011-02-28/*/*2011-02-28-06-00*,2011-02-28/*/*2011-02-28-07-00*,2011-02-28/*/*2011-02-28-08-00*,2011-02-28/*/*2011-02-28-09-00*,2011-02-28/*/*2011-02-28-1*-00*,2011-02-28/*/*2011-02-28-20-00*,2011-02-28/*/*2011-02-28-21-00*,2011-02-28/*/*2011-02-28-22-00*}'
The sys log on the mapper shows
2011-03-02 12:03:33,091 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2011-03-02 12:06:05,254 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
Notice it took 2 minute and 30 seconds on the initialization stage. If I cut
the number of file patterns in the glob to half. The mappers will be twice as
fast.
was:
I am using piggybank with avro_storage.patch from PIG-1748. When I load the
data using Load '/user/test/logs/avro/*/*' USING
org.apache.pig.piggybank.storage.avro.AvroStorage(); The sys log on the mapper
shows
2011-03-02 12:52:35,556 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2011-03-02 12:52:37,333 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
However, when I use load
'/user/test/logs/avro/{2011-03-01/*/*-23-00*,2011-03-01/*/*2011-03-01-00-00*,2011-03-01/*/*2011-03-01-01-00*,2011-03-01/*/*2011-03-01-02-00*,2011-03-01/*/*2011-03-01-03-00*,2011-03-01/*/*2011-03-01-04-00*,2011-02-28/*/*2011-02-28-05-00*,2011-02-28/*/*2011-02-28-06-00*,2011-02-28/*/*2011-02-28-07-00*,2011-02-28/*/*2011-02-28-08-00*,2011-02-28/*/*2011-02-28-09-00*,2011-02-28/*/*2011-02-28-1*-00*,2011-02-28/*/*2011-02-28-20-00*,2011-02-28/*/*2011-02-28-21-00*,2011-02-28/*/*2011-02-28-22-00*}'
The sys log on the mapper shows
2011-03-02 12:03:33,091 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2011-03-02 12:06:05,254 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100
Notice it took 2 minute and 30 seconds on the initialization stage. If I cut
the number of file patterns in the glob to half. The mappers will be twice as
fast.
> AvroStorage with wildcard causes extreme slow initialization
> ------------------------------------------------------------
>
> Key: PIG-1879
> URL: https://issues.apache.org/jira/browse/PIG-1879
> Project: Pig
> Issue Type: Bug
> Components: tools
> Affects Versions: 0.7.0
> Reporter: Felix Gao
>
> I am using piggybank with avro_storage.patch from PIG-1748. When I load the
> data using Load '/user/test/logs/avro/\*/\*' USING
> org.apache.pig.piggybank.storage.avro.AvroStorage(); The sys log on the
> mapper shows
> 2011-03-02 12:52:35,556 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=MAP, sessionId=
> 2011-03-02 12:52:37,333 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb =
> 100
> However, when I use load
> '/user/test/logs/avro/{2011-03-01/*/*-23-00*,2011-03-01/*/*2011-03-01-00-00*,2011-03-01/*/*2011-03-01-01-00*,2011-03-01/*/*2011-03-01-02-00*,2011-03-01/*/*2011-03-01-03-00*,2011-03-01/*/*2011-03-01-04-00*,2011-02-28/*/*2011-02-28-05-00*,2011-02-28/*/*2011-02-28-06-00*,2011-02-28/*/*2011-02-28-07-00*,2011-02-28/*/*2011-02-28-08-00*,2011-02-28/*/*2011-02-28-09-00*,2011-02-28/*/*2011-02-28-1*-00*,2011-02-28/*/*2011-02-28-20-00*,2011-02-28/*/*2011-02-28-21-00*,2011-02-28/*/*2011-02-28-22-00*}'
> The sys log on the mapper shows
> 2011-03-02 12:03:33,091 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=MAP, sessionId=
> 2011-03-02 12:06:05,254 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb =
> 100
> Notice it took 2 minute and 30 seconds on the initialization stage. If I
> cut the number of file patterns in the glob to half. The mappers will be
> twice as fast.
>
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira