[jira] [Commented] (TEZ-3113) massive increase of run time using PipelinedSorter rather than DefaultSorter
[ https://issues.apache.org/jira/browse/TEZ-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435318#comment-15435318 ] Hitesh Shah commented on TEZ-3113: -- [~mingma] [~cchepelov] any chance of updating the logs to help [~rajesh.balamohan] debug further? > massive increase of run time using PipelinedSorter rather than DefaultSorter > > > Key: TEZ-3113 > URL: https://issues.apache.org/jira/browse/TEZ-3113 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.8.2 > Environment: scalding 0.15-SNAPSHOT per > https://github.com/twitter/scalding/pull/1446 > cascading 3.1.0-wip-54 > tez-0.8.2 > OpenJDK 8 on AMD64 > Hadoop 2.6.0 (YARN, HDFS); Apache distribution > Debian Linux 8 > 8 * Intel Core i7-3770K >Reporter: Cyrille Chépélov > > While running a (fairly complex) scalding DAG that was working fine using > tez-0.6.2, now under tez-0.8.2, the run time became suddenly extremely large. > Reverting "tez.runtime.sorter.class" -> "LEGACY" restored proper behaviour. > Difficulties can be traced to this shape of code: > {code:scala} > val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset > x > .group > .mapValues(x => 1L) > .sum > .write(TypedTsvHeader("foo.tsv", ('key, 'count))) > {code} > where the incoming data contains many, many different keys. Observed > behaviour of PipelinedSorter is that several hundred thousand different files > are put flat in the same per-TezChild local temporary directories, and thing > become very slow (not alleging any causality). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3113) massive increase of run time using PipelinedSorter rather than DefaultSorter
[ https://issues.apache.org/jira/browse/TEZ-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15394807#comment-15394807 ] Ming Ma commented on TEZ-3113: -- We also ran into similar issue. The pipeline sorter setting: {noformat} 2016-07-26 18:05:48,398 [INFO] [TezChild] |impl.PipelinedSorter|: Setting up PipelinedSorter for CB70EF1A28614FED8BA2F7C7D271D11F: , UsingHashComparator=false#blocks=1, maxMemUsage=104857600, lazyAllocateMem=false, minBlockSize=2097152000, initial BLOCK_SIZE=104857600, finalMergeEnabled=true, pipelinedShuffle=false, sendEmptyPartitions=true, tez.runtime.io.sort.mb=100 {noformat} >From the call stack below, regularly a single record is larger than sort >buffer size and thus caused the sorter to spill the single record. {noformat} at cascading.tuple.hadoop.TupleSerialization.getSerializationFactory(TupleSerialization.java:296) at cascading.tuple.hadoop.TupleSerialization.getNewSerializer(TupleSerialization.java:602) at cascading.tuple.hadoop.TupleSerialization$SerializationElementWriter.getSerializer(TupleSerialization.java:1033) at cascading.tuple.hadoop.TupleSerialization$SerializationElementWriter.write(TupleSerialization.java:1013) at cascading.tuple.io.TupleOutputStream.writeElement(TupleOutputStream.java:170) at cascading.tuple.io.TupleOutputStream.writeUnTyped(TupleOutputStream.java:135) at cascading.tuple.hadoop.io.BaseSerializer.serialize(BaseSerializer.java:72) at cascading.tuple.hadoop.io.BaseSerializer.serialize(BaseSerializer.java:32) at org.apache.tez.runtime.library.common.sort.impl.IFile$Writer.append(IFile.java:277) at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.spillSingleRecord(PipelinedSorter.java:484) at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:409) - locked <0x0007682297a8> (a org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter) at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:415) - locked <0x0007682297a8> (a org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter) at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.collect(PipelinedSorter.java:415) - locked <0x0007682297a8> (a org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter) at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.write(PipelinedSorter.java:362) at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput$1.write(OrderedPartitionedKVOutput.java:163) {noformat} There are lots of entries like this in the container log. {noformat} 2016-07-26 00:06:16,279 [INFO] [TezChild] |impl.PipelinedSorter|: C1A0DD178C964C6F90553E14996CD011: done sorting span=0, length=0, time=0 2016-07-26 00:06:16,279 [INFO] [TezChild] |impl.PipelinedSorter|: C1A0DD178C964C6F90553E14996CD011: reserved.remaining()=104857600, reserved.metasize=16777216 2016-07-26 00:06:16,279 [INFO] [TezChild] |impl.PipelinedSorter|: C1A0DD178C964C6F90553E14996CD011: done sorting span=0, length=0, time=0 2016-07-26 00:06:16,279 [INFO] [TezChild] |impl.PipelinedSorter|: C1A0DD178C964C6F90553E14996CD011: reserved.remaining()=104857600, reserved.metasize=16777216 2016-07-26 00:06:16,279 [INFO] [TezChild] |impl.PipelinedSorter|: C1A0DD178C964C6F90553E14996CD011: done sorting span=0, length=0, time=0 2016-07-26 00:06:16,280 [INFO] [TezChild] |impl.PipelinedSorter|: C1A0DD178C964C6F90553E14996CD011: reserved.remaining()=104857600, reserved.metasize=16777216 2016-07-26 00:06:16,280 [INFO] [TezChild] |impl.PipelinedSorter|: C1A0DD178C964C6F90553E14996CD011: Spilling to /.../application_1466709597073_272545/output/attempt_1466709597073_272545_1_02_001204_0_10007_14/file.out, indexFilename=/.../application_1466709597073_272545/output/attempt_1466709597073_272545_1_02_001204_0_10007_14/file.out.index {noformat} > massive increase of run time using PipelinedSorter rather than DefaultSorter > > > Key: TEZ-3113 > URL: https://issues.apache.org/jira/browse/TEZ-3113 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.8.2 > Environment: scalding 0.15-SNAPSHOT per > https://github.com/twitter/scalding/pull/1446 > cascading 3.1.0-wip-54 > tez-0.8.2 > OpenJDK 8 on AMD64 > Hadoop 2.6.0 (YARN, HDFS); Apache distribution > Debian Linux 8 > 8 * Intel Core i7-3770K >Reporter: Cyrille Chépélov > > While running a (fairly complex) scalding DAG that was working fine using > tez-0.6.2, now under tez-0.8.2, the run time became suddenly extremely large. > Reverting "tez.runtime.sorter.class" -> "LEGACY" restored proper behaviour. > Difficulties can be traced to this
[jira] [Commented] (TEZ-3113) massive increase of run time using PipelinedSorter rather than DefaultSorter
[ https://issues.apache.org/jira/browse/TEZ-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15339044#comment-15339044 ] Tsuyoshi Ozawa commented on TEZ-3113: - > Observed behaviour of PipelinedSorter is that several hundred thousand > different files are put flat in the same per-TezChild local temporary > directories, and thing become very slow (not alleging any causality) I think this behaviour can cause file-system level lock contention if lots threads access. One possible solution is partitioning by adding upper limits per directory or changing directory structure. [~rajesh.balamohan] what do you think? > massive increase of run time using PipelinedSorter rather than DefaultSorter > > > Key: TEZ-3113 > URL: https://issues.apache.org/jira/browse/TEZ-3113 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.8.2 > Environment: scalding 0.15-SNAPSHOT per > https://github.com/twitter/scalding/pull/1446 > cascading 3.1.0-wip-54 > tez-0.8.2 > OpenJDK 8 on AMD64 > Hadoop 2.6.0 (YARN, HDFS); Apache distribution > Debian Linux 8 > 8 * Intel Core i7-3770K >Reporter: Cyrille Chépélov > > While running a (fairly complex) scalding DAG that was working fine using > tez-0.6.2, now under tez-0.8.2, the run time became suddenly extremely large. > Reverting "tez.runtime.sorter.class" -> "LEGACY" restored proper behaviour. > Difficulties can be traced to this shape of code: > {code:scala} > val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset > x > .group > .mapValues(x => 1L) > .sum > .write(TypedTsvHeader("foo.tsv", ('key, 'count))) > {code} > where the incoming data contains many, many different keys. Observed > behaviour of PipelinedSorter is that several hundred thousand different files > are put flat in the same per-TezChild local temporary directories, and thing > become very slow (not alleging any causality). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3113) massive increase of run time using PipelinedSorter rather than DefaultSorter
[ https://issues.apache.org/jira/browse/TEZ-3113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15142954#comment-15142954 ] Rajesh Balamohan commented on TEZ-3113: --- Is it possible to share the logs and the sort.mb settings for this job? > massive increase of run time using PipelinedSorter rather than DefaultSorter > > > Key: TEZ-3113 > URL: https://issues.apache.org/jira/browse/TEZ-3113 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.8.2 > Environment: scalding 0.15-SNAPSHOT per > https://github.com/twitter/scalding/pull/1446 > cascading 3.1.0-wip-54 > tez-0.8.2 > OpenJDK 8 on AMD64 > Hadoop 2.6.0 (YARN, HDFS); Apache distribution > Debian Linux 8 > 8 * Intel Core i7-3770K >Reporter: Cyrille Chépélov > > While running a (fairly complex) scalding DAG that was working fine using > tez-0.6.2, now under tez-0.8.2, the run time became suddenly extremely large. > Reverting "tez.runtime.sorter.class" -> "LEGACY" restored proper behaviour. > Difficulties can be traced to this shape of code: > {code:scala} > val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset > x > .group > .mapValues(x => 1L) > .sum > .write(TypedTsvHeader("foo.tsv", ('key, 'count))) > {code} > where the incoming data contains many, many different keys. Observed > behaviour of PipelinedSorter is that several hundred thousand different files > are put flat in the same per-TezChild local temporary directories, and thing > become very slow (not alleging any causality). -- This message was sent by Atlassian JIRA (v6.3.4#6332)