Re: Beam presentation materials question

2021-10-29 Thread Kenneth Knowles
I do have access. The sharing is set up so "anyone on the internet can
view" via the link
https://drive.google.com/drive/folders/0B-IhJZh9Ab52a3JLVXFWMDltcHM?resourcekey=0-VP9gZcuYGybKzNARikCEnQ=sharing
.

Kenn

On Fri, Oct 29, 2021 at 2:06 PM Udi Meiri  wrote:

> I don't have access either. It seems that link is from 2016.
> This blog post still has some valid links:
> https://beam.apache.org/blog/presentation-materials/
>
> On Thu, Oct 28, 2021 at 4:22 PM Melissa Pashniak 
> wrote:
>
>> Hi all,
>>
>> I was interested in looking at the Beam presentation materials, but the
>> embedded frame on the Apache Beam Presentation Materials page [1] gives me
>> a Google Drive "Request access" form. Does anyone know who owns the folder
>> and/or what happened to it?
>>
>> [1] https://beam.apache.org/community/presentation-materials/
>>
>>


Re: P0 (outage) report

2021-10-29 Thread Kenneth Knowles
Thank you!

On Thu, Oct 28, 2021 at 11:27 AM Kyle Weaver  wrote:

> I downgraded these to P1 and P2, respectively.
>
> On Thu, Oct 28, 2021 at 11:03 AM Beam Jira Bot  wrote:
>
>> This is your daily summary of Beam's current outages. See
>> https://beam.apache.org/contribute/jira-priorities/#p0-outage for the
>> meaning and expectations around P0 issues.
>>
>> BEAM-13137: make
>> ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes
>> deterministic (https://issues.apache.org/jira/browse/BEAM-13137)
>> BEAM-13136: Clean leftovers of old ESIO versions / test mechanism (
>> https://issues.apache.org/jira/browse/BEAM-13136)
>>
>


Flaky test issue report (30)

2021-10-29 Thread Beam Jira Bot
This is your daily summary of Beam's current flaky tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20labels%20%3D%20flake)

These are P1 issues because they have a major negative impact on the community 
and make it hard to determine the quality of the software.

https://issues.apache.org/jira/browse/BEAM-13083: XVR Flink and XVR Spark 
failing (created 2021-10-20)
https://issues.apache.org/jira/browse/BEAM-13025: 
beam_PostCommit_Java_DataflowV2 failing pubsublite.ReadWriteIT (created 
2021-10-08)
https://issues.apache.org/jira/browse/BEAM-12928: beam_PostCommit_Python36 
- CrossLanguageSpannerIOTest - flakey failing (created 2021-09-21)
https://issues.apache.org/jira/browse/BEAM-12859: 
org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
 is flaky (created 2021-09-08)
https://issues.apache.org/jira/browse/BEAM-12809: 
testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky (created 2021-08-26)
https://issues.apache.org/jira/browse/BEAM-12794: 
PortableRunnerTestWithExternalEnv.test_pardo_timers flaky (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12766: Already Exists: Dataset 
apache-beam-testing:python_bq_file_loads_NNN (created 2021-08-16)
https://issues.apache.org/jira/browse/BEAM-12540: 
beam_PostRelease_NightlySnapshot - Task 
:runners:direct-java:runMobileGamingJavaDirect FAILED (created 2021-06-25)
https://issues.apache.org/jira/browse/BEAM-12515: Python PreCommit flaking 
in PipelineOptionsTest.test_display_data (created 2021-06-18)
https://issues.apache.org/jira/browse/BEAM-12322: Python precommit flaky: 
Failed to read inputs in the data plane (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12320: 
PubsubTableProviderIT.testSQLSelectsArrayAttributes[0] failing in SQL 
PostCommit (created 2021-05-10)
https://issues.apache.org/jira/browse/BEAM-12291: 
org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: 
false] is flaky (created 2021-05-05)
https://issues.apache.org/jira/browse/BEAM-12200: 
SamzaStoreStateInternalsTest is flaky (created 2021-04-20)
https://issues.apache.org/jira/browse/BEAM-12163: Python GHA PreCommits 
flake with grpc.FutureTimeoutError on SDK harness startup (created 2021-04-13)
https://issues.apache.org/jira/browse/BEAM-12061: beam_PostCommit_SQL 
failing on KafkaTableProviderIT.testFakeNested (created 2021-03-27)
https://issues.apache.org/jira/browse/BEAM-11837: Java build flakes: 
"Memory constraints are impeding performance" (created 2021-02-18)
https://issues.apache.org/jira/browse/BEAM-11661: hdfsIntegrationTest 
flake: network not found (py38 postcommit) (created 2021-01-19)
https://issues.apache.org/jira/browse/BEAM-11641: Bigquery Read tests are 
flaky on Flink runner in Python PostCommit suites (created 2021-01-15)
https://issues.apache.org/jira/browse/BEAM-11541: 
testTeardownCalledAfterExceptionInProcessElement flakes on direct runner. 
(created 2020-12-30)
https://issues.apache.org/jira/browse/BEAM-10955: Flink Java Runner test 
flake: Could not find Flink job (FlinkJobNotFoundException) (created 2020-09-23)
https://issues.apache.org/jira/browse/BEAM-10866: 
PortableRunnerTestWithSubprocesses.test_register_finalizations flaky on macOS 
(created 2020-09-09)
https://issues.apache.org/jira/browse/BEAM-10485: Failure / flake: 
ElasticsearchIOTest > testWriteWithIndexFn (created 2020-07-14)
https://issues.apache.org/jira/browse/BEAM-9649: 
beam_python_mongoio_load_test started failing due to mismatched results 
(created 2020-03-31)
https://issues.apache.org/jira/browse/BEAM-8101: Flakes in 
ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful for 
Direct, Spark, Flink (created 2019-08-27)
https://issues.apache.org/jira/browse/BEAM-8035: 
WatchTest.testMultiplePollsWithManyResults flake: Outputs must be in timestamp 
order (sickbayed) (created 2019-08-22)
https://issues.apache.org/jira/browse/BEAM-7827: 
MetricsTest$AttemptedMetricTests.testAllAttemptedMetrics is flaky on 
DirectRunner (created 2019-07-26)
https://issues.apache.org/jira/browse/BEAM-7752: Java Validates 
DirectRunner: testTeardownCalledAfterExceptionInFinishBundleStateful flaky 
(created 2019-07-16)
https://issues.apache.org/jira/browse/BEAM-6804: [beam_PostCommit_Java] 
[PubsubReadIT.testReadPublicData] Timeout waiting on Sub (created 2019-03-11)
https://issues.apache.org/jira/browse/BEAM-5286: 
[beam_PostCommit_Java_GradleBuild][org.apache.beam.examples.subprocess.ExampleEchoPipelineTest.testExampleEchoPipeline][Flake]
 .sh script: text file busy. (created 2018-09-01)
https://issues.apache.org/jira/browse/BEAM-5172: 
org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest is flaky (created 
2018-08-20)


P1 issues report (55)

2021-10-29 Thread Beam Jira Bot
This is your daily summary of Beam's current P1 issues, not including flaky 
tests 
(https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20statusCategory%20!%3D%20Done%20AND%20priority%20%3D%20P1%20AND%20(labels%20is%20EMPTY%20OR%20labels%20!%3D%20flake).

See https://beam.apache.org/contribute/jira-priorities/#p1-critical for the 
meaning and expectations around P1 issues.

https://issues.apache.org/jira/browse/BEAM-13137: make 
ElasticsearchIO$BoundedElasticsearchSource#getEstimatedSizeBytes deterministic 
(created 2021-10-28)
https://issues.apache.org/jira/browse/BEAM-13118: Nightly build wheels jobs 
failing (created 2021-10-26)
https://issues.apache.org/jira/browse/BEAM-13098: Repeated fields not 
translated correctly in BigQuery Storage API sink (created 2021-10-21)
https://issues.apache.org/jira/browse/BEAM-13087: 
apache_beam.runners.portability.fn_api_runner.translations_test.TranslationsTest.test_run_packable_combine_globally
 'apache_beam.coders.coder_impl._AbstractIterable' object is not reversible 
(created 2021-10-20)
https://issues.apache.org/jira/browse/BEAM-13078: Python DirectRunner does 
not emit data at GC time (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13076: Python AfterAny, AfterAll 
do not follow spec (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13074: Metrics are not reported 
by the Flink runner (created 2021-10-18)
https://issues.apache.org/jira/browse/BEAM-13060: Daily Python SDK build is 
not publicly accessible (created 2021-10-15)
https://issues.apache.org/jira/browse/BEAM-13059: Migrate GKE workloads to 
Containerd (created 2021-10-15)
https://issues.apache.org/jira/browse/BEAM-13058: Upgrade Kubernetes APIs 
(created 2021-10-15)
https://issues.apache.org/jira/browse/BEAM-13056: Add method to fetch 
ProcessContext FieldAccess (created 2021-10-14)
https://issues.apache.org/jira/browse/BEAM-13025: 
beam_PostCommit_Java_DataflowV2 failing pubsublite.ReadWriteIT (created 
2021-10-08)
https://issues.apache.org/jira/browse/BEAM-13010: Delete orphaned files 
(created 2021-10-06)
https://issues.apache.org/jira/browse/BEAM-12995: Consumer group with 
random prefix (created 2021-10-04)
https://issues.apache.org/jira/browse/BEAM-12959: Dataflow error in 
CombinePerKey operation (created 2021-09-26)
https://issues.apache.org/jira/browse/BEAM-12867: Either Create or 
DirectRunner fails to produce all elements to the following transform (created 
2021-09-09)
https://issues.apache.org/jira/browse/BEAM-12843: (Broken Pipe induced) 
Bricked Dataflow Pipeline  (created 2021-09-06)
https://issues.apache.org/jira/browse/BEAM-12818: When writing to GCS, 
spread prefix of temporary files and reuse autoscaling of the temporary 
directory (created 2021-08-30)
https://issues.apache.org/jira/browse/BEAM-12807: Java creates an incorrect 
pipeline proto when core-construction-java jar is not in the CLASSPATH (created 
2021-08-26)
https://issues.apache.org/jira/browse/BEAM-12792: Beam worker only installs 
--extra_package once (created 2021-08-24)
https://issues.apache.org/jira/browse/BEAM-12766: Already Exists: Dataset 
apache-beam-testing:python_bq_file_loads_NNN (created 2021-08-16)
https://issues.apache.org/jira/browse/BEAM-12632: ElasticsearchIO: Enabling 
both User/Pass auth and SSL overwrites User/Pass (created 2021-07-16)
https://issues.apache.org/jira/browse/BEAM-12540: 
beam_PostRelease_NightlySnapshot - Task 
:runners:direct-java:runMobileGamingJavaDirect FAILED (created 2021-06-25)
https://issues.apache.org/jira/browse/BEAM-12525: SDF BoundedSource seems 
to execute significantly slower than 'normal' BoundedSource (created 2021-06-22)
https://issues.apache.org/jira/browse/BEAM-12505: codecov/patch has poor 
behavior (created 2021-06-17)
https://issues.apache.org/jira/browse/BEAM-12500: Dataflow SocketException 
(SSLException) error while trying to send message from Cloud Pub/Sub to 
BigQuery (created 2021-06-16)
https://issues.apache.org/jira/browse/BEAM-12484: JdbcIO date conversion is 
sensitive to OS (created 2021-06-14)
https://issues.apache.org/jira/browse/BEAM-12467: 
java.io.InvalidClassException With Flink Kafka (created 2021-06-09)
https://issues.apache.org/jira/browse/BEAM-12279: Implement 
destination-dependent sharding in FileIO.writeDynamic (created 2021-05-04)
https://issues.apache.org/jira/browse/BEAM-12256: 
PubsubIO.readAvroGenericRecord creates SchemaCoder that fails to decode some 
Avro logical types (created 2021-04-29)
https://issues.apache.org/jira/browse/BEAM-11959: Python Beam SDK Harness 
hangs when installing pip packages (created 2021-03-11)
https://issues.apache.org/jira/browse/BEAM-11906: No trigger early 
repeatedly for session windows (created 2021-03-01)
https://issues.apache.org/jira/browse/BEAM-11875: XmlIO.Read does not 
handle XML encoding per spec (created 2021-02-26)

Re: Question: Read IOs Fusion

2021-10-29 Thread Alexey Romanenko
If I understand correctly, with Avro predicates push-down, it will filter out 
the records on the level of Avro file reading and these records won’t even be 
processed by Beam. So, filtering is happening on a “low” level of a file reader 
and it’s very effective, especially, if you need only 22M records from 8B.

On the other hand, with fusion but without predicates pushdown, it will read 
all data with AvroIO, process it upstream of your pipeline and only then filter 
out the records. So, fusion should help with ser/deser and copying the data 
over the network from one workers to another, but it won’t help with 
eliminating of reading all this 8B of records. I’m not familiar with fusion 
implementation details on Dataflow, so maybe I’m mistaken with this, but this 
is how I understand it works.

We observed the similar results as you with ParquetIO/GenericRecords and schema 
projection/predicate push-down (ParquetIO already supports it [1]). Imho, it 
means that filtering on “low” level of your storage format is much more 
effective than on more “higher" data processing level.

There is also ongoing work to add projection pushdown across Schema-aware 
PTransforms in the Beam Java SDK [2]. So, once it will be implemented, then 
pushdown should be more transparent for users, at least, with using Beam SQL 
transforms (though, I hope with other Schema-aware PTransforms as well) and 
work automatically with some IOs that support this. 

[1] https://issues.apache.org/jira/browse/BEAM-7925 

[2] 
https://docs.google.com/document/d/1eHSO3aIsAUmiVtfDL-pEFenNBRKt26KkF4dQloMhpBQ/
 



—
Alexey


> On 29 Oct 2021, at 02:52, Kirill Panarin  wrote:
> 
> We are using DataflowRunner. The job is actually implemented in scio 
> : 
> 
> // ASchema extends SpecificRecord
> 
> object ReadFullWithFilterJob {
> 
>   def main(cmdArgs: Array[String]): Unit = {
> 
> val (sc, args) = ContextAndArgs(cmdArgs)
> val inputPath = args("input")
> 
> sc.avroFile[ASchema](inputPath)
>   .filter(x => x.field == "value")
>   .count
> 
> sc.run()
>   }
> 
> }
> 
> When translated to Beam it would look like this:
> 
> import org.apache.avro.specific.SpecificRecord;
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.io.AvroIO;
> import org.apache.beam.sdk.transforms.Count;
> import org.apache.beam.sdk.transforms.Filter;
> import org.apache.beam.sdk.values.PCollection;
> 
> public class Test {
> 
>   public static void main(String[] args) {
> Pipeline p = null;
> 
> // Read Avro-generated classes from files on GCS
> PCollection count =
> 
> p.apply(AvroIO.read(AvroAutoGenClass.class).from("gs://my_bucket/path/to/records-*.avro"))
> .apply(Filter.by(x -> x.field == "value"))
> .apply(Count.globally());
> 
>   }
> }
> So basically just AvroIO -> Filter -> Count.
> 
> The job which runs faster uses the tweaked AvroSource.readNextRecord 
> :
> 
> @Override
> public boolean readNextRecord() throws IOException {
>   if (currentRecordIndex >= numRecords) {
> return false;
>   }
>   Object record = reader.read(null, decoder);
>   boolean recordRead = false;
> 
>   while(currentRecordIndex < numRecords) {
> currentRecord =
> (mode.parseFn == null) ? ((T) record) : 
> mode.parseFn.apply((GenericRecord) record);
> 
> currentRecordIndex++;
> 
> if (mode.filterFn == null || ((SerializableFunction Boolean>)mode.filterFn).apply(currentRecord)) {
>   recordRead = true;
>   break;
> }
>   }
>   return recordRead;
> }
> 
> filterFn is provided by a user from outside and eagerly filters out records 
> e.g. (in scio):
> 
> object ReadFullWithFilterPushdownJob {
> 
>   def main(cmdArgs: Array[String]): Unit = {
> 
> val (sc, args) = ContextAndArgs(cmdArgs)
> val inputPath = args("input")
> 
> sc.customInput("predicatedAvro",
>   new AvroRead[EndContentFactXTDaily](inputPath, 
> classOf[AvroAutoGenClass],
> x => x.field == "value")
> )
> 
> sc.run()
>   }
> 
> }
> 
> On Thu, Oct 28, 2021 at 8:07 PM Reuven Lax  > wrote:
> It would also help if you shared your code.
> 
> On Thu, Oct 28, 2021 at 5:05 PM Reuven Lax  > wrote:
> Which runner are we using? Fusions (at least for non-portable pipelines) is 
> generally implemented by the runner, not by Beam.
> 
> On Thu, Oct 28, 2021 at 3:05 PM Kirill Panarin  > wrote:
> Hello !
> 
> We did some experiments with pushing filter query down to the AvroSource and 
> it significantly improved the performance of the job. The test job was 
> simple: it reads about 8B of Avro specific records and filters them down to 
> 22M