Re: read binary files (for stream reader) / spark 2.3

2019-09-09 Thread Peter Liu
...ta-support-in-apache-spark/ https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.ml.image.ImageSchema$ There's also a Spark package for Spark versions older than 2.3: https://github.com/Microsoft/spark-images

Re: read image or binary files / spark 2.3

2019-09-05 Thread Peter Liu
Hello experts, I have a quick question: which API allows me to read image files or binary files (for SparkSession.readStream) from a local/Hadoop file system in Spark 2.3? I have been browsing the following documentation and googling for it, but didn't find a good example: https://s...
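
A minimal batch sketch of the ImageSchema API linked in the reply above, assuming Spark 2.3 and the placeholder path shown; note that Spark 2.3 has no built-in streaming source for images or binary files (the binaryFile source arrived in a later release), so this reads a static DataFrame rather than a readStream:

    // Sketch: reading images as a DataFrame with Spark 2.3's ImageSchema (batch only).
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.image.ImageSchema

    val spark = SparkSession.builder().appName("image-read-sketch").master("local[*]").getOrCreate()
    // readImages returns a DataFrame with a single "image" struct column
    // (origin, height, width, nChannels, mode, data).
    val images = ImageSchema.readImages("hdfs:///data/images") // placeholder path
    images.printSchema()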

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
...ocket for local communication, or just directly read a part from the other JVM's shuffle file. But yes, it's not available in Spark out of the box. Thanks, Peter Rudenko. On Fri, Oct 19, 2018 at 16:54 Peter Liu wrote: Hi Peter, ...
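
For context on the "not available out of the box" point: Spark's shuffle implementation is pluggable through the spark.shuffle.manager setting, which is how out-of-tree shuffle plugins are wired in. A minimal sketch; the InMemoryShuffleManager class name below is hypothetical, no such plugin ships with Spark:

    // Sketch: pointing Spark at a custom shuffle manager implementation.
    // "com.example.shuffle.InMemoryShuffleManager" is a hypothetical class name.
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("in-memory-shuffle-sketch")
      .set("spark.shuffle.manager", "com.example.shuffle.InMemoryShuffleManager")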

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
...should get better performance. Thanks, Peter Rudenko. On Thu, Oct 18, 2018 at 18:07 Peter Liu wrote: I would be very interested in the initial question here: is there a production-level implementation for memory-only shuffle, and configur...

Re: Spark In Memory Shuffle / 5403

2018-10-18 Thread Peter Liu
I would be very interested in the initial question here: is there a production-level, configurable implementation of memory-only shuffle (similar to the MEMORY_ONLY and MEMORY_AND_DISK storage levels), as mentioned in this ticket: https://github.com/apache/spark/pull/5403? It would be...
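
For reference, the storage levels used as the analogy here are set per RDD or Dataset via persist(); a minimal sketch assuming an active SparkSession named spark. Note this controls caching of computed partitions, not where shuffle output lands, which is what the pull request is about:

    // Sketch: explicit storage levels -- the analogy used in the question.
    import org.apache.spark.storage.StorageLevel

    val rdd = spark.sparkContext.parallelize(1 to 1000)
    rdd.persist(StorageLevel.MEMORY_ONLY)        // deserialized partitions, memory only
    // rdd.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk when memory is short
    rdd.count()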

Re: configure yarn to use more vcores as the node provides?

2018-10-16 Thread Peter Liu
...th this parameter and check how it affects your latency... Best, Khaled. On Tue, Oct 16, 2018 at 3:06 AM Peter Liu wrote: Hi Khaled, I have attached the Spark streaming config below in (a). In the case of the 100-vcore run (...

Re: overcommit: cpus / vcores

2018-10-15 Thread Peter Liu
...mUUID()}") // TBD: RAM disk? .outputMode("update").start() (b) yarn.nodemanager.resource.cpu-vcores 110, yarn.scheduler.maximum-allocation-vcores 110. On Mon, Oct 15, 2018 at 4:26 PM Khaled Zaouk wrote: Hi Peter, what parameters are you putting...

re: overcommit: cpus / vcores

2018-10-15 Thread Peter Liu
Hi there, I have a system with 80 vcores and a relatively light Spark streaming workload. Overcommitting the vcore resource (i.e., > 80) in the config (see (a) below) seems to help improve the average Spark batch time (see (b) below). Is there any best-practice guideline on resource overcommit wit...
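
For concreteness, the overcommit described here happens on the YARN side; the two properties and the 110 value below are the ones quoted in (b) of the follow-up message above, advertised above the node's 80 physical cores (they live in yarn-site.xml):

    yarn.nodemanager.resource.cpu-vcores          110
    yarn.scheduler.maximum-allocation-vcores      110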

re: should we dump a warning if we drop batches due to window move?

2018-08-03 Thread Peter Liu
Hello there, I have a quick question for the following case. Situation: a Spark consumer is able to process 5 batches in 10 seconds (where the batch interval is zero by default; correct me if this is wrong). The window size is 10 seconds (zero-overlap sliding). There are some fluctuations in the inc...
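
A minimal Structured Streaming sketch of the setup described, a 10-second tumbling window (the slide duration defaults to the window size, so there is no overlap); the rate source stands in for the real input and the column names are assumptions:

    // Sketch: 10s tumbling (zero-overlap) event-time window.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    val spark = SparkSession.builder().appName("window-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val input = spark.readStream.format("rate").load() // columns: timestamp, value
    val counts = input
      .groupBy(window($"timestamp", "10 seconds"))     // slide omitted => tumbling window
      .count()
    counts.writeStream.outputMode("update").format("console").start().awaitTermination()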

Re: spark 2.3.1 with kafka spark-streaming-kafka-0-10 (java.lang.AbstractMethodError)

2018-07-01 Thread Peter Liu
Hello there, I didn't get any response on the user list for the following question and thought people on the dev list might be able to help: I upgraded from Spark 2.2.1 to Spark 2.3.1, ran my streaming workload, and got an error (java.lang.AbstractMethodError) I had never seen before; see the...
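
An AbstractMethodError after a Spark upgrade usually means a jar compiled against the older Spark version is still on the classpath; a hedged build.sbt sketch that pins the Kafka connector to the same 2.3.1 version as Spark itself (the version numbers match the thread, the layout is illustrative):

    // Sketch: keep the Kafka connector in lock-step with the Spark version.
    // A connector built against Spark 2.2.x on a 2.3.1 classpath is a common
    // source of AbstractMethodError.
    scalaVersion := "2.11.12"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                 % "2.3.1" % "provided",
      "org.apache.spark" %% "spark-streaming"            % "2.3.1" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.1"
    )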

re: sharing data via kafka broker using spark streaming/ AnalysisException on collect()

2018-04-30 Thread Peter Liu
Hello there, I have a quick question regarding how to share data (a small data collection) between a Kafka producer and consumer using Spark streaming (Spark 2.2): (A) the data published by a Kafka producer is received in order on the Kafka consumer side (see (a) copied below). (B) However, coll...
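
On (B): in Spark 2.2, calling collect() on a streaming DataFrame raises an AnalysisException, because queries with streaming sources must be executed with writeStream.start(). A common workaround at that version is the memory sink, which materializes the (small) result into an in-memory table; a sketch where kafkaDf and the table name "snapshot" are assumptions:

    // Sketch: materialize a small streaming result so it can be collected.
    // kafkaDf is assumed to be the streaming DataFrame read from Kafka.
    val query = kafkaDf.writeStream
      .format("memory")            // in-memory table; suitable for small data only
      .queryName("snapshot")       // placeholder table name
      .outputMode("append")
      .start()

    query.processAllAvailable()    // testing helper; blocks until input is processed
    val rows = spark.sql("SELECT * FROM snapshot").collect()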