Re: read binary files (for stream reader) / spark 2.3

2019-09-09 Thread Peter Liu
…-in-apache-spark/ https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.ml.image.ImageSchema$ There's also a Spark package for Spark versions older than 2.3: https://github.com/Microsoft/spark-images Thank you…

Re: read image or binary files / spark 2.3

2019-09-05 Thread Peter Liu
Hello experts, I have a quick question: which API allows me to read image files or binary files (for SparkSession.readStream) from a local/Hadoop file system in Spark 2.3? I have been browsing the following documentation and googling for it, and didn't find a good example or documentation:
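A minimal sketch of the batch image-reading API the linked docs describe. Assumptions: Spark 2.3 on the classpath, and a placeholder input path; note that `ImageSchema.readImages` is batch-only, and Spark 2.3 ships no built-in image or binary *streaming* source for `readStream`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.image.ImageSchema

object ReadImagesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-images")
      .master("local[*]")
      .getOrCreate()

    // Batch read of a local or HDFS directory of images into a DataFrame
    // with the standard image schema (origin, height, width, nChannels, mode, data).
    // The path below is a placeholder, not from the original thread.
    val images = ImageSchema.readImages("hdfs:///data/images")
    images.printSchema()
    spark.stop()
  }
}
```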

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
…tion, or just directly read a part from the other JVM's shuffle file. But yes, it's not available in Spark out of the box. Thanks, Peter Rudenko On Fri, 19 Oct 2018 at 16:54, Peter Liu wrote: Hi Peter, thank you for the reply and det…

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
…ld get better performance. Thanks, Peter Rudenko On Thu, 18 Oct 2018 at 18:07, Peter Liu wrote: I would be very interested in the initial question here: is there a production-level implementation for memory-only shuffle, and configurable…

Re: Spark In Memory Shuffle / 5403

2018-10-18 Thread Peter Liu
I would be very interested in the initial question here: is there a production-level implementation for memory-only shuffle, configurable (similar to the MEMORY_ONLY and MEMORY_AND_DISK storage levels), as mentioned in this ticket: https://github.com/apache/spark/pull/5403 ? It would be…
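For contrast with the ticket, a sketch of where those storage levels actually apply. Assumption: stock Spark (no patched shuffle manager). Storage levels govern cached RDDs/Datasets, not shuffle output; out of the box, shuffle files always go through the local disk configured via `spark.local.dir` (a common workaround is pointing that at a RAM-backed filesystem such as tmpfs).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object StorageLevelExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("storage-levels")
      .getOrCreate()

    val ds = spark.range(1000000L)

    // Keep cached partitions in JVM heap only; recompute on eviction.
    ds.persist(StorageLevel.MEMORY_ONLY)
    // Alternative: spill cached partitions to disk when memory is short.
    // ds.persist(StorageLevel.MEMORY_AND_DISK)

    println(ds.count()) // materializes the cache
    spark.stop()
  }
}
```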

re: yarn resource overcommit: cpu / vcores

2018-10-11 Thread Peter Liu
Hi there, is there any best-practice guideline on YARN resource overcommit with CPU/vcores, such as YARN config options, candidate cases ideal for overcommitting vcores, etc.? This slide deck below (from 2016) seems to address the memory overcommit topic and hints at a "future" topic on CPU overcommit:
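A hypothetical config fragment showing the usual knobs involved. The property names below exist in Hadoop YARN; the values are illustrative only, and whether overcommit is sensible depends on the workload mix.

```xml
<!-- yarn-site.xml: advertise more vcores per NodeManager than physical
     cores; this is the usual mechanism for CPU overcommit. -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value> <!-- e.g. 16 vcores on an 8-core box -->
</property>

<!-- capacity-scheduler.xml: CPU is only considered in scheduling when the
     DominantResourceCalculator is used; the default calculator schedules
     on memory alone, so vcore settings are otherwise ignored. -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```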

Re: [External Sender] re: streaming, batch / spark 2.2.1

2018-08-02 Thread Peter Liu
…why it's important that your throughput is higher than your input rate. If it's not, batches will become bigger and bigger and take longer and longer until the application fails. On Thu, Aug 2, 2018 at 2:43 PM Peter Liu wrote: Hello there, …

re: streaming, batch / spark 2.2.1

2018-08-02 Thread Peter Liu
Hello there, I'm new to Spark streaming and have trouble understanding Spark batch "composition" (a Google search keeps giving me the older Spark Streaming concept). I would appreciate any help and clarification. I'm using Spark 2.2.1 for a streaming workload (see the quoted code in (a) below). The…
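A sketch of how micro-batches are "composed" in Structured Streaming (the newer model, as opposed to the older DStream batch-interval concept): each trigger fires a micro-batch containing whatever input arrived since the previous one, optionally capped per source. The broker address and topic name below are placeholders, not from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object MicroBatchExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("micro-batch").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "events")                        // placeholder topic
      .option("maxOffsetsPerTrigger", 10000)  // cap records pulled per micro-batch
      .load()

    val query = df.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds")) // fire a micro-batch every 10s
      .start()

    query.awaitTermination()
  }
}
```

If the work per micro-batch takes longer than the trigger interval, batches queue up, which is why sustained throughput must stay above the input rate.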

Re: spark 2.3.1 with kafka spark-streaming-kafka-0-10 (java.lang.AbstractMethodError)

2018-06-28 Thread Peter Liu
Hello there, I just upgraded to Spark 2.3.1 from Spark 2.2.1, ran my streaming workload, and got an error (java.lang.AbstractMethodError) I had never seen before; check the error stack attached in (a) below. Does anyone know if Spark 2.3.1 works well with Kafka spark-streaming-kafka-0-10? This link…
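A `java.lang.AbstractMethodError` right after a Spark upgrade typically means a connector compiled against the old Spark version is still on the classpath. A hypothetical build.sbt fragment showing the usual fix, keeping every Spark artifact on one version (the version and scope choices are illustrative):

```scala
val sparkVersion = "2.3.1"

libraryDependencies ++= Seq(
  // Cluster-provided core artifacts: mark Provided so the fat jar does not
  // bundle a second, mismatched copy of Spark.
  "org.apache.spark" %% "spark-sql"                  % sparkVersion % Provided,
  "org.apache.spark" %% "spark-streaming"            % sparkVersion % Provided,
  // The Kafka connector must match the Spark version exactly.
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
  // For Structured Streaming use this instead:
  // "org.apache.spark" %% "spark-sql-kafka-0-10"    % sparkVersion
)
```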

re: streaming - kafka partition transition time (from state change logger)

2018-06-11 Thread Peter Liu
Hi there, working on the streaming processing latency based on timestamps from Kafka, I have two quick general questions triggered by looking at the Kafka state-change log file: (a) the partition state change from the *OfflineReplica* state to the *OnlinePartition* state seems to take more than 20…

Re: help with streaming batch interval question needed

2018-05-25 Thread Peter Liu
…//about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Thu, May 24, …

re: help with streaming batch interval question needed

2018-05-24 Thread Peter Liu
Hi there, from the Apache Spark Streaming website (see links below): - the batch interval is set when a Spark StreamingContext is constructed (see example (a) quoted below) - the StreamingContext is available in both older and newer Spark versions (v1.6, v2.2 to v2.3.0) (see…
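A minimal sketch of the DStream construction the thread refers to (the API is unchanged from 1.6 through 2.3). The batch interval is fixed at construction time and cannot be changed without stopping and recreating the context; the socket source and port below are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("batch-interval").setMaster("local[2]")

    // The second argument is the batch interval: every 5 seconds the
    // received data is packaged into one batch (an RDD) and processed.
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```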

Re: Advice on multiple streaming job

2018-05-08 Thread Peter Liu
Hi Dhaval, I'm using the YARN scheduler (without needing to specify the port in the submit). Not sure why the port issue occurs here. Gerard seems to have a good point about managing the multiple topics within your application (to avoid the port issue). Not sure if you're using Spark Streaming or…

re: spark streaming / AnalysisException on collect()

2018-04-30 Thread Peter Liu
Hello there, I have a quick question about how to share data (a small data collection) between a Kafka producer and consumer using Spark streaming (Spark 2.2): (A) the data published by the Kafka producer is received in order on the Kafka consumer side (see (a) copied below). (B) however, …
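A sketch of the likely cause of the AnalysisException in the subject line: eager actions such as `collect()` are not supported on a streaming DataFrame ("Queries with streaming sources must be executed with writeStream.start()"); the data has to go through a sink instead. Broker and topic names below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object StreamingSinkExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-sink").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "events")                        // placeholder topic
      .load()

    // WRONG on a streaming DataFrame: df.collect() -> AnalysisException.
    // RIGHT: write to a sink; the memory sink keeps results queryable
    // from the driver via an in-memory table.
    val query = df.selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("memory")
      .queryName("received") // inspect with spark.sql("SELECT * FROM received")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```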