Re: Lazy Spark Structured Streaming

2020-08-02 Thread Phillip Henry
> …behavior on OutputMode.Append you would need to add a dummy record to move the watermark forward. > Hope this helps. > Thanks, > Jungtaek Lim (HeartSaVioR) > On Mon, Jul 27, 2020 at 8:10 PM Phillip Henry wrote: >> Sorry, should have mentioned that Spa…
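
A minimal sketch of the dummy-record idea quoted above: publish one extra event whose timestamp is ahead of the watermark delay so Spark considers the earlier windows complete under OutputMode.Append. The broker address, topic name and payload are assumptions for illustration, not details from the thread.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.kafka.common.serialization.StringSerializer

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")               // assumed broker
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    // The dummy event's timestamp is later than anything already queued, so the
    // watermark can advance past the windows you are waiting to see emitted.
    val dummy = s"""{"id":"dummy","eventTime":"${java.time.Instant.now()}"}"""
    producer.send(new ProducerRecord("events", "dummy", dummy)).get()   // assumed topic "events"
    producer.close()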

Re: Lazy Spark Structured Streaming

2020-07-27 Thread Phillip Henry
…struggling. Can anybody please help? How do people test their SSS code if you have to put a message on Kafka to get Spark to consume a batch? Kind regards, Phillip On Sun, Jul 12, 2020 at 4:55 PM Phillip Henry wrote: > Hi, folks. > I noticed that SSS won't process a wait…

Lazy Spark Structured Streaming

2020-07-12 Thread Phillip Henry
Hi, folks. I noticed that SSS won't process a waiting batch if there are no batches after that. To put it another way, Spark must always leave one batch on Kafka waiting to be consumed. There is a JIRA for this at: https://issues.apache.org/jira/browse/SPARK-24156 that says it's resolved in

Re: OptionalDataException in spark

2019-12-14 Thread Phillip Henry
Strongly suspect you're mutating an object at the point in time it is serialized. I suggest you remove all mutation from your code. HTH. Phillip On Fri, Dec 13, 2019 at 7:19 PM Ravi Aggarwal wrote: > Hi, > We are encountering java OptionalDataException in one of our spark jobs. >
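
For illustration only (this is not the code from the thread): the kind of race that can corrupt a Java-serialized stream is another thread mutating an object while Spark serializes the closure that captures it; taking an immutable snapshot first avoids it. Assumes an active SparkContext `sc`, as in spark-shell.

    import scala.collection.mutable
    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global

    val lookup = mutable.Map("a" -> 1)

    // Risky: the closure captures a mutable map that another thread keeps updating;
    // serializing it mid-update can surface as java.io.OptionalDataException on read.
    Future { (2 to 1000).foreach(i => lookup += (s"k$i" -> i)) }
    val risky = sc.parallelize(1 to 10).map(i => lookup.size + i).collect()

    // Safer: freeze an immutable snapshot before handing it to Spark.
    val frozen: Map[String, Int] = lookup.toMap
    val safe = sc.parallelize(1 to 10).map(i => frozen.size + i).collect()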

Re: GraphX performance feedback

2019-11-28 Thread Phillip Henry
I saw a large improvement in my GraphX processing by: - using fewer partitions - using fewer executors but with much more memory. YMMV. Phillip On Mon, 25 Nov 2019, 19:14 mahzad kalantari, wrote: > Thanks for your answer, my use case is friend recommendation for 200 million profiles. > Le
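
A sketch of the two knobs mentioned above; the executor count, memory size and partition number are illustrative rather than values from the thread, and on some cluster managers they must be set at submit time instead.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.graphx.PartitionStrategy

    val spark = SparkSession.builder()
      .appName("graphx-tuning")
      .config("spark.executor.instances", "4")   // fewer executors...
      .config("spark.executor.memory", "32g")    // ...but much more memory each
      .getOrCreate()

    // With a Graph[VD, ED] called `graph` built elsewhere, cut the partition count:
    // val tuned = graph.partitionBy(PartitionStrategy.EdgePartition2D, numPartitions = 32)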

Re: Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Phillip Henry
Hi, Arwin. If I understand you correctly, this is totally expected behaviour. I don't know much about saving to S3 but maybe you could write to HDFS first then copy everything to S3? I think the write to HDFS will probably be much faster as Spark/HDFS will write locally or to a machine on the
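
A sketch of the land-on-HDFS-first idea, assuming a hypothetical customer_id bucketing column and illustrative paths; repartitioning by the bucketing column before the write is one common way to keep the file count down, though that detail is not stated in the thread.

    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("bucketed-write").enableHiveSupport().getOrCreate()
    val df = spark.read.parquet("hdfs:///data/input")

    // Repartition by the bucketing column so each bucket is written by few tasks,
    // which keeps the number of output files manageable.
    df.repartition(64, col("customer_id"))
      .write
      .mode(SaveMode.Overwrite)
      .bucketBy(64, "customer_id")
      .sortBy("customer_id")
      .option("path", "hdfs:///warehouse/bucketed_events")   // write to HDFS first
      .saveAsTable("bucketed_events")

    // ...then copy to S3 out of band, e.g.:
    //   hadoop distcp hdfs:///warehouse/bucketed_events s3a://my-bucket/bucketed_events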

Re: Data growth vs Cluster Size planning

2019-02-12 Thread Phillip Henry
Too little information to give an answer, if indeed an answer a priori is possible. However, I would do the following on your test instances: - Run jstat -gc on all your nodes. It might be that the GC is taking a lot of time. - Poll with jstack semi-frequently. I can give you a fairly good idea

Re: structured streaming handling validation and json flattening

2019-02-12 Thread Phillip Henry
Hi, I'm in a somewhat similar situation. Here's what I do (it seems to be working so far): 1. Stream in the JSON as a plain string. 2. Feed this string into a JSON library to validate it (I use Circe). 3. Using the same library, parse the JSON and extract fields X, Y and Z. 4. Create a dataset
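
A minimal sketch of steps 1-4 above; the Kafka settings, the field names X/Y/Z and their types are assumptions for illustration.

    import io.circe.parser.parse
    import org.apache.spark.sql.SparkSession

    case class Extracted(x: String, y: Long, z: Double)

    val spark = SparkSession.builder().appName("json-validation").getOrCreate()
    import spark.implicits._

    // 1. Stream the raw JSON in as a plain string.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")
      .as[String]

    // 2-4. Validate with Circe, extract the fields and keep only the records that parse.
    val extracted = raw.flatMap { json =>
      parse(json).toOption.flatMap { doc =>      // 2. reject malformed JSON
        val c = doc.hcursor
        for {                                    // 3. pull out X, Y and Z
          x <- c.get[String]("x").toOption
          y <- c.get[Long]("y").toOption
          z <- c.get[Double]("z").toOption
        } yield Extracted(x, y, z)               // 4. rows of a typed Dataset
      }
    }
    // Attach a writeStream sink (console, parquet, ...) to actually run the query.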

Re: Reading as Parquet a directory created by Spark Structured Streaming - problems

2019-01-11 Thread Phillip Henry
Hi, Denis. It should be a String. Even if it looks like a number when you do hadoop fs -ls ..., it's a String representation of a date/time. Phillip On Thu, Jan 10, 2019 at 2:00 PM ddebarbieux wrote: > scala> spark.read.schema(StructType(Seq(StructField("_1",StringType,false), >
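
A minimal sketch of the point being made: declare the partition column as StringType even when the directory names look numeric. Assumes an active SparkSession `spark` and an illustrative path.

    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("_1", StringType, nullable = false),   // partition column: a String date/time
      StructField("_2", StringType, nullable = true)
    ))

    val df = spark.read.schema(schema).parquet("hdfs:///streaming/output")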

Reading as Parquet a directory created by Spark Structured Streaming - problems

2019-01-09 Thread Phillip Henry
Hi, I write a stream of (String, String) tuples to HDFS partitioned by the first ("_1") member of the pair. Everything looks great when I list the directory via "hadoop fs -ls ...". However, when I try to read all the data as a single dataframe, I get unexpected results (see below). I notice

Using checkpoint much, much faster than cache. Why?

2018-06-05 Thread Phillip Henry
Hi, folks. I am using Spark 2.2.0 and a combination of Spark ML's LinearSVC and OneVsRest to classify some documents that are originally read from HDFS using sc.wholeTextFiles. When I use it on small documents, I get the results in a few minutes. When I use it on the same number of large
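
For context, a sketch of the two approaches being compared (not the original job): checkpoint() eagerly materializes the Dataset to the checkpoint directory and truncates its lineage, while cache() keeps the rows in executor memory (spilling as needed), which can thrash with very large documents. Paths are illustrative.

    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

    // Stand-in for the expensive upstream pipeline over the wholeTextFiles output.
    val features = spark.read.parquet("hdfs:///data/doc-features")

    val cached       = features.cache()        // rows held in executor memory
    val checkpointed = features.checkpoint()   // eagerly written to HDFS, lineage cut

    checkpointed.count()                       // later actions reread the checkpoint files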

Re: Multiple transformations without recalculating or caching

2017-11-19 Thread Phillip Henry
A back-of-a-beermat calculation says if you have, say, 20 boxes, saving 1TB should take approximately 15 minutes (with a replication factor of 1 since you don't need it higher for ephemeral data that is relatively easy to generate). This isn't much if the whole job takes hours. You get the added
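
A sketch of persisting ephemeral intermediate data with HDFS replication dropped to 1, as suggested above. The paths and the stand-in pipeline are illustrative, and whether dfs.replication set on the job's Hadoop configuration is honoured can depend on your cluster setup.

    // Ask HDFS for a replication factor of 1 on files this job writes.
    spark.sparkContext.hadoopConfiguration.set("dfs.replication", "1")

    // Stand-in for the expensive part of the job whose output you want to reuse.
    val intermediate = spark.read.parquet("hdfs:///data/raw").groupBy("key").count()

    intermediate.write.parquet("hdfs:///tmp/intermediate")   // ~1 TB landed once, not 3x

    // Later stages read it back instead of recomputing the transformations.
    val reread = spark.read.parquet("hdfs:///tmp/intermediate")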

Re: Spark based Data Warehouse

2017-11-12 Thread Phillip Henry
> …disk, I might be wrong but not all functionality of Spark is spill to disk. So it still doesn't provide DB-like reliability in execution. In case of DBs, queries get slow but they don't fail or go out of memory, specifically in concurrent user scenarios. > Regards, > Ashish

Re: Spark based Data Warehouse

2017-11-12 Thread Phillip Henry
Agree with Jorn. The answer is: it depends. In the past, I've worked with data scientists who are happy to use the Spark CLI. Again, the answer is "it depends" (in this case, on the skills of your customers). Regarding sharing resources, different teams were limited to their own queue so they

Re: Matrix multiplication and cluster / partition / blocks configuration

2017-08-08 Thread Phillip Henry
Hi, John. I've had similar problems. IIRC, the driver was GCing madly. I don't know why the driver was doing so much work but I quickly implemented an alternative approach. The code I wrote belongs to my client but I wrote something that should be equivalent. It can be found at:
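
The code referred to above is not reproduced here (the link is truncated); purely as a generic illustration, MLlib's BlockMatrix keeps the multiplication distributed in blocks so no single JVM, the driver included, has to hold the whole product. The block sizes and toy entries are assumptions, and `sc` is assumed to be an active SparkContext.

    import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

    // Toy entries standing in for RDD[MatrixEntry] built from real data.
    val entriesA = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0)))
    val entriesB = sc.parallelize(Seq(MatrixEntry(0, 1, 3.0), MatrixEntry(1, 0, 4.0)))

    val a: BlockMatrix = new CoordinateMatrix(entriesA).toBlockMatrix(1024, 1024).cache()
    val b: BlockMatrix = new CoordinateMatrix(entriesB).toBlockMatrix(1024, 1024).cache()

    a.validate()                      // sanity-check block dimensions
    val product = a.multiply(b)       // block-by-block distributed multiply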

Re: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

2016-11-18 Thread Phillip Henry
Looks like a standard "not enough memory" issue. I can only recommend the usual advice of increasing the number of partitions to give you a quick-win. Also, your JVMs have an enormous amount of memory. This may cause long GC pause times. You might like to try reducing the memory to about 20gb and
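
A sketch of the two suggestions above expressed as configuration; the exact values are illustrative and depend on your cluster.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shuffle-tuning")
      .config("spark.executor.memory", "20g")           // smaller heaps, shorter GC pauses
      .config("spark.sql.shuffle.partitions", "2000")   // more, smaller shuffle partitions
      .getOrCreate()

    // For raw RDD shuffles, pass the partition count explicitly, e.g.:
    //   rdd.reduceByKey(_ + _, numPartitions = 2000)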

Distributed matrices with column counts represented by Int (rather than Long)

2016-05-26 Thread Phillip Henry
Hi, I notice that some DistributedMatrix implementations represent the number of columns with an Int rather than a Long (RowMatrix etc). This limits the number of columns to about 2 billion. We're approaching that limit. What do people recommend we do to mitigate the problem? Are there plans to use a larger