Databricks - number of executors, shuffle.partitions etc

2019-05-14 Thread Rishi Shah
Hi All, How can we set Spark conf parameters in a Databricks notebook? My cluster doesn't take any spark.conf.set properties into account... it creates 8 worker nodes (that is, executors) but doesn't honor the supplied conf parameters. Any idea? -- Regards, Rishi Shah
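
A minimal sketch of the distinction that usually explains this, assuming a notebook-attached SparkSession named spark (values illustrative): SQL-level settings such as spark.sql.shuffle.partitions can be changed at runtime with spark.conf.set, whereas executor sizing (spark.executor.instances, spark.executor.memory) is fixed when the cluster starts, so setting it from a notebook has no effect; on Databricks that belongs in the cluster's Spark config instead.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Takes effect for subsequent shuffles in this session.
spark.conf.set("spark.sql.shuffle.partitions", "64")

// Accepted but ignored by an already-running cluster; executor count and size
// must be configured when the cluster is created.
spark.conf.set("spark.executor.instances", "8")

println(spark.conf.get("spark.sql.shuffle.partitions"))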

Spark job gets hung on cloudera cluster

2019-05-14 Thread Rishi Shah
Hi All, At times when there's a data node failure, a running Spark job doesn't fail - it gets stuck and doesn't return. Is there any setting that can help here? I would ideally like the job to be terminated, or the executors running on those data nodes to fail... -- Regards, Rishi Shah
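
A sketch of settings that are commonly tuned so tasks fail fast instead of hanging when a node becomes unreachable; the values below are illustrative, not a recommendation for this particular cluster.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fail-fast")
  // Give up on unreachable executors sooner (default is 120s).
  .config("spark.network.timeout", "60s")
  .config("spark.executor.heartbeatInterval", "10s")
  // Re-launch suspiciously slow tasks on other nodes.
  .config("spark.speculation", "true")
  // Fail the stage after this many task attempts instead of retrying indefinitely.
  .config("spark.task.maxFailures", "4")
  .getOrCreate()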

In spark 2.4 Upgrade Apache Arrow to version 0.12.0

2019-05-14 Thread 李斌松
Pyarrow version 0.12.1 with arrow jar version 0.10 can run correctly. Pyarrow version 0.12.1 with arrow jar version 0.12 produces this exception: > *Expected schema message in stream, was null or length 0* Pyarrow was upgraded to version 0.12.1 by another package dependency, resulting in the inconsistency

Re: Koalas show data in IDE or pyspark

2019-05-14 Thread Reynold Xin
This has been fixed and was included in release 0.3 last week. We will be making another release (0.4) in the next 24 hours to include more features as well. On Tue, Apr 30, 2019 at 12:42 AM, Manu Zhang < owenzhang1...@gmail.com > wrote: > > Hi, > > > It seems koalas.DataFrame can't be

Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-14 Thread Gary Gao
Thanks for the reply, Jean. In my project, I'm working on a higher abstraction layer on top of Spark Streaming to build a data processing product, trying to provide a common API for Java and Scala developers. You can see the abstract class defined here:

Re: BigDL and Analytics Zoo talks at upcoming Spark+AI Summit and Strata London

2019-05-14 Thread Jason Dai
The slides for the talks have been uploaded to https://analytics-zoo.github.io/master/#presentations/. Thanks, -Jason On Fri, Apr 19, 2019 at 9:55 PM Khare, Ankit wrote: > Thanks for sharing. > > Sent from my iPhone > > On 19. Apr 2019, at 01:35, Jason Dai wrote: > > Hi all, > > > Please see

Re: What is the difference for the following UDFs?

2019-05-14 Thread Qian He
Hi Jacek, Thanks for your reply. The case you provided was actually the same as my second option in my original email. What I was wondering about was the difference between those two in terms of query performance or efficiency. On Tue, May 14, 2019 at 3:51 PM Jacek Laskowski wrote: > Hi, > > For this

Re: What is the difference for the following UDFs?

2019-05-14 Thread Jacek Laskowski
Hi, For this particular case I'd use Column.substr ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column ), e.g. scala> val ns = Seq(("hello world", 1, 5)).toDF("w", "b", "e") scala> ns.select($"w".substr($"b", $"e" - $"b" + 1) as "demo").show
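
For comparison, a small sketch (assuming a spark-shell session, i.e. a SparkSession named spark in scope) of the two flavours: a Scala UDF versus the built-in Column.substr that Jacek suggests. They return the same value, but the built-in expression stays visible to the Catalyst optimizer, while the UDF is an opaque black box that also adds (de)serialization cost.

import org.apache.spark.sql.functions.udf
import spark.implicits._

val ns = Seq(("hello world", 1, 5)).toDF("w", "b", "e")

// A UDF: opaque to the optimizer.
val substrUdf = udf((s: String, b: Int, e: Int) => s.substring(b - 1, e))
ns.select(substrUdf($"w", $"b", $"e") as "demo").show()

// Built-in expression: 1-based start position and a length of e - b + 1.
ns.select($"w".substr($"b", $"e" - $"b" + 1) as "demo").show()
// Both print "hello".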

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Suket, Anastasios, Many thanks for your time and your suggestions! I tried again with various settings for the watermark and the trigger time: watermark 20 sec / trigger 2 sec, watermark 10 sec / trigger 1 sec, watermark 20 sec / trigger 0 sec. I also tried continuous processing mode, but since I

Re: Structured Streaming Kafka - Weird behavior with performance and logs

2019-05-14 Thread Suket Arora
// Continuous trigger with one-second checkpointing interval df.writeStream .format("console") .trigger(Trigger.Continuous("1 second")) .start() On Tue, 14 May 2019 at 22:10, suket arora wrote: > Hey Austin, > > If you truly want to process as a stream, use continuous streaming in >
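
For context, a self-contained sketch of a continuous-mode query; the broker address, topic names and checkpoint path here are assumptions. Note that as of Spark 2.4, continuous processing only supports map-like operations (select/filter, Kafka/rate sources), so it cannot be combined with watermarked aggregations.

import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "in")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out")
  .option("checkpointLocation", "/tmp/continuous-checkpoint")
  .trigger(Trigger.Continuous("1 second"))  // checkpoint every second
  .start()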

Re: Handling of watermark in structured streaming

2019-05-14 Thread Suket Arora
df = inputStream.withWatermark("eventtime", "20 seconds").groupBy("sharedId", window("eventtime", "20 seconds", "10 seconds")) // ProcessingTime trigger with two-second micro-batch interval df.writeStream .format("console") .trigger(Trigger.ProcessingTime("2 seconds")) .start() On Tue, 14 May

What is the difference for the following UDFs?

2019-05-14 Thread Qian He
For example, I have a dataframe with 3 columns: URL, START, END. For each url in the URL column, I want to fetch the substring of it starting at START and ending at END.

Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-14 Thread Jean Georges Perrin
There is a little bit more to it than the list you specified. Nevertheless, some data types are not directly compatible between Scala and Java and require conversion, so it's good not to pollute your code with plenty of conversions and to focus on using the straight API. I don't remember from the top
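
To make the point concrete, a small sketch (not from the thread) of the kind of conversion the Java-friendly wrappers hide: the Scala API works with Scala collections and implicit ClassTags, while JavaSparkContext/JavaRDD accept plain java.util.List and Java types.

import scala.collection.JavaConverters._
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("interop").getOrCreate()
val sc = spark.sparkContext

// Scala API: parallelize takes a Scala Seq (plus an implicit ClassTag).
val scalaRdd = sc.parallelize(Seq(1, 2, 3))

// Java-friendly API: takes a java.util.List<Integer> and returns a JavaRDD,
// sparing Java callers the Seq/ClassTag machinery.
val jsc = new JavaSparkContext(sc)
val javaRdd = jsc.parallelize(Seq(1, 2, 3).map(Int.box).asJava)

println((scalaRdd.count(), javaRdd.count()))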

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Anastasios On 5/14/19 4:15 PM, Anastasios Zouzias wrote: > Hi Joe, > > How often do you trigger your micro-batch? Maybe you can specify the trigger > time explicitly to a low value or, even better, leave it unset. > > See: >

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Suket, Sorry, this was a typo in the pseudo-code I sent. Of course, what you suggested (using the same eventtime attribute for the watermark and the window) is what my code does in reality. Sorry to confuse people. On 5/14/19 4:14 PM, suket arora wrote: > Hi Joe, > As per the spark

Why do we need Java-Friendly APIs in Spark ?

2019-05-14 Thread Gary Gao
Hi all, I am wondering why we need Java-friendly APIs in Spark. Why can't we just use the Scala APIs in Java code? What's the difference? Some examples of Java-friendly APIs commented in the Spark code are as follows: JavaDStream JavaInputDStream JavaStreamingContext JavaSparkContext

Re: Handling of watermark in structured streaming

2019-05-14 Thread Anastasios Zouzias
Hi Joe, How often do you trigger your micro-batch? Maybe you can specify the trigger time explicitly to a low value or, even better, leave it unset. See: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers Best, Anastasios On Tue, May 14, 2019 at 3:49 PM Joe
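
For reference, the trigger options from the linked guide, sketched with an illustrative rate source (assuming a SparkSession named spark); these are alternatives, not meant to run together.

import org.apache.spark.sql.streaming.Trigger

// Illustrative source so the sketch is self-contained.
val df = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

// Default (no trigger): the next micro-batch starts as soon as the previous one finishes.
df.writeStream.format("console").start()

// Fixed micro-batch interval.
df.writeStream.format("console").trigger(Trigger.ProcessingTime("1 second")).start()

// One-shot: process whatever is available, then stop.
df.writeStream.format("console").trigger(Trigger.Once()).start()

// Experimental continuous mode (map-like queries only).
df.writeStream.format("console").trigger(Trigger.Continuous("1 second")).start()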

Re: Handling of watermark in structured streaming

2019-05-14 Thread suket arora
Hi Joe, As per the Spark Structured Streaming documentation, and I quote: "withWatermark must be called on the same column as the timestamp column used in the aggregate. For example, df.withWatermark("time", "1 min").groupBy("time2").count() is invalid in Append output mode, as watermark is defined
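
The valid counterpart of that invalid example, as a sketch (df is assumed to be a streaming DataFrame with an event-time column called "time"): the aggregation has to involve the same column the watermark is defined on, typically through a window over it.

import org.apache.spark.sql.functions.{col, window}

// Invalid in Append mode: watermark on "time", but grouping only on "time2".
// df.withWatermark("time", "1 min").groupBy("time2").count()

// Valid: the window uses the watermarked column.
df.withWatermark("time", "1 min")
  .groupBy(window(col("time"), "5 minutes"))
  .count()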

Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi all, I'm fairly new to Spark Structured Streaming and I'm only starting to develop an understanding of the watermark handling. Our application reads data from a Kafka input topic and, as one of the first steps, it has to group incoming messages. Those messages come in bulks, e.g. 5 messages
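
As a rough sketch of the pattern this thread is about (the broker address, topic name and JSON field names are assumptions, reusing the sharedId/eventtime names from the later replies): read the Kafka topic, extract an event time and a correlation id, and group the bulks per id into event-time windows bounded by a watermark.

import org.apache.spark.sql.functions.{col, from_json, window}
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.Trigger

val schema = new StructType()
  .add("sharedId", StringType)
  .add("eventtime", TimestampType)
  .add("payload", StringType)

val grouped = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input-topic")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("m"))
  .select("m.*")
  // Late data older than 20 seconds (by event time) is dropped.
  .withWatermark("eventtime", "20 seconds")
  .groupBy(col("sharedId"), window(col("eventtime"), "20 seconds", "10 seconds"))
  .count()

grouped.writeStream
  .format("console")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()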