Databricks - number of executors, shuffle.partitions etc

2019-05-14 Thread Rishi Shah
Hi All, How can we set Spark conf parameters in a Databricks notebook? My cluster doesn't take any spark.conf.set properties into account... it creates 8 worker nodes (that is, executors) but doesn't honor the supplied conf parameters. Any idea? -- Regards, Rishi Shah
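
A minimal sketch of the distinction that usually explains this, assuming a notebook-attached SparkSession named spark (values illustrative): SQL-level settings such as spark.sql.shuffle.partitions can be changed at runtime with spark.conf.set, whereas executor sizing (spark.executor.instances, spark.executor.memory) is fixed when the cluster starts, so setting it from a notebook has no effect; on Databricks that belongs in the cluster's Spark config instead.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Takes effect for subsequent shuffles in this session.
spark.conf.set("spark.sql.shuffle.partitions", "64")

// Accepted but ignored by an already-running cluster; executor count and size
// must be configured when the cluster is created.
spark.conf.set("spark.executor.instances", "8")

println(spark.conf.get("spark.sql.shuffle.partitions"))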

Spark job gets hung on cloudera cluster

2019-05-14 Thread Rishi Shah
Hi All, At times when there's a data node failure, a running Spark job doesn't fail - it gets stuck and doesn't return. Is there any setting that can help here? I would ideally like the job to be terminated, or the executors running on those data nodes to fail... -- Regards, Rishi Shah
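
A sketch of settings that are commonly tuned so tasks fail fast instead of hanging when a node becomes unreachable; the values below are illustrative, not a recommendation for this particular cluster.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fail-fast")
  // Give up on unreachable executors sooner (default is 120s).
  .config("spark.network.timeout", "60s")
  .config("spark.executor.heartbeatInterval", "10s")
  // Re-launch suspiciously slow tasks on other nodes.
  .config("spark.speculation", "true")
  // Fail the stage after this many task attempts instead of retrying indefinitely.
  .config("spark.task.maxFailures", "4")
  .getOrCreate()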

In spark 2.4 Upgrade Apache Arrow to version 0.12.0

2019-05-14 Thread 李斌松
Pyarrow version 0.12.1 with arrow jar version 0.10 can run correctly. Pyarrow version 0.12.1 with arrow jar version 0.12 produces this exception: > *Expected schema message in stream, was null or length 0* Pyarrow was upgraded to version 0.12.1 by another package dependency, resulting in the inconsistency

Re: Koalas show data in IDE or pyspark

2019-05-14 Thread Reynold Xin
This has been fixed and was included in release 0.3 last week. We will be making another release (0.4) in the next 24 hours to include more features as well. On Tue, Apr 30, 2019 at 12:42 AM, Manu Zhang < owenzhang1...@gmail.com > wrote: > > Hi, > > > It seems koalas.DataFrame can't be

Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-14 Thread Gary Gao
Thanks for the reply, Jean. In my project, I'm working on a higher abstraction layer on top of Spark Streaming to build a data processing product, trying to provide a common API for Java and Scala developers. You can see the abstract class defined here:

Re: BigDL and Analytics Zoo talks at upcoming Spark+AI Summit and Strata London

2019-05-14 Thread Jason Dai
The slides for the talks have been uploaded to https://analytics-zoo.github.io/master/#presentations/. Thanks, -Jason On Fri, Apr 19, 2019 at 9:55 PM Khare, Ankit wrote: > Thanks for sharing. > > Sent from my iPhone > > On 19. Apr 2019, at 01:35, Jason Dai wrote: > > Hi all, > > > Please see

Re: What is the difference for the following UDFs?

2019-05-14 Thread Qian He
Hi Jacek, Thanks for your reply. The case you provided was actually the same as my second option in my original email. What I was wondering about was the difference between those two in terms of query performance or efficiency. On Tue, May 14, 2019 at 3:51 PM Jacek Laskowski wrote: > Hi, > > For this

Re: What is the difference for the following UDFs?

2019-05-14 Thread Jacek Laskowski
Hi, For this particular case I'd use Column.substr ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column ), e.g. scala> val ns = Seq(("hello world", 1, 5)).toDF("w", "b", "e") scala> ns.select($"w".substr($"b", $"e" - $"b" + 1) as "demo").show
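
For comparison, a small sketch (assuming a spark-shell session, i.e. a SparkSession named spark in scope) of the two flavours: a Scala UDF versus the built-in Column.substr that Jacek suggests. They return the same value, but the built-in expression stays visible to the Catalyst optimizer, while the UDF is an opaque black box that also adds (de)serialization cost.

import org.apache.spark.sql.functions.udf
import spark.implicits._

val ns = Seq(("hello world", 1, 5)).toDF("w", "b", "e")

// A UDF: opaque to the optimizer.
val substrUdf = udf((s: String, b: Int, e: Int) => s.substring(b - 1, e))
ns.select(substrUdf($"w", $"b", $"e") as "demo").show()

// Built-in expression: 1-based start position and a length of e - b + 1.
ns.select($"w".substr($"b", $"e" - $"b" + 1) as "demo").show()
// Both print "hello".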

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Suket, Anastasios, Many thanks for your time and your suggestions! I tried again with various settings for the watermark and the trigger time: watermark 20 sec / trigger 2 sec, watermark 10 sec / trigger 1 sec, watermark 20 sec / trigger 0 sec. I also tried continuous processing mode, but since I

Re: Structured Streaming Kafka - Weird behavior with performance and logs

2019-05-14 Thread Suket Arora
// Continuous trigger with one-second checkpointing interval df.writeStream .format("console") .trigger(Trigger.Continuous("1 second")) .start() On Tue, 14 May 2019 at 22:10, suket arora wrote: > Hey Austin, > > If you truly want to process as a stream, use continuous streaming in >
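
For context, a self-contained sketch of a continuous-mode query; the broker address, topic names and checkpoint path here are assumptions. Note that as of Spark 2.4, continuous processing only supports map-like operations (select/filter, Kafka/rate sources), so it cannot be combined with watermarked aggregations.

import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "in")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out")
  .option("checkpointLocation", "/tmp/continuous-checkpoint")
  .trigger(Trigger.Continuous("1 second"))  // checkpoint every second
  .start()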

Re: Handling of watermark in structured streaming

2019-05-14 Thread Suket Arora
df = inputStream.withWatermark("eventtime", "20 seconds").groupBy("sharedId", window("eventtime", "20 seconds", "10 seconds")) // ProcessingTime trigger with two-second micro-batch interval df.writeStream .format("console") .trigger(Trigger.ProcessingTime("2 seconds")) .start() On Tue, 14 May

What is the difference for the following UDFs?

2019-05-14 Thread Qian He
For example, I have a dataframe with 3 columns: URL, START, END. For each url in the URL column, I want to fetch the substring of it starting at START and ending at END.

Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-14 Thread Jean Georges Perrin
There is a little bit more to it than the list you specified. Nevertheless, some data types are not directly compatible between Scala and Java and require conversion, so it's good not to pollute your code with plenty of conversions and to focus on using the straight API. I don't remember from the top
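
To make the point concrete, a small sketch (not from the thread) of the kind of conversion the Java-friendly wrappers hide: the Scala API works with Scala collections and implicit ClassTags, while JavaSparkContext/JavaRDD accept plain java.util.List and Java types.

import scala.collection.JavaConverters._
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("interop").getOrCreate()
val sc = spark.sparkContext

// Scala API: parallelize takes a Scala Seq (plus an implicit ClassTag).
val scalaRdd = sc.parallelize(Seq(1, 2, 3))

// Java-friendly API: takes a java.util.List<Integer> and returns a JavaRDD,
// sparing Java callers the Seq/ClassTag machinery.
val jsc = new JavaSparkContext(sc)
val javaRdd = jsc.parallelize(Seq(1, 2, 3).map(Int.box).asJava)

println((scalaRdd.count(), javaRdd.count()))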

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Anastasios On 5/14/19 4:15 PM, Anastasios Zouzias wrote: > Hi Joe, > > How often do you trigger your micro-batch? Maybe you can specify the trigger > time explicitly to a low value or, even better, leave it unset. > > See: >

Re: Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi Suket, Sorry, this was a typo in the pseudo-code I sent. Of course, what you suggested (using the same eventtime attribute for the watermark and the window) is what my code does in reality. Sorry to confuse people. On 5/14/19 4:14 PM, suket arora wrote: > Hi Joe, > As per the spark

Why do we need Java-Friendly APIs in Spark ?

2019-05-14 Thread Gary Gao
Hi all, I am wondering why we need Java-friendly APIs in Spark. Why can't we just use the Scala APIs in Java code? What's the difference? Some examples of Java-friendly APIs commented in the Spark code are as follows: JavaDStream JavaInputDStream JavaStreamingContext JavaSparkContext

Re: Handling of watermark in structured streaming

2019-05-14 Thread Anastasios Zouzias
Hi Joe, How often do you trigger your micro-batch? Maybe you can specify the trigger time explicitly to a low value or, even better, leave it unset. See: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers Best, Anastasios On Tue, May 14, 2019 at 3:49 PM Joe
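
For reference, the trigger options from the linked guide, sketched with an illustrative rate source (assuming a SparkSession named spark); these are alternatives, not meant to run together.

import org.apache.spark.sql.streaming.Trigger

// Illustrative source so the sketch is self-contained.
val df = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

// Default (no trigger): the next micro-batch starts as soon as the previous one finishes.
df.writeStream.format("console").start()

// Fixed micro-batch interval.
df.writeStream.format("console").trigger(Trigger.ProcessingTime("1 second")).start()

// One-shot: process whatever is available, then stop.
df.writeStream.format("console").trigger(Trigger.Once()).start()

// Experimental continuous mode (map-like queries only).
df.writeStream.format("console").trigger(Trigger.Continuous("1 second")).start()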

Re: Handling of watermark in structured streaming

2019-05-14 Thread suket arora
Hi Joe, As per the Spark Structured Streaming documentation, and I quote: "withWatermark must be called on the same column as the timestamp column used in the aggregate. For example, df.withWatermark("time", "1 min").groupBy("time2").count() is invalid in Append output mode, as watermark is defined
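
The valid counterpart of that invalid example, as a sketch (df is assumed to be a streaming DataFrame with an event-time column called "time"): the aggregation has to involve the same column the watermark is defined on, typically through a window over it.

import org.apache.spark.sql.functions.{col, window}

// Invalid in Append mode: watermark on "time", but grouping only on "time2".
// df.withWatermark("time", "1 min").groupBy("time2").count()

// Valid: the window uses the watermarked column.
df.withWatermark("time", "1 min")
  .groupBy(window(col("time"), "5 minutes"))
  .count()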

Handling of watermark in structured streaming

2019-05-14 Thread Joe Ammann
Hi all, I'm fairly new to Spark Structured Streaming and I'm only starting to develop an understanding of the watermark handling. Our application reads data from a Kafka input topic and, as one of the first steps, it has to group incoming messages. Those messages come in bulks, e.g. 5 messages
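
As a rough sketch of the pattern this thread is about (the broker address, topic name and JSON field names are assumptions, reusing the sharedId/eventtime names from the later replies): read the Kafka topic, extract an event time and a correlation id, and group the bulks per id into event-time windows bounded by a watermark.

import org.apache.spark.sql.functions.{col, from_json, window}
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.Trigger

val schema = new StructType()
  .add("sharedId", StringType)
  .add("eventtime", TimestampType)
  .add("payload", StringType)

val grouped = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input-topic")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("m"))
  .select("m.*")
  // Late data older than 20 seconds (by event time) is dropped.
  .withWatermark("eventtime", "20 seconds")
  .groupBy(col("sharedId"), window(col("eventtime"), "20 seconds", "10 seconds"))
  .count()

grouped.writeStream
  .format("console")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()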