Re: Static partitioning in partitionBy()

2019-05-07 Thread Felix Cheung
You could do df.filter(col("c") === "c1").write.partitionBy("c").save. It could run into a data-skew problem, but it might work for you.
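
A minimal sketch of that suggestion, assuming a DataFrame df with a string partition column c and a hypothetical output path (neither is confirmed by the thread):

    import org.apache.spark.sql.functions.col

    // Keep only the rows for the static partition value, then let
    // partitionBy() derive the c=c1 directory layout on disk.
    df.filter(col("c") === "c1")
      .write
      .mode("overwrite")          // mode is an assumption here
      .partitionBy("c")
      .save("/tmp/output/table")  // hypothetical path

Note that with plain file sources, mode("overwrite") replaces the whole output path by default; on Spark 2.3+ setting spark.sql.sources.partitionOverwriteMode=dynamic limits the overwrite to the partitions being written.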

Create table from Avro-generated parquet files?

2019-05-07 Thread Coolbeth, Matthew
I have a "directory" in S3 containing Parquet files created from Avro using the AvroParquetWriter in the parquet-mr project. I can load the contents of these files as a DataFrame using val it = spark.read.parquet("s3a://coolbeth/file=testy"), but I have not found a way to define a permanent table over them.
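
A sketch of one way to make that permanent, assuming the files read as plain Parquet and the SparkSession is backed by a metastore (the table name here is hypothetical):

    // Define an external table over the existing Parquet directory;
    // the schema is inferred from the files themselves.
    spark.sql("""
      CREATE TABLE testy
      USING parquet
      LOCATION 's3a://coolbeth/file=testy'
    """)

    // The definition now survives across sessions via the metastore.
    spark.table("testy").show()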

Re: Static partitioning in partitionBy()

2019-05-07 Thread Burak Yavuz
It depends on the data source. Delta Lake (https://delta.io) allows you to do it with .option("replaceWhere", "c = c1"). With other file formats, you can write directly into the partition directory (tablePath/c=c1), but you lose atomicity.
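
A hedged sketch of the Delta Lake variant Burak describes, assuming c is a string column (hence the quoted 'c1') and a hypothetical table path:

    import org.apache.spark.sql.functions.col

    // Atomically replace only the c = 'c1' partition of a Delta table.
    df.filter(col("c") === "c1")
      .write
      .format("delta")
      .mode("overwrite")
      .option("replaceWhere", "c = 'c1'")  // only rows matching this predicate are replaced
      .save("/delta/events")               // hypothetical path

Delta requires that every written row satisfy the replaceWhere predicate, which is why the filter() is kept in front of the write.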

Structured Streaming Kafka - Weird behavior with performance and logs

2019-05-07 Thread Austin Weaver
Hey Spark Experts, After listening to some of you, and the presentations at Spark Summit in SF, I am transitioning from DStreams to Structured Streaming; however, I am seeing some weird results. My use case is as follows: I am reading in a stream from a Kafka topic, transforming a message, and
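
For context, a minimal Structured Streaming skeleton of the kind of pipeline described — broker, topic, transformation, sink, and checkpoint location are all placeholder assumptions, not details from the thread:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("kafka-stream").getOrCreate()
    import spark.implicits._

    // Kafka values arrive as binary; cast to string, then transform per message.
    val transformed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
      .option("subscribe", "input-topic")                // placeholder
      .load()
      .selectExpr("CAST(value AS STRING) AS message")
      .map(_.getString(0).toUpperCase)                   // stand-in transformation

    val query = transformed.writeStream
      .format("console")                                 // stand-in sink
      .option("checkpointLocation", "/tmp/checkpoints")  // placeholder
      .start()
    query.awaitTermination()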

Static partitioning in partitionBy()

2019-05-07 Thread Shubham Chaurasia
Hi All, Is there a way I can provide static partitions in partitionBy()? Like: df.write.mode("overwrite").format("MyDataSource").partitionBy("c=c1").save The above code gives the following error because it tries to find a column named `c=c1` in df: org.apache.spark.sql.AnalysisException: Partition column `c=c1`
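
For reference, partitionBy() accepts column names only, not key=value assignments, which is why the call above fails; a working baseline (output path is hypothetical) looks like:

    // Partition by the column itself; Spark derives the c=<value>
    // directory names from the data at write time.
    df.write
      .mode("overwrite")
      .format("MyDataSource")
      .partitionBy("c")           // column name, not "c=c1"
      .save("/tmp/output/table")  // hypothetical path

The filter() and replaceWhere approaches in the replies above cover the static-value case.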

ThriftServer gc over exceed and memory problem

2019-05-07 Thread shicheng31...@gmail.com
Hi all: My Spark version is 2.3.2. I start the Thrift Server with the default Spark config. Separately, I use a Java application to query results via JDBC. The query application has plenty of statements to execute. The earlier statements execute very quickly, but the later statements
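
For reproduction context, a minimal JDBC client sketch against the Thrift Server — host, port, credentials, and query are placeholder assumptions; it only requires the standard Hive JDBC driver on the classpath:

    import java.sql.DriverManager

    // The Thrift Server listens on the hive2 protocol, port 10000 by default.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT 1")  // placeholder statement
    while (rs.next()) println(rs.getInt(1))
    rs.close(); stmt.close(); conn.close()

Since the Thrift Server executes queries inside the Spark driver process, a common first step for GC-overhead symptoms is to raise its memory, e.g. start-thriftserver.sh --driver-memory 4g.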

Re: Dynamic metric names

2019-05-07 Thread Roberto Coluccio
It would be a dream to have an easy-to-use dynamic metric system AND a reliable counting system (accumulator-like) in Spark... Thanks Roberto On Tue, May 7, 2019 at 3:54 AM Saisai Shao wrote: > I think the main reason why that was not merged is that Spark itself doesn't have such

Re: Spark SQL met "Block broadcast_xxx not found"

2019-05-07 Thread Jacek Laskowski
Hi, I'm curious about "I found the bug code". Can you point me at it? Thanks. Regards, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams

Re: Spark SQL met "Block broadcast_xxx not found"

2019-05-07 Thread Xilang Yan
OK... I am sure it is a bug in Spark. I found the buggy code, but that code was removed in 2.2.3, so I just upgraded Spark to fix the problem.

Re: Spark structured streaming watermarks on nested attributes

2019-05-07 Thread Joe Ammann
Hi Yuanjian On 5/7/19 4:55 AM, Yuanjian Li wrote: > Hi Joe > I think you met this issue: https://issues.apache.org/jira/browse/SPARK-27340 > You can check the description in the Jira and PR. We also met this in our production env and fixed it with the provided PR. > The PR is still in review. cc
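
While that PR is pending, a hedged sketch of the usual workaround: promote the nested timestamp to a top-level column before applying the watermark, since withWatermark takes a top-level column name (events and payload.eventTime are hypothetical names, not from the thread):

    import org.apache.spark.sql.functions.{col, window}

    // withWatermark expects a top-level column, so pull the nested
    // field up first and watermark on the flattened name.
    val flattened = events.withColumn("eventTime", col("payload.eventTime"))

    val counts = flattened
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window(col("eventTime"), "5 minutes"))
      .count()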