Issue upgrading to Spark 2.3.1 (Maintenance Release)

2018-06-14 Thread Aakash Basu
Hi, I downloaded the latest Spark version because of the fix for "ERROR AsyncEventQueue:70 - Dropping event from queue appStatus." After setting environment variables and running the same code in PyCharm, I'm getting this error, for which I can't find a solution. Exception in thread "main"

Re: array_contains in package org.apache.spark.sql.functions

2018-06-14 Thread 刘崇光
Hello Takuya, Thanks for your message. I will do the JIRA and PR. Best regards, Chongguang On Thu, Jun 14, 2018 at 11:25 PM, Takuya UESHIN wrote: > Hi Chongguang, > > Thanks for the report! > > That makes sense and the proposition should work, or we can add something > like `def

Re: Spark user classpath setting

2018-06-14 Thread Arjun kr
Thanks a lot, Marcelo!! That did the trick. :) Regards, Arjun From: Marcelo Vanzin Sent: Friday, June 15, 2018 2:07 AM To: Arjun kr Cc: user@spark.apache.org Subject: Re: Spark user classpath setting I only know of a way to do that with YARN. You can distribute

Re: array_contains in package org.apache.spark.sql.functions

2018-06-14 Thread Takuya UESHIN
Hi Chongguang, Thanks for the report! That makes sense and the proposition should work, or we can add something like `def array_contains(column: Column, value: Column)`. Maybe other functions, such as `array_position`, `element_at`, are the same situation. Could you file a JIRA, and submit a PR
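The overload being proposed can be sketched as follows (a sketch only; the final signature and implementation are up to the JIRA/PR, and `ArrayContains` here refers to the existing Catalyst expression):

```scala
import org.apache.spark.sql.Column

// Existing variant in org.apache.spark.sql.functions: the value is a literal.
// def array_contains(column: Column, value: Any): Column

// Proposed variant: the value is itself a Column, so it can differ per row,
// e.g. array_contains(col("tags"), col("current_tag")).
// def array_contains(column: Column, value: Column): Column =
//   new Column(ArrayContains(column.expr, value.expr))
```

As noted in the thread, `array_position` and `element_at` only accept literal values in the same way, so the same Column-based overload would apply to them.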

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
Cody, Thank you. Let me see if I can reproduce this. We're not seeing offsets load correctly on startup - but perhaps there is an error on my side. Bryan Get Outlook for Android From: Cody Koeninger Sent: Thursday, June 14, 2018 5:01:01

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Cody Koeninger
Offsets are loaded when you instantiate an org.apache.kafka.clients.consumer.KafkaConsumer, subscribe, and poll. There's not an explicit api for it. Have you looked at the output of kafka-consumer-groups.sh and tried the example code I linked to? bash-3.2$ ./bin/kafka-consumer-groups.sh
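The committed offsets the driver-side consumer will resume from can be inspected with the tool mentioned above (broker address and group id are placeholders):

```shell
# --describe prints, per partition: current committed offset, log end
# offset, and lag for the given consumer group.
./bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group my-spark-group
```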

Re: Spark user classpath setting

2018-06-14 Thread Marcelo Vanzin
I only know of a way to do that with YARN. You can distribute the jar files using "--files" and add just their names (not the full path) to the "extraClassPath" configs. You don't need "userClassPathFirst" in that case. On Thu, Jun 14, 2018 at 1:28 PM, Arjun kr wrote: > Hi All, > > > I am
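A sketch of that approach (jar names and paths are assumptions): YARN copies files passed via `--files` into each container's working directory, so the bare file name resolves on the classpath without any node-local install:

```shell
# Ship the custom jars with the application, then reference them by bare
# name in the extraClassPath configs -- no need for userClassPathFirst.
spark-submit \
  --master yarn \
  --files /local/path/custom-lib1.jar,/local/path/custom-lib2.jar \
  --conf spark.driver.extraClassPath=custom-lib1.jar:custom-lib2.jar \
  --conf spark.executor.extraClassPath=custom-lib1.jar:custom-lib2.jar \
  my_script.py
```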

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
Cody, Where is that called in the driver? The only call I see from Subscribe is to load the offset from checkpoint. Get Outlook for Android From: Cody Koeninger Sent: Thursday, June 14, 2018 4:24:58 PM To: Bryan Jeffrey Cc: user Subject:

Spark user classpath setting

2018-06-14 Thread Arjun kr
Hi All, I am trying to execute a sample Spark script (one that uses Spark JDBC) which has dependencies on a set of custom jars. These custom jars need to be added to the classpath first. Currently, I have copied the custom lib directory to all the nodes and am able to execute it with the command below.

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Cody Koeninger
The code that loads offsets from kafka is in e.g. org.apache.kafka.clients.consumer, it's not in spark. On Thu, Jun 14, 2018 at 3:22 PM, Bryan Jeffrey wrote: > Cody, > > Can you point me to the code that loads offsets? As far as I can see with > Spark 2.1, the only offset load is from

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
Cody, Can you point me to the code that loads offsets? As far as I can see with Spark 2.1, the only offset load is from checkpoint. Thank you! Bryan Get Outlook for Android From: Cody Koeninger Sent: Thursday, June 14, 2018 4:00:31 PM

Re: Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Cody Koeninger
The expectation is that you shouldn't have to manually load offsets from kafka, because the underlying kafka consumer on the driver will start at the offsets associated with the given group id. That's the behavior I see with this example:

Kafka Offset Storage: Fetching Offsets

2018-06-14 Thread Bryan Jeffrey
Hello. I am using Spark 2.1 and Kafka 0.10.2.1 and the DStream interface. Based on the documentation ( https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html#kafka-itself), it appears that you can now use Kafka itself to store offsets. I've set up a simple Kafka DStream: val
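The pattern from the linked integration guide can be sketched like this (topic, broker, and group names are placeholders; `ssc` is an existing StreamingContext): auto-commit is disabled and offsets are committed back to Kafka after each batch, so the group id's committed offsets become the resume point.

```scala
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-spark-group",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd ...
  // Commit offsets back to Kafka once the batch's output is done.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```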

[structured-streaming][parquet] readStream files order in Parquet

2018-06-14 Thread karthikjay
My parquet files are first partitioned by environment and then by date, like:
env=testing/
  date=2018-03-04/
    part1.parquet part2.parquet part3.parquet
  date=2018-03-05/
    part1.parquet part2.parquet part3.parquet
  date=2018-03-06/
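For a layout like the above, a streaming read can be sketched as follows (paths and schema are assumptions). Note that the file source orders intake by file discovery, not by partition value, so the `latestFirst` and `maxFilesPerTrigger` options are the supported knobs for controlling which files a trigger picks up:

```scala
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

val df = spark.readStream
  .schema(schema)                      // file-source streams require a schema
  .option("maxFilesPerTrigger", 10)    // cap files consumed per micro-batch
  .option("latestFirst", "false")      // process oldest discovered files first
  .parquet("/data/env=testing/")
```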

Re: Using G1GC in Spark

2018-06-14 Thread Aakash Basu
Thanks a lot, Srinath, for your continued help. On Thu, Jun 14, 2018 at 5:49 PM, Srinath C wrote: > You'll have to use the "spark.executor.extraJavaOptions" configuration > parameter: see the documentation link.

Re: Live Streamed Code Review today at 11am Pacific

2018-06-14 Thread Holden Karau
Next week is Pride in San Francisco, but I'm still going to do two quick sessions. One will be live coding with Apache Spark to collect ASF diversity information ( https://www.youtube.com/watch?v=OirnFnsU37A / https://www.twitch.tv/events/O1edDMkTRBGy0I0RCK-Afg ) on Monday at 9am Pacific and the

Re: Using G1GC in Spark

2018-06-14 Thread Srinath C
You'll have to use the "spark.executor.extraJavaOptions" configuration parameter; see the documentation link: --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" Regards, Srinath. On Thu, Jun 14, 2018 at 4:44 PM Aakash
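Applied to the spark-submit command quoted in the thread, that looks like this (a sketch; the master URL and script path are from the original question): the GC flag goes inside `spark.executor.extraJavaOptions` rather than being passed as a bare `--conf` value, and the driver has its own equivalent setting.

```shell
spark-submit \
  --master spark://192.168.60.20:7077 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  /appdata/bblite-codebase/test.py
```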

Using G1GC in Spark

2018-06-14 Thread Aakash Basu
Hi, I am trying to spark-submit with G1GC for garbage collection, but it isn't working. What is the way to deploy a Spark job with G1GC? Tried: *spark-submit --master spark://192.168.60.20:7077 --conf -XX:+UseG1GC /appdata/bblite-codebase/test.py* It didn't work.

Fwd: array_contains in package org.apache.spark.sql.functions

2018-06-14 Thread 刘崇光
-- Forwarded message -- From: 刘崇光 Date: Thu, Jun 14, 2018 at 11:08 AM Subject: array_contains in package org.apache.spark.sql.functions To: user@spark.apache.org Hello all, I ran into a use case in project with spark sql and want to share with you some thoughts about the

unsubscribe

2018-06-14 Thread panda
unsubscribe

Crosstab/ApproxQuantile Performance on Spark Cluster

2018-06-14 Thread Aakash Basu
Hi all, Does the Event Timeline look like a reasonable shape? At one point, to calculate WoE columns on categorical variables, I have to run a crosstab on each column, and on a cluster of 4 nodes it is taking a long time, as I have 230+ columns and 60,000 rows. How can I make it more performant?
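The per-column crosstab pattern described above can be sketched like this (column names are placeholders, and `df` is an existing DataFrame). Each `crosstab` call launches its own job, so looping it over 230+ columns serializes 230+ shuffles; a `groupBy`/`pivot`/`count` per column computes the same contingency counts and is often cheaper to tune:

```scala
// Pair-wise contingency table via DataFrameStatFunctions:
val ct = df.stat.crosstab("category_col", "label")

// Equivalent counts via a single groupBy/pivot per column, which can be
// cheaper when the label has few distinct values:
val counts = df.groupBy("category_col").pivot("label").count()
```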