Spark streaming filling the disk with logs

2019-02-13 Thread Deepak Sharma
Hi All, I am running a Spark streaming job with the configuration below: --conf "spark.executor.extraJavaOptions=-Droot.logger=WARN,console" But it’s still filling the disk with INFO logs. If the logging level is set to WARN at the cluster level, then only the WARN logs are getting written, but then it
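For reference, a minimal sketch (not from the thread) of quieting driver-side logging from within the application; setLogLevel overrides the log4j root level on the driver, while executor logging usually still needs a log4j.properties shipped to the executors (e.g. via --files):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("quiet-logs").getOrCreate()
  // Overrides the log4j root level for this SparkContext (driver side only).
  spark.sparkContext.setLogLevel("WARN")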

Re: "where" clause able to access fields not in its schema

2019-02-13 Thread Yeikel
It seems that we are using the function incorrectly. val a = Seq((1,10),(2,20)).toDF("foo","bar") val b = a.select($"foo") val c = b.where(b("bar") === 20) c.show Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "bar" among
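For reference, a minimal sketch of the same pipeline with the filter applied before "bar" is projected away, which avoids relying on analyzer behaviour (the column names and toy data are the values from the thread):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("where-sketch").getOrCreate()
  import spark.implicits._

  val a = Seq((1, 10), (2, 20)).toDF("foo", "bar")
  // Filter on "bar" while it is still in the schema, then project it away.
  val c = a.where($"bar" === 20).select($"foo")
  c.show()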

Re: Pyspark elementwise matrix multiplication

2019-02-13 Thread Yeikel
Elementwise product is described here: https://spark.apache.org/docs/latest/mllib-feature-extraction.html#elementwiseproduct I don't know if it will work with your input, though. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
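For reference, a minimal sketch of the ElementwiseProduct transformer from the linked MLlib page; the vectors are made-up example values:

  import org.apache.spark.mllib.feature.ElementwiseProduct
  import org.apache.spark.mllib.linalg.Vectors

  val scalingVector = Vectors.dense(0.0, 1.0, 2.0)
  val transformer = new ElementwiseProduct(scalingVector)

  // Multiplies the input vector element by element with the scaling vector.
  val transformed = transformer.transform(Vectors.dense(1.0, 2.0, 3.0))
  // transformed == [0.0, 2.0, 6.0]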

Re: SparkR + binary type + how to get value

2019-02-13 Thread Felix Cheung
Please share your code. From: Thijs Haarhuis Sent: Wednesday, February 13, 2019 6:09 AM To: user@spark.apache.org Subject: SparkR + binary type + how to get value Hi all, Does anybody have any experience in accessing the data from a column which has a binary

Re: Got fatal error when running spark 2.4.0 on k8s

2019-02-13 Thread dawn breaks
It seems that the fabric8 Kubernetes client can't parse the caCertFile in the default location /var/run/secrets/kubernetes.io/serviceaccount/ca.crt. Can anybody give me some advice? On Wed, 13 Feb 2019 at 16:21, dawn breaks <2005dawnbre...@gmail.com> wrote: > we submit spark job to k8s by the

Re: "where" clause able to access fields not in its schema

2019-02-13 Thread Vadim Semenov
Yeah, the filter gets in front of the select after analysis: scala> b.where($"bar" === 20).explain(true) == Parsed Logical Plan == 'Filter ('bar = 20) +- AnalysisBarrier +- Project [foo#6] +- Project [_1#3 AS foo#6, _2#4 AS bar#7] +- SerializeFromObject

Re: Spark with Kubernetes connecting to pod ID, not address

2019-02-13 Thread Pat Ferrel
Hmm, I’m not asking about using k8s to control Spark as a Job manager or scheduler like Yarn. We use the built-in standalone Spark Job Manager and spark://spark-api:7077 as the master, not k8s. The problem is using k8s to manage a cluster consisting of our app, some databases, and Spark (one

Re: "where" clause able to access fields not in its schema

2019-02-13 Thread Yeikel
This is indeed strange. To add to the question, I can see that if I use a filter I get an exception (as expected), so I am not sure what the difference is between the where clause and filter: b.filter(s=> { val bar : String = s.getAs("bar") bar.equals("20") }).show *

"where" clause able to access fields not in its schema

2019-02-13 Thread Alex Nastetsky
I don't know if this is a bug or a feature, but it's a bit counter-intuitive when reading code. The "b" dataframe does not have field "bar" in its schema, but is still able to filter on that field. scala> val a = sc.parallelize(Seq((1,10),(2,20))).toDF("foo","bar") a:
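For reference, a minimal reproduction of the behaviour described: filtering b with the unresolved column $"bar" succeeds even though b's schema only contains "foo", and explain(true) shows how the analyzer resolves "bar" against the underlying projection:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("where-not-in-schema").getOrCreate()
  import spark.implicits._

  val a = Seq((1, 10), (2, 20)).toDF("foo", "bar")
  val b = a.select($"foo")

  b.where($"bar" === 20).show()        // works: returns the row with foo = 2
  b.where($"bar" === 20).explain(true) // shows 'bar resolved through the projection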

Re: Spark2 DataFrameWriter.saveAsTable defaults to external table if path is provided

2019-02-13 Thread Chris Teoh
Thanks Peter. I'm not sure if that is possible yet. The closest I can think of to achieving what you want is to try something like: df.registerTempTable("mytable") sql("create table mymanagedtable as select * from mytable") I haven't used CTAS in Spark SQL before but have heard it works. This
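For reference, a sketch of the suggested approach using the current API names (createOrReplaceTempView rather than the deprecated registerTempTable); table names and data are placeholders:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder
    .appName("ctas-sketch")
    .enableHiveSupport()
    .getOrCreate()
  import spark.implicits._

  val df = Seq((1, "a"), (2, "b")).toDF("id", "name")  // placeholder data

  // Register the DataFrame and create a managed table with CTAS.
  df.createOrReplaceTempView("mytable")
  spark.sql("CREATE TABLE mymanagedtable AS SELECT * FROM mytable")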

Stage or Tasks level logs missing

2019-02-13 Thread Nirav Patel
Currently there seem to be 3 places to check task-level logs: 1) Using the Spark UI 2) `yarn application log` 3) log aggregation on HDFS (if enabled). All of the above only give you logs at the executor (container) level. However, one executor can have multiple threads and each might be running part of different
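For reference, a sketch (not from the thread) of one workaround: tagging log lines emitted inside a task with TaskContext information, so lines written by different tasks sharing an executor JVM can be told apart; the job and message are made up:

  import org.apache.spark.TaskContext
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("task-log-sketch").getOrCreate()

  spark.sparkContext.parallelize(1 to 100, 4).foreachPartition { records =>
    val tc = TaskContext.get()
    // Stage, partition and task attempt ids identify which task wrote the line.
    println(s"[stage=${tc.stageId()} partition=${tc.partitionId()} " +
      s"task=${tc.taskAttemptId()}] processing ${records.size} records")
  }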

Re: Spark2 DataFrameWriter.saveAsTable defaults to external table if path is provided

2019-02-13 Thread Horváth Péter Gergely
Hi Chris, Thank you for the input, I know I can always write the table DDL manually. But here I would like to rely on Spark generating the schema. What I don't understand is the change in the behaviour of Spark: having the storage path specified does not necessarily mean it should be an external

Design recommendation

2019-02-13 Thread Kumar sp
Hello, I need a design recommendation. I need to perform a couple of calculations with minimal shuffling and better performance. I have a nested structure where, say, a class has n students, and the structure will be similar to this: { classId: String, StudendId:String, Score:Int, AreaCode:String}
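For reference, a minimal sketch of the kind of single-pass aggregation that keeps shuffling to one exchange; the case class mirrors the structure in the question, and the particular aggregates (min/avg per class) are only assumptions about what needs to be calculated:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  case class Record(classId: String, studentId: String, score: Int, areaCode: String)

  val spark = SparkSession.builder.appName("design-sketch").getOrCreate()
  import spark.implicits._

  val records = Seq(
    Record("c1", "s1", 80, "A"),
    Record("c1", "s2", 90, "A"),
    Record("c2", "s3", 70, "B")
  ).toDS()

  // Compute several metrics in one groupBy so the data is shuffled only once.
  val perClass = records.groupBy($"classId").agg(
    min($"score").as("minScore"),
    avg($"score").as("avgScore"),
    countDistinct($"areaCode").as("areaCodes"))

  perClass.show()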

[no subject]

2019-02-13 Thread Kumar sp

SparkR + binary type + how to get value

2019-02-13 Thread Thijs Haarhuis
Hi all, Does anybody have any experience in accessing the data from a column which has a binary type in a Spark Data Frame in R? I have a Spark Data Frame which has a column which is of a binary type. I want to access this data and process it. In my case I collect the spark data frame to an R

RE: Got fatal error when running spark 2.4.0 on k8s

2019-02-13 Thread Sinha, Breeta (Nokia - IN/Bangalore)
Hi Dawn, Probably you are providing an incorrect image (it must be a Java image), an incorrect master IP, or the wrong service account. Please verify the pod’s permissions for the service account (‘spark’ in your case). I have tried executing the same program as below: ./spark-submit --master

Re: Spark2 DataFrameWriter.saveAsTable defaults to external table if path is provided

2019-02-13 Thread Chris Teoh
Hey there, Could you not just create a managed table using DDL in Spark SQL and then write the data frame to the underlying folder, or use Spark SQL to do an insert? Alternatively, try CREATE TABLE AS SELECT. IIRC Hive creates managed tables this way. I've not confirmed this works but I

Spark2 DataFrameWriter.saveAsTable defaults to external table if path is provided

2019-02-13 Thread Horváth Péter Gergely
Dear All, I am facing a strange issue with Spark 2.3, where I would like to create a MANAGED table out of the content of a DataFrame with the storage path overridden. Apparently, when one tries to create a Hive table via DataFrameWriter.saveAsTable, supplying a "path" option causes Spark to
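For reference, a minimal reproduction of the call being described: supplying a "path" option to DataFrameWriter.saveAsTable, which in Spark 2.3 produces an EXTERNAL rather than a MANAGED table; path, table name and data are placeholders:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder
    .appName("saveastable-path")
    .enableHiveSupport()
    .getOrCreate()
  import spark.implicits._

  val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

  df.write
    .option("path", "/custom/warehouse/mytable")  // overriding the storage location
    .saveAsTable("mytable")                       // ends up as an EXTERNAL table

  spark.sql("DESCRIBE FORMATTED mytable").show(100, truncate = false)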

Subscribe

2019-02-13 Thread Rafael Mendes

Re: Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Not authorized to access group: spark-kafka-source-060f3ceb-09f4-4e28-8210-3ef8a845fc92--2038748645-driver-2

2019-02-13 Thread Jungtaek Lim
Adding to Gabor's answer, in Spark 3.0 end users can even provide the full group id (please refer to SPARK-26350 [1]), but you may find it more convenient to use the group id prefix Gabor mentioned (please refer to SPARK-26121 [2]) to grant permission to a broader range of groups. 1.
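For reference, a sketch of where the two options land in a Structured Streaming Kafka source; per the JIRAs above, "kafka.group.id" (full group id, SPARK-26350) and "groupIdPrefix" (SPARK-26121) are available from Spark 3.0, and the broker, topic and group names here are placeholders:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("kafka-group-id").getOrCreate()

  val stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "mytopic")
    .option("groupIdPrefix", "myapp")            // Spark appends a unique suffix
    // .option("kafka.group.id", "myapp-group")  // or pin the exact group id (Spark 3.0+)
    .load()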

Re: Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Not authorized to access group: spark-kafka-source-060f3ceb-09f4-4e28-8210-3ef8a845fc92--2038748645-driver-2

2019-02-13 Thread Gabor Somogyi
Hi Thomas, The issue occurs when the user does not have READ permission on the consumer groups. In DStreams, the group ID is configured in the application, for example:
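For reference, a sketch of how the group id is typically set in a DStreams application (the point being made above); broker, topic and group names are placeholders:

  import org.apache.kafka.clients.consumer.ConsumerConfig
  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010._

  val conf = new SparkConf().setAppName("dstream-group-id")
  val ssc = new StreamingContext(conf, Seconds(10))

  val kafkaParams = Map[String, Object](
    ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "broker1:9092",
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    ConsumerConfig.GROUP_ID_CONFIG -> "my-consumer-group"  // needs READ permission for this user
  )

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](Seq("mytopic"), kafkaParams)
  )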

Got fatal error when running spark 2.4.0 on k8s

2019-02-13 Thread dawn breaks
We submit a Spark job to k8s with the following command, and the driver pod gets an error and exits. Can anybody help us solve it? ./bin/spark-submit \ --master k8s://https://172.21.91.48:6443 \ --deploy-mode cluster \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark