Re: Oracle Table not resolved [Spark 2.1.1]

2017-08-28 Thread Imran Rajjad
The jdbc url is invalid, but strangely it should have thrown an ORA- exception. On Mon, Aug 28, 2017 at 4:55 PM, Naga G wrote: > Not able to find the database name. > Is ora the database in the below url? > > Sent from Naga iPad > > > On Aug 28, 2017, at 4:06 AM, Imran Rajjad

Re: use WithColumn with external function in a java jar

2017-08-28 Thread Praneeth Gayam
You can create a UDF which will invoke your java lib def calculateExpense: UserDefinedFunction = udf((pexpense: String, cexpense: String) => new MyJava().calculateExpense(pexpense.toDouble, cexpense.toDouble)) On Tue, Aug 29, 2017 at 6:53 AM, purna pradeep wrote: >
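
A minimal self-contained sketch of the approach Praneeth describes; MyJava and its calculateExpense body are stand-ins for the class shipped in the external jar, and the column names come from the original question:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{col, udf}

  // Stand-in for the class shipped in the external java jar (hypothetical body).
  class MyJava { def calculateExpense(p: Double, c: Double): Double = p + c }

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  // Columns from the original question; everything arrives as String from the csv.
  val df = Seq(("e1", "100.0", "50.0")).toDF("employeeid", "pexpense", "cexpense")

  // Wrap the java call in a UDF and add the derived column.
  val calculateExpense = udf((pexpense: String, cexpense: String) =>
    new MyJava().calculateExpense(pexpense.toDouble, cexpense.toDouble))

  val withExpense = df.withColumn("expense", calculateExpense(col("pexpense"), col("cexpense")))
  withExpense.show()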

use WithColumn with external function in a java jar

2017-08-28 Thread purna pradeep
I have data in a DataFrame with the below columns. 1) File format is csv. 2) All below column datatypes are String: employeeid, pexpense, cexpense. Now I need to create a new DataFrame which has a new column called `expense`, which is calculated based on columns `pexpense`, `cexpense`. The tricky part is

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread swetha kasireddy
Hi Cody, Following is the way that I am consuming data for a 60 second batch. Do you see anything that is wrong with the way the data is getting consumed that can cause slowness in performance? val kafkaParams = Map[String, Object]( "bootstrap.servers" -> kafkaBrokers,
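
For context, a typical parameter map for the 0.10 direct stream looks roughly like this; the broker list and group id are placeholders, not the poster's actual settings:

  import org.apache.kafka.common.serialization.StringDeserializer

  val kafkaBrokers = "broker1:9092,broker2:9092"  // placeholder

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> kafkaBrokers,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "streaming-consumer-group",
    "auto.offset.reset" -> "latest",
    // manual commits are the usual choice with the direct stream
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )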

Re: Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Thanks for responding BUT I would not be reading from a file if it was Hive. I'm comparing Hive LLAP from a hive table vs Spark SQL from a file. That is the question. Thanks On Mon, Aug 28, 2017 at 1:58 PM, Imran Rajjad wrote: > If reading directly from file then Spark SQL

Re: Help taking last value in each group (aggregates)

2017-08-28 Thread Everett Anderson
I'm still unclear on if orderBy/groupBy + aggregates is a viable approach or when one could rely on the last or first aggregate functions, but a working alternative is to use window functions with row_number and a filter kind of like this: import spark.implicits._ val reverseOrdering = Seq("a",
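
A small self-contained sketch of that row_number-plus-filter pattern, reusing the column names from the thread (the sample data is made up):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, row_number}

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  // Toy frame reusing the column names from the thread.
  val df = Seq(
    ("g1", 1, 1, "r1"), ("g1", 2, 2, "r2"), ("g2", 1, 5, "r3")
  ).toDF("group_id", "a", "b", "row_id")

  // Rank rows inside each group by the reverse of the desired ordering,
  // then keep only the first row per group, i.e. the "last" value.
  val reverseWindow = Window.partitionBy("group_id").orderBy(col("a").desc, col("b").desc)

  val lastPerGroup = df
    .withColumn("rn", row_number().over(reverseWindow))
    .filter(col("rn") === 1)
    .drop("rn")

  lastPerGroup.show()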

Re: Referencing YARN application id, YARN container hostname, Executor ID and YARN attempt for jobs running on Spark EMR 5.7.0 in log statements?

2017-08-28 Thread Mikhailau, Alex
Thanks, Vadim. The issue is not access to logs. I am able to view them. I have cloudwatch logs agent push logs to elasticsearch and then into Kibana using json-event-layout for log4j output. I would like to also log applicationId, executorId, etc in those log statements for clarity. Is there an
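
One possible (unverified) way to get those values into the log4j output is to put them into the logging MDC at the start of each task and reference them from the layout pattern with %X{...}; executorId and spark.app.id are reachable through SparkEnv. This is only a sketch, not an EMR-provided mechanism, and rdd stands for whatever work the job does:

  import org.apache.spark.{SparkEnv, TaskContext}
  import org.slf4j.MDC

  rdd.mapPartitions { iter =>
    // Tag the task's logging context; a pattern such as
    // %X{applicationId} %X{executorId} %X{attemptNumber} then picks these up.
    MDC.put("applicationId", SparkEnv.get.conf.get("spark.app.id"))
    MDC.put("executorId", SparkEnv.get.executorId)
    MDC.put("attemptNumber", TaskContext.get.attemptNumber.toString)
    iter
  }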

Re: Referencing YARN application id, YARN container hostname, Executor ID and YARN attempt for jobs running on Spark EMR 5.7.0 in log statements?

2017-08-28 Thread Vadim Semenov
When you create an EMR cluster you can specify an S3 path where logs will be saved, something like this: s3://bucket/j-18ASDKLJLAKSD/containers/application_1494074597524_0001/container_1494074597524_0001_01_01/stderr.gz

Referencing YARN application id, YARN container hostname, Executor ID and YARN attempt for jobs running on Spark EMR 5.7.0 in log statements?

2017-08-28 Thread Mikhailau, Alex
Does anyone have a working solution for logging YARN application id, YARN container hostname, Executor ID and YARN attempt for jobs running on Spark EMR 5.7.0 in log statements? Are there specific ENV variables available or other workflow for doing that? Thank you Alex

[Spark Streaming] Application is stopped after stopping a worker

2017-08-28 Thread Davide.Mandrini
I am running a spark streaming application on a cluster composed of three nodes, each one with a worker and three executors (so a total of 9 executors). I am using spark standalone mode. The application is run with a spark-submit command with option --deploy-mode client. The submit command is

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread swetha kasireddy
There is no difference in performance even with Cache being enabled. On Mon, Aug 28, 2017 at 11:27 AM, swetha kasireddy < swethakasire...@gmail.com> wrote: > There is no difference in performance even with Cache being disabled. > > On Mon, Aug 28, 2017 at 7:43 AM, Cody Koeninger

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
OK, I see there is a describe() function which does the stat calculation on a Dataset, similar to StatCounter. However, I don't want to restrict my aggregations to the standard mean, stddev etc.; I want to generate some custom stats, or may not run all the predefined stats but only a subset of them on the

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Vadim Semenov
I didn't tailor it to your needs, but this is what I can offer you, the idea should be pretty clear import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions.{collect_list, struct} val spark: SparkSession import spark.implicits._ case class Input( a: Int, b: Long, c:
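
The preview cuts the snippet off; a self-contained sketch of what the collect_list/struct idea looks like (Input and the sample rows are hypothetical):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{collect_list, struct}

  case class Input(a: Int, b: Long, c: Double)

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val ds = Seq(Input(1, 10L, 1.0), Input(1, 20L, 2.0), Input(2, 30L, 3.0)).toDS()

  // Collect each group's rows into an array of structs; custom statistics can
  // then be computed from that array per group.
  val grouped = ds
    .groupBy($"a")
    .agg(collect_list(struct($"b", $"c")).as("rows"))

  grouped.show(truncate = false)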

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Georg Heiler
Rdd only Patrick schrieb am Mo. 28. Aug. 2017 um 20:13: > Ah, does it work with Dataset API or i need to convert it to RDD first ? > > On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler > wrote: > >> What about the rdd stat counter? >>

RE: from_json()

2017-08-28 Thread JG Perrin
Thanks Sam – this might be the solution. I will investigate! From: Sam Elamin [mailto:hussam.ela...@gmail.com] Sent: Monday, August 28, 2017 1:14 PM To: JG Perrin Cc: user@spark.apache.org Subject: Re: from_json() Hi jg, Perhaps I am misunderstanding you, but if you just

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread swetha kasireddy
There is no difference in performance even with Cache being disabled. On Mon, Aug 28, 2017 at 7:43 AM, Cody Koeninger wrote: > So if you can run with cache enabled for some time, does that > significantly affect the performance issue you were seeing? > > Those settings seem

Re: from_json()

2017-08-28 Thread Sam Elamin
Hi jg, Perhaps I am misunderstanding you, but if you just want to create a new schema from a df its fairly simple, assuming you have a schema already predefined or in a string. i.e. val newSchema = DataType.fromJson(json_schema_string) then all you need to do is re-create the dataframe using
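
A sketch of that schema round-trip; json_schema_string, df and the json_col column are assumptions for illustration:

  import org.apache.spark.sql.functions.{col, from_json}
  import org.apache.spark.sql.types.{DataType, StructType}

  // A schema serialized earlier, e.g. via someDataFrame.schema.json
  val json_schema_string =
    """{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}}]}"""
  val newSchema = DataType.fromJson(json_schema_string).asInstanceOf[StructType]

  // Parse a string column of JSON documents with the rebuilt schema.
  val parsed = df.withColumn("parsed", from_json(col("json_col"), newSchema))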

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
Ah, does it work with the Dataset API or do I need to convert it to an RDD first? On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler wrote: > What about the rdd stat counter? https://spark.apache.org/docs/ > 0.6.2/api/core/spark/util/StatCounter.html > > Patrick

Re: Spark SQL vs HiveQL

2017-08-28 Thread Imran Rajjad
If reading directly from file then Spark SQL should be your choice On Mon, Aug 28, 2017 at 10:25 PM Michael Artz wrote: > Just to be clear, I'm referring to having Spark reading from a file, not > from a Hive table. And it will have tungsten engine off heap

Re: Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Just to be clear, I'm referring to having Spark reading from a file, not from a Hive table. And it will have Tungsten engine off-heap serialization after 2.1, so if it was a test with something like 1.6.3 it won't be as helpful. On Mon, Aug 28, 2017 at 10:50 AM, Michael Artz

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Georg Heiler
What about the rdd stat counter? https://spark.apache.org/docs/0.6.2/api/core/spark/util/StatCounter.html Patrick schrieb am Mo. 28. Aug. 2017 um 16:47: > Hi > > I have two lists: > > >- List one: contains names of columns on which I want to do aggregate >
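
For the RDD route, stats() on an RDD[Double] returns a StatCounter in a single pass; a sketch assuming df has a numeric column named colA (a placeholder):

  val stats = df.select("colA").rdd.map(_.getDouble(0)).stats()
  println(s"count=${stats.count} mean=${stats.mean} stdev=${stats.stdev} " +
    s"min=${stats.min} max=${stats.max}")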

from_json()

2017-08-28 Thread JG Perrin
Is there a way to not have to specify a schema when using from_json(), or to infer the schema? When you read a JSON doc from disk, you can infer the schema. Should I write it to disk before (ouch)? jg
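
One workaround sketch (there is no built-in infer option for from_json as far as I know): let spark.read.json infer the schema from the string column itself and reuse it. The Dataset[String] overload needs Spark 2.2+, and df / json_col are assumptions:

  import org.apache.spark.sql.functions.{col, from_json}
  import spark.implicits._

  // Runs an extra job over the column to infer the schema.
  val inferredSchema = spark.read.json(df.select("json_col").as[String]).schema
  val parsed = df.withColumn("parsed", from_json(col("json_col"), inferredSchema))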

Re: apache thrift server

2017-08-28 Thread MidwestMike
I had some similar problems with the Simba driver before. I was using the ODBC one, but make sure your config looks like this page. https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/spark/simbaOdbcDriverConfigWindows.html Notice selecting the authorization mechanism of

Help taking last value in each group (aggregates)

2017-08-28 Thread Everett Anderson
Hi, I'm struggling a little with some unintuitive behavior with the Scala API. (Spark 2.0.2) I wrote something like df.orderBy("a", "b") .groupBy("group_id") .agg(sum("col_to_sum").as("total"), last("row_id").as("last_row_id")) and expected a result with a unique group_id column, a

RE: add me to email list

2017-08-28 Thread JG Perrin
Hey Mike, You need to do it yourself, it’s really easy: http://spark.apache.org/community.html. hih jg From: Michael Artz [mailto:michaelea...@gmail.com] Sent: Monday, August 28, 2017 7:43 AM To: user@spark.apache.org Subject: add me to email list Hi, Please add me to the email list Mike

Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Hi, There doesn't seem to be any good source that answers the question of whether Hive, as an SQL-on-Hadoop engine, is just as fast as Spark SQL now. I just want to know if there has been a comparison done lately of HiveQL vs Spark SQL on Spark versions 2.1 or later. I have a large ETL process, with many table joins and

Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
Hi I have two lists: - List one: contains names of columns on which I want to do aggregate operations. - List two: contains the aggregate operations which I want to perform on each column, e.g. (min, max, mean) I am trying to use spark 2.0 dataset to achieve this. Spark provides
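
One way to drive a single agg() call from the two lists with the untyped API; the column names and chosen functions below are placeholders, and df is the Dataset being analysed:

  import org.apache.spark.sql.Column
  import org.apache.spark.sql.functions.{col, max, mean, min}

  val columns = Seq("colA", "colB")
  val aggFuncs: Seq[(String, Column => Column)] =
    Seq(("min", c => min(c)), ("max", c => max(c)), ("mean", c => mean(c)))

  // One aliased expression per (column, function) pair, all in one pass.
  val aggExprs: Seq[Column] = for {
    c <- columns
    (name, f) <- aggFuncs
  } yield f(col(c)).as(s"${name}_$c")

  val result = df.agg(aggExprs.head, aggExprs.tail: _*)

The single result row can then be collected and turned into a map keyed by the alias names.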

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread Cody Koeninger
So if you can run with cache enabled for some time, does that significantly affect the performance issue you were seeing? Those settings seem reasonable enough. If preferred locations is behaving correctly you shouldn't need cached consumers for all 96 partitions on any one executor, so that

Re: Kafka Consumer Pre Fetch Messages + Async commits

2017-08-28 Thread Cody Koeninger
1. No, prefetched message offsets aren't exposed. 2. No, I'm not aware of any plans for sync commit, and I'm not sure that makes sense. You have to be able to deal with repeat messages in the event of failure in any case, so the only difference sync commit would make would be (possibly) slower
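
For reference, the async commit in the 0.10 integration follows the offset-range pattern below; stream stands for the InputDStream returned by KafkaUtils.createDirectStream:

  import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

  stream.foreachRDD { rdd =>
    // Capture this batch's offsets before any shuffle, process, then commit.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process the batch ...
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }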

Spark 2.2 structured streaming with mapGroupsWithState + window functions

2017-08-28 Thread daniel williams
Hi all, I've been looking heavily into Spark 2.2 to solve a problem I have by specifically using mapGroupsWithState. What I've discovered is that a *groupBy(window(..))* does not work when used with a subsequent *mapGroupsWithState* and produces an AnalysisException of:
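
For comparison, the shape that does work in 2.2 is groupByKey followed by mapGroupsWithState, with any windowing handled inside the state function; Event, UserState and events are hypothetical, and the query then has to be started in update output mode:

  import org.apache.spark.sql.streaming.GroupState
  import spark.implicits._

  case class Event(user: String, value: Long)
  case class UserState(total: Long)

  // events is assumed to be a streaming Dataset[Event].
  val updates = events
    .groupByKey(_.user)
    .mapGroupsWithState[UserState, (String, Long)] {
      (user: String, rows: Iterator[Event], state: GroupState[UserState]) =>
        val total = state.getOption.map(_.total).getOrElse(0L) + rows.map(_.value).sum
        state.update(UserState(total))
        (user, total)
    }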

Predicate Pushdown Doesn't Work With Data Source API

2017-08-28 Thread Anton Puzanov
Hi everyone, I am trying to improve the performance of data loading from disk. For that I have implemented my own RDD and now I am trying to increase the performance with predicate pushdown. I have used many sources including the documentation and
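
For the Data Source API (v1) route, the hook that actually receives pushed-down predicates is PrunedFilteredScan on the relation; a minimal sketch with the scan itself stubbed out:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
  import org.apache.spark.sql.types.StructType

  class MyRelation(override val sqlContext: SQLContext, override val schema: StructType)
    extends BaseRelation with PrunedFilteredScan {

    // Spark passes in the columns it needs and the filters it managed to push
    // down; filters the relation ignores are re-applied by Spark after the scan.
    override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
      sqlContext.sparkContext.emptyRDD[Row]  // stub; a real scan would use them
  }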

Time window on Processing Time

2017-08-28 Thread madhu phatak
Hi, As I am playing with structured streaming, I observed that the window function always requires a time column in the input data, which means it's event time. Is it possible to do old Spark Streaming-style window functions based on processing time? I don't see any documentation on the same. -- Regards,
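
One workaround sketch: stamp each row with the wall-clock time as it is processed and window on that column instead of an event-time field; inputStream stands for a streaming DataFrame obtained from readStream:

  import org.apache.spark.sql.functions.{col, current_timestamp, window}

  val counts = inputStream
    .withColumn("proc_time", current_timestamp())
    .groupBy(window(col("proc_time"), "10 seconds"))
    .count()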

add me to email list

2017-08-28 Thread Michael Artz
Hi, Please add me to the email list Mike

Re: Oracle Table not resolved [Spark 2.1.1]

2017-08-28 Thread Naga G
Not able to find the database name. Is ora the database in the below url? Sent from Naga iPad > On Aug 28, 2017, at 4:06 AM, Imran Rajjad wrote: > > Hello, > > I am trying to retrieve an oracle table into Dataset using following code > > String url =

Oracle Table not resolved [Spark 2.1.1]

2017-08-28 Thread Imran Rajjad
Hello, I am trying to retrieve an oracle table into Dataset using following code String url = "jdbc:oracle@localhost:1521:ora"; Dataset jdbcDF = spark.read() .format("jdbc") .option("driver", "oracle.jdbc.driver.OracleDriver") .option("url", url) .option("dbtable",
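
For reference, a sketch of the same read with the thin-driver URL form (jdbc:oracle:thin:@host:port:SID); the replies in this thread only note that the posted URL is invalid, so the form below is an assumption, and the table name and credentials are placeholders:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  val url = "jdbc:oracle:thin:@localhost:1521:ora"  // note the ":thin:" part

  val jdbcDF = spark.read
    .format("jdbc")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("url", url)
    .option("dbtable", "MY_SCHEMA.MY_TABLE")
    .option("user", "scott")
    .option("password", "tiger")
    .load()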