Re: Oracle Table not resolved [Spark 2.1.1]

2017-08-28 Thread Imran Rajjad
The jdbc url is invalid, but strangely it should have thrown an ORA- exception. On Mon, Aug 28, 2017 at 4:55 PM, Naga G wrote: > Not able to find the database name. > Is ora the database in the below url? > > Sent from Naga iPad > > > On Aug 28, 2017, at 4:06 AM, Imran Rajjad

Re: use WithColumn with external function in a java jar

2017-08-28 Thread Praneeth Gayam
You can create a UDF which will invoke your java lib def calculateExpense: UserDefinedFunction = udf((pexpense: String, cexpense: String) => new MyJava().calculateExpense(pexpense.toDouble, cexpense.toDouble)) On Tue, Aug 29, 2017 at 6:53 AM, purna pradeep wrote: >
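
A minimal self-contained sketch of the approach Praneeth describes; MyJava and its calculateExpense body are stand-ins for the class shipped in the external jar, and the column names come from the original question:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{col, udf}

  // Stand-in for the class shipped in the external java jar (hypothetical body).
  class MyJava { def calculateExpense(p: Double, c: Double): Double = p + c }

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  // Columns from the original question; everything arrives as String from the csv.
  val df = Seq(("e1", "100.0", "50.0")).toDF("employeeid", "pexpense", "cexpense")

  // Wrap the java call in a UDF and add the derived column.
  val calculateExpense = udf((pexpense: String, cexpense: String) =>
    new MyJava().calculateExpense(pexpense.toDouble, cexpense.toDouble))

  val withExpense = df.withColumn("expense", calculateExpense(col("pexpense"), col("cexpense")))
  withExpense.show()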

use WithColumn with external function in a java jar

2017-08-28 Thread purna pradeep
I have data in a DataFrame with the below columns. 1) File format is csv. 2) All below column datatypes are String: employeeid, pexpense, cexpense. Now I need to create a new DataFrame which has a new column called `expense`, which is calculated based on columns `pexpense`, `cexpense`. The tricky part is

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread swetha kasireddy
Hi Cody, Following is the way that I am consuming data for a 60 second batch. Do you see anything that is wrong with the way the data is getting consumed that can cause slowness in performance? val kafkaParams = Map[String, Object]( "bootstrap.servers" -> kafkaBrokers,
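
For context, a typical parameter map for the 0.10 direct stream looks roughly like this; the broker list and group id are placeholders, not the poster's actual settings:

  import org.apache.kafka.common.serialization.StringDeserializer

  val kafkaBrokers = "broker1:9092,broker2:9092"  // placeholder

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> kafkaBrokers,
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "streaming-consumer-group",
    "auto.offset.reset" -> "latest",
    // manual commits are the usual choice with the direct stream
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )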

Re: Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Thanks for responding BUT I would not be reading from a file if it was Hive. I'm comparing Hive LLAP from a hive table vs Spark SQL from a file. That is the question. Thanks On Mon, Aug 28, 2017 at 1:58 PM, Imran Rajjad wrote: > If reading directly from file then Spark SQL

Re: Help taking last value in each group (aggregates)

2017-08-28 Thread Everett Anderson
I'm still unclear on if orderBy/groupBy + aggregates is a viable approach or when one could rely on the last or first aggregate functions, but a working alternative is to use window functions with row_number and a filter kind of like this: import spark.implicits._ val reverseOrdering = Seq("a",
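
A small self-contained sketch of that row_number-plus-filter pattern, reusing the column names from the thread (the sample data is made up):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, row_number}

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  // Toy frame reusing the column names from the thread.
  val df = Seq(
    ("g1", 1, 1, "r1"), ("g1", 2, 2, "r2"), ("g2", 1, 5, "r3")
  ).toDF("group_id", "a", "b", "row_id")

  // Rank rows inside each group by the reverse of the desired ordering,
  // then keep only the first row per group, i.e. the "last" value.
  val reverseWindow = Window.partitionBy("group_id").orderBy(col("a").desc, col("b").desc)

  val lastPerGroup = df
    .withColumn("rn", row_number().over(reverseWindow))
    .filter(col("rn") === 1)
    .drop("rn")

  lastPerGroup.show()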

Re: Referencing YARN application id, YARN container hostname, Executor ID and YARN attempt for jobs running on Spark EMR 5.7.0 in log statements?

2017-08-28 Thread Mikhailau, Alex
Thanks, Vadim. The issue is not access to logs. I am able to view them. I have cloudwatch logs agent push logs to elasticsearch and then into Kibana using json-event-layout for log4j output. I would like to also log applicationId, executorId, etc in those log statements for clarity. Is there an
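
One possible (unverified) way to get those values into the log4j output is to put them into the logging MDC at the start of each task and reference them from the layout pattern with %X{...}; executorId and spark.app.id are reachable through SparkEnv. This is only a sketch, not an EMR-provided mechanism, and rdd stands for whatever work the job does:

  import org.apache.spark.{SparkEnv, TaskContext}
  import org.slf4j.MDC

  rdd.mapPartitions { iter =>
    // Tag the task's logging context; a pattern such as
    // %X{applicationId} %X{executorId} %X{attemptNumber} then picks these up.
    MDC.put("applicationId", SparkEnv.get.conf.get("spark.app.id"))
    MDC.put("executorId", SparkEnv.get.executorId)
    MDC.put("attemptNumber", TaskContext.get.attemptNumber.toString)
    iter
  }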

Re: Referencing YARN application id, YARN container hostname, Executor ID and YARN attempt for jobs running on Spark EMR 5.7.0 in log statements?

2017-08-28 Thread Vadim Semenov
When you create an EMR cluster you can specify an S3 path where logs will be saved, something like this: s3://bucket/j-18ASDKLJLAKSD/containers/application_1494074597524_0001/container_1494074597524_0001_01_01/stderr.gz

Referencing YARN application id, YARN container hostname, Executor ID and YARN attempt for jobs running on Spark EMR 5.7.0 in log statements?

2017-08-28 Thread Mikhailau, Alex
Does anyone have a working solution for logging YARN application id, YARN container hostname, Executor ID and YARN attempt for jobs running on Spark EMR 5.7.0 in log statements? Are there specific ENV variables available or other workflow for doing that? Thank you Alex

[Spark Streaming] Application is stopped after stopping a worker

2017-08-28 Thread Davide.Mandrini
I am running a spark streaming application on a cluster composed of three nodes, each one with a worker and three executors (so a total of 9 executors). I am using spark standalone mode. The application is run with a spark-submit command with option --deploy-mode client. The submit command is

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread swetha kasireddy
There is no difference in performance even with Cache being enabled. On Mon, Aug 28, 2017 at 11:27 AM, swetha kasireddy < swethakasire...@gmail.com> wrote: > There is no difference in performance even with Cache being disabled. > > On Mon, Aug 28, 2017 at 7:43 AM, Cody Koeninger

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
OK, I see there is a describe() function which does the stat calculation on a Dataset, similar to StatCounter. However, I don't want to restrict my aggregations to the standard mean, stddev etc.; I want to generate some custom stats, or may not run all the predefined stats but only a subset of them on the

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Vadim Semenov
I didn't tailor it to your needs, but this is what I can offer you, the idea should be pretty clear import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions.{collect_list, struct} val spark: SparkSession import spark.implicits._ case class Input( a: Int, b: Long, c:
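
The preview cuts the snippet off; a self-contained sketch of what the collect_list/struct idea looks like (Input and the sample rows are hypothetical):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{collect_list, struct}

  case class Input(a: Int, b: Long, c: Double)

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val ds = Seq(Input(1, 10L, 1.0), Input(1, 20L, 2.0), Input(2, 30L, 3.0)).toDS()

  // Collect each group's rows into an array of structs; custom statistics can
  // then be computed from that array per group.
  val grouped = ds
    .groupBy($"a")
    .agg(collect_list(struct($"b", $"c")).as("rows"))

  grouped.show(truncate = false)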

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Georg Heiler
Rdd only Patrick schrieb am Mo. 28. Aug. 2017 um 20:13: > Ah, does it work with Dataset API or i need to convert it to RDD first ? > > On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler > wrote: > >> What about the rdd stat counter? >>

RE: from_json()

2017-08-28 Thread JG Perrin
Thanks Sam – this might be the solution. I will investigate! From: Sam Elamin [mailto:hussam.ela...@gmail.com] Sent: Monday, August 28, 2017 1:14 PM To: JG Perrin Cc: user@spark.apache.org Subject: Re: from_json() Hi jg, Perhaps I am misunderstanding you, but if you just

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread swetha kasireddy
There is no difference in performance even with Cache being disabled. On Mon, Aug 28, 2017 at 7:43 AM, Cody Koeninger wrote: > So if you can run with cache enabled for some time, does that > significantly affect the performance issue you were seeing? > > Those settings seem

Re: from_json()

2017-08-28 Thread Sam Elamin
Hi jg, Perhaps I am misunderstanding you, but if you just want to create a new schema from a df its fairly simple, assuming you have a schema already predefined or in a string. i.e. val newSchema = DataType.fromJson(json_schema_string) then all you need to do is re-create the dataframe using
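
A sketch of that schema round-trip; json_schema_string, df and the json_col column are assumptions for illustration:

  import org.apache.spark.sql.functions.{col, from_json}
  import org.apache.spark.sql.types.{DataType, StructType}

  // A schema serialized earlier, e.g. via someDataFrame.schema.json
  val json_schema_string =
    """{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}}]}"""
  val newSchema = DataType.fromJson(json_schema_string).asInstanceOf[StructType]

  // Parse a string column of JSON documents with the rebuilt schema.
  val parsed = df.withColumn("parsed", from_json(col("json_col"), newSchema))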

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
Ah, does it work with the Dataset API or do I need to convert it to an RDD first? On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler wrote: > What about the rdd stat counter? https://spark.apache.org/docs/ > 0.6.2/api/core/spark/util/StatCounter.html > > Patrick

Re: Spark SQL vs HiveQL

2017-08-28 Thread Imran Rajjad
If reading directly from file then Spark SQL should be your choice On Mon, Aug 28, 2017 at 10:25 PM Michael Artz wrote: > Just to be clear, I'm referring to having Spark reading from a file, not > from a Hive table. And it will have tungsten engine off heap

Re: Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Just to be clear, I'm referring to having Spark reading from a file, not from a Hive table. And it will have Tungsten engine off-heap serialization after 2.1, so if it was a test with something like 1.6.3 it won't be as helpful. On Mon, Aug 28, 2017 at 10:50 AM, Michael Artz

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Georg Heiler
What about the rdd stat counter? https://spark.apache.org/docs/0.6.2/api/core/spark/util/StatCounter.html Patrick schrieb am Mo. 28. Aug. 2017 um 16:47: > Hi > > I have two lists: > > >- List one: contains names of columns on which I want to do aggregate >
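
For the RDD route, stats() on an RDD[Double] returns a StatCounter in a single pass; a sketch assuming df has a numeric column named colA (a placeholder):

  val stats = df.select("colA").rdd.map(_.getDouble(0)).stats()
  println(s"count=${stats.count} mean=${stats.mean} stdev=${stats.stdev} " +
    s"min=${stats.min} max=${stats.max}")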

from_json()

2017-08-28 Thread JG Perrin
Is there a way to not have to specify a schema when using from_json(), or to infer the schema? When you read a JSON doc from disk, you can infer the schema. Should I write it to disk before (ouch)? jg
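
One workaround sketch (there is no built-in infer option for from_json as far as I know): let spark.read.json infer the schema from the string column itself and reuse it. The Dataset[String] overload needs Spark 2.2+, and df / json_col are assumptions:

  import org.apache.spark.sql.functions.{col, from_json}
  import spark.implicits._

  // Runs an extra job over the column to infer the schema.
  val inferredSchema = spark.read.json(df.select("json_col").as[String]).schema
  val parsed = df.withColumn("parsed", from_json(col("json_col"), inferredSchema))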

Re: apache thrift server

2017-08-28 Thread MidwestMike
I had some similar problems with the Simba driver before. I was using the ODBC one, but make sure your config looks like this page. https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/spark/simbaOdbcDriverConfigWindows.html Notice selecting the authorization mechanism of

Help taking last value in each group (aggregates)

2017-08-28 Thread Everett Anderson
Hi, I'm struggling a little with some unintuitive behavior with the Scala API. (Spark 2.0.2) I wrote something like df.orderBy("a", "b") .groupBy("group_id") .agg(sum("col_to_sum").as("total"), last("row_id").as("last_row_id")) and expected a result with a unique group_id column, a

RE: add me to email list

2017-08-28 Thread JG Perrin
Hey Mike, You need to do it yourself, it’s really easy: http://spark.apache.org/community.html. hih jg From: Michael Artz [mailto:michaelea...@gmail.com] Sent: Monday, August 28, 2017 7:43 AM To: user@spark.apache.org Subject: add me to email list Hi, Please add me to the email list Mike

Spark SQL vs HiveQL

2017-08-28 Thread Michael Artz
Hi, There doesn't seem to be any good source that answers the question of whether Hive, as an SQL-on-Hadoop engine, is just as fast as Spark SQL now. I just want to know if there has been a comparison done lately of HiveQL vs Spark SQL on Spark versions 2.1 or later. I have a large ETL process, with many table joins and

Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
Hi I have two lists: - List one: contains names of columns on which I want to do aggregate operations. - List two: contains the aggregate operations which I want to perform on each column, e.g. (min, max, mean) I am trying to use spark 2.0 dataset to achieve this. Spark provides
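
One way to drive a single agg() call from the two lists with the untyped API; the column names and chosen functions below are placeholders, and df is the Dataset being analysed:

  import org.apache.spark.sql.Column
  import org.apache.spark.sql.functions.{col, max, mean, min}

  val columns = Seq("colA", "colB")
  val aggFuncs: Seq[(String, Column => Column)] =
    Seq(("min", c => min(c)), ("max", c => max(c)), ("mean", c => mean(c)))

  // One aliased expression per (column, function) pair, all in one pass.
  val aggExprs: Seq[Column] = for {
    c <- columns
    (name, f) <- aggFuncs
  } yield f(col(c)).as(s"${name}_$c")

  val result = df.agg(aggExprs.head, aggExprs.tail: _*)

The single result row can then be collected and turned into a map keyed by the alias names.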

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread Cody Koeninger
So if you can run with cache enabled for some time, does that significantly affect the performance issue you were seeing? Those settings seem reasonable enough. If preferred locations is behaving correctly you shouldn't need cached consumers for all 96 partitions on any one executor, so that

Re: Kafka Consumer Pre Fetch Messages + Async commits

2017-08-28 Thread Cody Koeninger
1. No, prefetched message offsets aren't exposed. 2. No, I'm not aware of any plans for sync commit, and I'm not sure that makes sense. You have to be able to deal with repeat messages in the event of failure in any case, so the only difference sync commit would make would be (possibly) slower
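
For reference, the async commit in the 0.10 integration follows the offset-range pattern below; stream stands for the InputDStream returned by KafkaUtils.createDirectStream:

  import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

  stream.foreachRDD { rdd =>
    // Capture this batch's offsets before any shuffle, process, then commit.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process the batch ...
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }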

Spark 2.2 structured streaming with mapGroupsWithState + window functions

2017-08-28 Thread daniel williams
Hi all, I've been looking heavily into Spark 2.2 to solve a problem I have by specifically using mapGroupsWithState. What I've discovered is that a *groupBy(window(..))* does not work when used with a subsequent *mapGroupsWithState* and produces an AnalysisException of:
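
For comparison, the shape that does work in 2.2 is groupByKey followed by mapGroupsWithState, with any windowing handled inside the state function; Event, UserState and events are hypothetical, and the query then has to be started in update output mode:

  import org.apache.spark.sql.streaming.GroupState
  import spark.implicits._

  case class Event(user: String, value: Long)
  case class UserState(total: Long)

  // events is assumed to be a streaming Dataset[Event].
  val updates = events
    .groupByKey(_.user)
    .mapGroupsWithState[UserState, (String, Long)] {
      (user: String, rows: Iterator[Event], state: GroupState[UserState]) =>
        val total = state.getOption.map(_.total).getOrElse(0L) + rows.map(_.value).sum
        state.update(UserState(total))
        (user, total)
    }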

Predicate Pushdown Doesn't Work With Data Source API

2017-08-28 Thread Anton Puzanov
Hi everyone, I am trying to improve the performance of data loading from disk. For that I have implemented my own RDD and now I am trying to increase the performance with predicate pushdown. I have used many sources including the documentation and
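
For the Data Source API (v1) route, the hook that actually receives pushed-down predicates is PrunedFilteredScan on the relation; a minimal sketch with the scan itself stubbed out:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
  import org.apache.spark.sql.types.StructType

  class MyRelation(override val sqlContext: SQLContext, override val schema: StructType)
    extends BaseRelation with PrunedFilteredScan {

    // Spark passes in the columns it needs and the filters it managed to push
    // down; filters the relation ignores are re-applied by Spark after the scan.
    override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
      sqlContext.sparkContext.emptyRDD[Row]  // stub; a real scan would use them
  }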

Time window on Processing Time

2017-08-28 Thread madhu phatak
Hi, As I am playing with structured streaming, I observed that the window function always requires a time column in the input data, which means it's event time. Is it possible to do old Spark Streaming-style window functions based on processing time? I don't see any documentation on the same. -- Regards,
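
One workaround sketch: stamp each row with the wall-clock time as it is processed and window on that column instead of an event-time field; inputStream stands for a streaming DataFrame obtained from readStream:

  import org.apache.spark.sql.functions.{col, current_timestamp, window}

  val counts = inputStream
    .withColumn("proc_time", current_timestamp())
    .groupBy(window(col("proc_time"), "10 seconds"))
    .count()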

add me to email list

2017-08-28 Thread Michael Artz
Hi, Please add me to the email list Mike

Re: Oracle Table not resolved [Spark 2.1.1]

2017-08-28 Thread Naga G
Not able to find the database name. Is ora the database in the below url? Sent from Naga iPad > On Aug 28, 2017, at 4:06 AM, Imran Rajjad wrote: > > Hello, > > I am trying to retrieve an oracle table into Dataset using following code > > String url =

Oracle Table not resolved [Spark 2.1.1]

2017-08-28 Thread Imran Rajjad
Hello, I am trying to retrieve an oracle table into Dataset using following code String url = "jdbc:oracle@localhost:1521:ora"; Dataset jdbcDF = spark.read() .format("jdbc") .option("driver", "oracle.jdbc.driver.OracleDriver") .option("url", url) .option("dbtable",
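
For reference, a sketch of the same read with the thin-driver URL form (jdbc:oracle:thin:@host:port:SID); the replies in this thread only note that the posted URL is invalid, so the form below is an assumption, and the table name and credentials are placeholders:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  val url = "jdbc:oracle:thin:@localhost:1521:ora"  // note the ":thin:" part

  val jdbcDF = spark.read
    .format("jdbc")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("url", url)
    .option("dbtable", "MY_SCHEMA.MY_TABLE")
    .option("user", "scott")
    .option("password", "tiger")
    .load()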