The JDBC URL is invalid, but strangely it should have thrown an ORA- exception.
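For reference, the Oracle thin-driver URL usually takes the form jdbc:oracle:thin:@<host>:<port>:<SID>. A minimal sketch using the host, port, and SID from the post below (the table name is a placeholder, not from the thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-read").getOrCreate()

// Same read as in the post, but with the thin-driver prefix in the URL
val url = "jdbc:oracle:thin:@localhost:1521:ora"
val jdbcDF = spark.read
  .format("jdbc")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("url", url)
  .option("dbtable", "MY_TABLE")   // placeholder table name
  .load()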
On Mon, Aug 28, 2017 at 4:55 PM, Naga G wrote:
> Not able to find the database name.
> ora is the database in the below url ?
>
> Sent from Naga iPad
>
> > On Aug 28, 2017, at 4:06 AM, Imran Rajjad
You can create a UDF which will invoke your Java lib:
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

def calculateExpense: UserDefinedFunction = udf((pexpense: String, cexpense: String) =>
  new MyJava().calculateExpense(pexpense.toDouble, cexpense.toDouble))
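A usage sketch, assuming the question's DataFrame is called df and has the pexpense and cexpense string columns:

import org.apache.spark.sql.functions.col

// Adds the computed column; the UDF handles the String -> Double conversion
val result = df.withColumn("expense", calculateExpense(col("pexpense"), col("cexpense")))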
On Tue, Aug 29, 2017 at 6:53 AM, purna pradeep
wrote:
>
I have data in a DataFrame with the below columns:
1) The file format is CSV.
2) All the below columns' datatypes are String:
employeeid, pexpense, cexpense
Now I need to create a new DataFrame which has a new column called `expense`,
calculated from the columns `pexpense` and `cexpense`.
The tricky part is
Hi Cody,
The following is the way I am consuming data for a 60-second batch. Do you
see anything wrong with the way the data is being consumed that could cause
slowness in performance?
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> kafkaBrokers,
Thanks for responding, but I would not be reading from a file if it were Hive.
I'm comparing Hive LLAP on a Hive table vs Spark SQL on a file. That
is the question.
Thanks
On Mon, Aug 28, 2017 at 1:58 PM, Imran Rajjad wrote:
> If reading directly from file then Spark SQL
I'm still unclear on whether orderBy/groupBy + aggregates is a viable approach,
or when one could rely on the last or first aggregate functions, but a
working alternative is to use window functions with row_number and a filter,
kind of like this:
import spark.implicits._
val reverseOrdering = Seq("a",
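A fuller sketch of that pattern, with placeholder DataFrame and column names (a descending sort on "a" and "b" within each group, keeping the top row per group):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val w = Window.partitionBy("group_id").orderBy(col("a").desc, col("b").desc)

val lastPerGroup = df
  .withColumn("rn", row_number().over(w))   // rank rows within each group
  .filter(col("rn") === 1)                  // keep only the first row per group
  .drop("rn")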
Thanks, Vadim. The issue is not access to the logs; I am able to view them.
I have the CloudWatch Logs agent push logs to Elasticsearch and then into Kibana,
using json-event-layout for the log4j output. I would also like to log
applicationId, executorId, etc. in those log statements for clarity. Is there an
When you create an EMR cluster you can specify an S3 path where logs will be
saved after the cluster terminates, something like this:
s3://bucket/j-18ASDKLJLAKSD/containers/application_1494074597524_0001/container_1494074597524_0001_01_01/stderr.gz
Does anyone have a working solution for logging the YARN application ID, YARN
container hostname, executor ID, and YARN attempt in log statements for jobs
running on Spark EMR 5.7.0? Are there specific environment variables available,
or some other workflow for doing that?
Thank you
Alex
I am running a Spark Streaming application on a cluster composed of three
nodes, each one with a worker and three executors (so a total of 9
executors). I am using Spark standalone mode.
The application is run with a spark-submit command with the option
--deploy-mode client. The submit command is
There is no difference in performance even with cache enabled.
On Mon, Aug 28, 2017 at 11:27 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> There is no difference in performance even with Cache being disabled.
>
> On Mon, Aug 28, 2017 at 7:43 AM, Cody Koeninger
OK, I see there is a describe() function which does the stat calculations on a
Dataset, similar to StatCounter. However, I don't want to restrict my
aggregations to the standard mean, stddev, etc.; I want to generate some custom
stats, and may also not run all the predefined stats but only a subset of them on the
I didn't tailor it to your needs, but this is what I can offer you; the idea
should be pretty clear:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}
val spark: SparkSession
import spark.implicits._
case class Input(
a: Int,
b: Long,
c:
RDD only.
Patrick wrote on Mon, 28 Aug 2017 at 20:13:
> Ah, does it work with Dataset API or i need to convert it to RDD first ?
>
> On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler
> wrote:
>
>> What about the rdd stat counter?
>>
Thanks Sam – this might be the solution. I will investigate!
From: Sam Elamin [mailto:hussam.ela...@gmail.com]
Sent: Monday, August 28, 2017 1:14 PM
To: JG Perrin
Cc: user@spark.apache.org
Subject: Re: from_json()
Hi jg,
Perhaps I am misunderstanding you, but if you just
There is no difference in performance even with cache disabled.
On Mon, Aug 28, 2017 at 7:43 AM, Cody Koeninger wrote:
> So if you can run with cache enabled for some time, does that
> significantly affect the performance issue you were seeing?
>
> Those settings seem
Hi jg,
Perhaps I am misunderstanding you, but if you just want to create a new
schema from a df, it's fairly simple, assuming you have a schema already
predefined or in a string, i.e.
val newSchema = DataType.fromJson(json_schema_string)
then all you need to do is re-create the dataframe using
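A sketch of that, assuming json_schema_string holds a schema previously serialized with df.schema.json (df and spark stand for the DataFrame and SparkSession from the thread):

import org.apache.spark.sql.types.{DataType, StructType}

// Rebuild the StructType from its JSON representation
val newSchema = DataType.fromJson(json_schema_string).asInstanceOf[StructType]

// Re-create the DataFrame with the parsed schema
val newDf = spark.createDataFrame(df.rdd, newSchema)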
Ah, does it work with the Dataset API, or do I need to convert it to an RDD first?
On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler
wrote:
> What about the rdd stat counter? https://spark.apache.org/docs/
> 0.6.2/api/core/spark/util/StatCounter.html
>
> Patrick
If reading directly from a file, then Spark SQL should be your choice.
On Mon, Aug 28, 2017 at 10:25 PM Michael Artz
wrote:
> Just to be clear, I'm referring to having Spark reading from a file, not
> from a Hive table. And it will have tungsten engine off heap
Just to be clear, I'm referring to having Spark read from a file, not
from a Hive table. And it has Tungsten off-heap serialization
after 2.1, so if it was a test with something like 1.6.3 it won't be as helpful.
On Mon, Aug 28, 2017 at 10:50 AM, Michael Artz
What about the RDD StatCounter?
https://spark.apache.org/docs/0.6.2/api/core/spark/util/StatCounter.html
Patrick wrote on Mon, 28 Aug 2017 at 16:47:
> Hi
>
> I have two lists:
>
>
>- List one: contains names of columns on which I want to do aggregate
>
Is there a way to not have to specify a schema when using from_json(), or to
infer the schema? When you read a JSON doc from disk, you can infer the schema.
Should I write it to disk first (ouch)?
jg
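One way to avoid the round trip to disk, sketched below under the assumption that the JSON strings live in a string column (called payload here as a placeholder), is to infer the schema from a sample and reuse it in from_json:

import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

// Infer the schema from a sample of the JSON strings
val sampleJson = df.select(col("payload")).as[String].limit(1000).rdd
val inferredSchema = spark.read.json(sampleJson).schema

// Reuse the inferred schema when parsing the column
val parsed = df.withColumn("parsed", from_json(col("payload"), inferredSchema))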
I had some similar problems with the Simba driver before. I was using the
ODBC one, but make sure your config looks like the one on this page:
https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/spark/simbaOdbcDriverConfigWindows.html
Notice the selection of the authorization mechanism of
Hi,
I'm struggling a little with some unintuitive behavior with the Scala API.
(Spark 2.0.2)
I wrote something like
df.orderBy("a", "b")
.groupBy("group_id")
.agg(sum("col_to_sum").as("total"),
last("row_id").as("last_row_id")))
and expected a result with a unique group_id column, a
Hey Mike,
You need to do it yourself, it’s really easy:
http://spark.apache.org/community.html.
hih
jg
From: Michael Artz [mailto:michaelea...@gmail.com]
Sent: Monday, August 28, 2017 7:43 AM
To: user@spark.apache.org
Subject: add me to email list
Hi,
Please add me to the email list
Mike
Hi,
There doesn't seem to be any good source answering whether Hive, as a
SQL-on-Hadoop engine, is just as fast as Spark SQL now. I just want to know if
there has been a comparison done lately between HiveQL and Spark SQL on Spark
versions 2.1 or later. I have a large ETL process, with many table joins,
and
Hi
I have two lists:
- List one: contains the names of the columns on which I want to do aggregate
operations.
- List two: contains the aggregate operations which I want to perform on each
column, e.g. (min, max, mean).
I am trying to use the Spark 2.0 Dataset API to achieve this. Spark provides
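A minimal sketch of one way to do that with the DataFrame API (df, cols, and aggFuncs below are placeholder names): build one aggregate expression per (column, function) pair and pass them all to agg.

import org.apache.spark.sql.functions.expr

val cols = Seq("colA", "colB")              // placeholder: list one, the column names
val aggFuncs = Seq("min", "max", "mean")    // placeholder: list two, the aggregate operations

// One expression per (column, function) pair, e.g. min(colA) aliased as min_colA
val exprs = for { c <- cols; f <- aggFuncs } yield expr(s"$f($c)").as(s"${f}_$c")

val stats = df.agg(exprs.head, exprs.tail: _*)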
So if you can run with cache enabled for some time, does that
significantly affect the performance issue you were seeing?
Those settings seem reasonable enough. If preferred locations is
behaving correctly you shouldn't need cached consumers for all 96
partitions on any one executor, so that
1. No, prefetched message offsets aren't exposed.
2. No, I'm not aware of any plans for sync commit, and I'm not sure
that makes sense. You have to be able to deal with repeat messages in
the event of failure in any case, so the only difference sync commit
would make would be (possibly) slower
Hi all,
I've been looking heavily into Spark 2.2 to solve a problem I have by
specifically using mapGroupsWithState. What I've discovered is that a
*groupBy(window(..))* does not work when being used with a subsequent
*mapGroupsWithState*, and produces an AnalysisException of:
Hi everyone,
I am trying to improve the performance of data loading from disk. For that
I have implemented my own RDD and now I am trying to increase the
performance with predicate pushdown.
I have used many sources, including the documentation, and
Hi,
As I am playing with Structured Streaming, I observed that the window function
always requires a time column in the input data, which means it is event time.
Is it possible to do the old Spark Streaming style of window, based on
processing time? I don't see any documentation on this.
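One workaround that is sometimes suggested, sketched below with a placeholder streaming DataFrame, is to stamp each row with current_timestamp() as it is processed and window on that column; it approximates processing-time windows rather than being a true equivalent:

import org.apache.spark.sql.functions.{col, current_timestamp, window}

// Tag each row with the wall-clock time at processing and window on it
val withProcTime = streamingDf.withColumn("proc_time", current_timestamp())

val counts = withProcTime
  .groupBy(window(col("proc_time"), "60 seconds"))
  .count()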
--
Regards,
Hi,
Please add me to the email list
Mike
I am not able to find the database name.
Is ora the database in the below URL?
Sent from Naga iPad
> On Aug 28, 2017, at 4:06 AM, Imran Rajjad wrote:
>
> Hello,
>
> I am trying to retrieve an oracle table into Dataset using following code
>
> String url =
Hello,
I am trying to retrieve an Oracle table into a Dataset using the following code:
String url = "jdbc:oracle@localhost:1521:ora";
Dataset<Row> jdbcDF = spark.read()
.format("jdbc")
.option("driver", "oracle.jdbc.driver.OracleDriver")
.option("url", url)
.option("dbtable",