Hi, I think it is related to this issue [Adaptive execution in Spark]:
https://issues.apache.org/jira/browse/SPARK-9850
I will learn more about it.
--
From: 梅西0247
Sent: Tuesday, June 21, 2016, 10:31
To: Mich Talebzadeh
To Yong Zhang: Yes, a broadcast join hint works, but it is not what I
want. Sometimes the result is really too big to broadcast. What I
want is a more adaptive implementation.
--
From: Yong Zhang
What operations did you run in the Spark shell?
It would be easier for other people to reproduce using your code snippet.
Thanks
On Mon, Jun 20, 2016 at 6:20 PM, Jeff Zhang wrote:
> Could you check the YARN app logs for details? Run the command "yarn logs
> -applicationId " to
Having looked closely at Jupyter, Zeppelin, and Spark-Notebook: only the
latter seems to be close to having support for Spark 2.X.
While I am interested in using Spark Notebook as soon as that support is
available, are there alternatives that work *now*? For example, some
unmerged-yet-working
Could you check the YARN app logs for details? Run the command "yarn logs
-applicationId " to get the YARN log
On Tue, Jun 21, 2016 at 9:18 AM, wgtmac wrote:
> I ran into problems in building Spark 2.0. The build process actually
> succeeded but when I uploaded to cluster and
I ran into problems building Spark 2.0. The build process actually
succeeded, but when I uploaded it to the cluster and launched the Spark shell on
YARN, it reported the following exceptions again and again:
16/06/17 03:32:00 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
Container marked as
You have to create input1Pair inside the foreachRDD function. Based on your
original code, one solution would look like the following:
val input2Pair = input2.map(x => (x._1, x))
input2Pair.cache()
streamData.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val input1Pair = rdd.map(x => (x._1, x))
    val joinData
We are intermittently getting this error when Spark tries to load data from
S3:
Caused by: sun.security.provider.certpath.SunCertPathBuilderException:
unable to find valid certification path to requested target.
Hello,
We are also experiencing the same error. Could you please provide the steps
that resolved the issue?
Thanks
Satya
OK, JIRA created: https://issues.apache.org/jira/browse/SPARK-16080
Also, after looking at the code a bit I think I see the reason. If I'm
correct, it may actually be a very easy fix.
On Mon, Jun 20, 2016 at 1:21 PM Marcelo Vanzin wrote:
> It doesn't hurt to have a bug
Hi,
I have a DF based on a table, sorted as shown below.
This is fine, and when I register it as a tempTable I can populate the underlying
table sales2 in Hive. That sales2 is an ORC table.
val s = HiveContext.table("sales_staging")
val sorted =
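The message is cut off here; a minimal sketch of how such a flow typically continues (the sort column and the insert statement are assumptions, not the original code):

val sorted = s.orderBy("prod_id")   // assumed sort column
sorted.registerTempTable("tmp")
HiveContext.sql("INSERT OVERWRITE TABLE sales2 SELECT * FROM tmp")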
It doesn't hurt to have a bug tracking it, in case anyone else has
time to look at it before I do.
On Mon, Jun 20, 2016 at 1:20 PM, Jonathan Kelly wrote:
> Thanks for the confirmation! Shall I cut a JIRA issue?
>
> On Mon, Jun 20, 2016 at 10:42 AM Marcelo Vanzin
Thanks for the confirmation! Shall I cut a JIRA issue?
On Mon, Jun 20, 2016 at 10:42 AM Marcelo Vanzin wrote:
> I just tried this locally and can see the wrong behavior you mention.
> I'm running a somewhat old build of 2.0, but I'll take a look.
>
> On Mon, Jun 20, 2016 at
I'll try dropping maxRatePerPartition=400, or maybe even lower.
However, even at application startup I have this large scheduling
delay. I will report my progress later on.
On Mon, Jun 20, 2016 at 2:12 PM, Cody Koeninger wrote:
> If your batch time is 1 second and your
I just tried this locally and can see the wrong behavior you mention.
I'm running a somewhat old build of 2.0, but I'll take a look.
On Mon, Jun 20, 2016 at 7:04 AM, Jonathan Kelly wrote:
> Does anybody have any thoughts on this?
>
> On Fri, Jun 17, 2016 at 6:36 PM
Thanks a lot for the response.
input1Pair is a DStream. I tried with the code snippet below:
result.foreachRDD { externalRDD =>
  if (!externalRDD.isEmpty()) {
    val ss = input1Pair.transform { rdd =>
      input2Pair.leftOuterJoin(rdd) }
  } else {
    val ss = input1Pair.transform {
There are around twenty data generators in mllib, none of which are
presently migrated to ml.
Here is an example:
/**
* :: DeveloperApi ::
* Generate sample data used for SVM. This class generates uniform random values
* for the features and adds Gaussian noise with weight 0.1 to generate
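(The excerpt is cut off here.) For reference, a rough sketch of what such a generator does, as an illustrative approximation rather than the actual SVMDataGenerator source:

import scala.util.Random
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val rnd = new Random(42)
val nFeatures = 10
val weights = Array.fill(nFeatures)(rnd.nextGaussian())   // fixed separating hyperplane
val points = (1 to 1000).map { _ =>
  val x = Array.fill(nFeatures)(rnd.nextDouble())          // uniform random features
  val margin = x.zip(weights).map { case (a, b) => a * b }.sum + 0.1 * rnd.nextGaussian()
  LabeledPoint(if (margin > 0) 1.0 else 0.0, Vectors.dense(x))   // label with Gaussian noise of weight 0.1
}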
Hi,
How would you do that without/outside streaming?
Jacek
On 17 Jun 2016 12:12 a.m., "Amit Assudani" wrote:
> Hi All,
>
>
> Can I update batch data frames loaded in memory with Streaming data,
>
>
> For eg,
>
>
> I have an employee DF registered as a temporary table; it
Hi,
I am using Amazon EMR for Spark and we have a 12-node cluster with 1 master and
11 core nodes.
However, my cluster isn't scaling up to capacity.
Would you have any suggestions for altering the Spark cluster settings to utilize my
cluster to its maximum capacity?
Resource Manager:
If you are using Spark > 1.5, the best way is to use the DataFrame API directly
instead of SQL. In the DataFrame API you can specify the broadcast join hint,
which will force the broadcast join.
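For example, something like this (a sketch; the table and column names follow the thread, and the filtered side is the one being broadcast):

import org.apache.spark.sql.functions.broadcast

val small = sqlContext.table("table1").filter("c2 < 1000")
val joined = sqlContext.table("table2").join(broadcast(small), "c1")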
Yong
From: mich.talebza...@gmail.com
Date: Mon, 20 Jun 2016 13:09:17 +0100
Subject: Re:
If your batch time is 1 second and your average processing time is
1.16 seconds, you're always going to be falling behind. That would
explain why you've built up an hour of scheduling delay after eight
hours of running.
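(Roughly: in eight hours there are 28,800 one-second batches, but at about 1.16 s each only around 28,800 / 1.16 ≈ 24,800 of them can complete, leaving a backlog of roughly 4,000 batches, i.e. about an hour of accumulated scheduling delay.)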
On Sat, Jun 18, 2016 at 4:40 PM, Colin Kincaid Williams
Does anybody have any thoughts on this?
On Fri, Jun 17, 2016 at 6:36 PM Jonathan Kelly
wrote:
> I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT
> (commit bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's
> log4j.properties is not getting picked up in the
I have a DataFrame with 7 dimensions and I built a cube on them:
val cube = df.cube('d1,'d2,'d3,'d4,'d5,'d6,'d7)
val cc = cube.agg(sum('p1).as("p1"),sum('p2).as("p2")).cache
and then defined a rank function on a window:
val rankSpec =
Window.partitionBy('d1,'d2,'d3,'d4,'d5,'d6).orderBy('p1.desc)
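The message is truncated here; presumably the rank is then applied over that window roughly like this (an assumption, not the original code):

import org.apache.spark.sql.functions.rank
val ranked = cc.withColumn("rank", rank().over(rankSpec))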
Hi Praseetha,
In order to check whether a DStream is empty or not, using the isEmpty method is
correct. I think the problem here is calling
input1Pair.leftOuterJoin(input2Pair).
I guess the input1Pair rdd comes from the transformation above. You should do it on
the DStream instead. In this case, do any transformation
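Something along these lines, reusing the variable names from this thread (a sketch):

val joined = input1Pair.transform { rdd =>
  input2Pair.leftOuterJoin(rdd)   // input2Pair is the batch pair RDD
}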
Hi Experts,
I have 2 inputs, where the first input is a stream (say input1) and the second
one is a batch (say input2). I want to figure out whether the keys in the first
input match a single row or more than one row in the second input. The further
transformations/logic depend on the number of rows matching,
What sort of tables are these?
Can you register the result set as a temp table and do a join on that,
assuming the RS is going to be small?
s.filter(($"c2" < 1000)).registerTempTable("tmp")
and then do a join between tmp and Table2
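i.e. something like this (a sketch; the join column c1 is taken from the thread):

val joined = HiveContext.sql("SELECT * FROM tmp t JOIN Table2 b ON t.c1 = b.c1")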
HTH
Dr Mich Talebzadeh
LinkedIn *
It seems hard to predict the output size of filters because the current
Spark has limited statistics on the input data. A few hours ago, Reynold
created a ticket for a cost-based optimizer framework:
https://issues.apache.org/jira/browse/SPARK-16026.
If you have ideas, questions, or suggestions,
Try this.
This is for Oracle but should work for MSSQL. If you want ordering then do
it on the DF:
val d = HiveContext.load("jdbc",
Map("url" -> _ORACLEserver,
"dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED,
to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS
Hi,
How about caching the result of `select * from a where a.c2 < 1000`, then
joining them?
You probably need to tune `spark.sql.autoBroadcastJoinThreshold` to enable
broadcast joins for the result table.
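For instance (a rough sketch; the threshold value is only an example):

sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)
val filtered = sqlContext.sql("SELECT * FROM table1 a WHERE a.c2 < 1000").cache()
filtered.registerTempTable("filtered_a")
val joined = sqlContext.sql("SELECT * FROM filtered_a a JOIN table2 b ON a.c1 = b.c1")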
// maropu
On Mon, Jun 20, 2016 at 8:06 PM, 梅西0247 wrote:
> Hi
Hi,
Currently, no.
Spark cannot preserve the order of the input data from JDBC.
If you want the IDs ordered, you need to sort them explicitly.
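e.g. something like this (a sketch; the connection options are placeholders):

val df = sqlContext.read.format("jdbc")
  .option("url", jdbcUrl)          // placeholder connection settings
  .option("dbtable", "mytable")
  .load()
val ordered = df.orderBy("ID")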
// maropu
On Mon, Jun 20, 2016 at 7:41 PM, Ashok Kumar
wrote:
> Hi,
>
> I have a SQL server table with 500,000,000
Hi everyone,
I ran a SQL join statement on Spark 1.6.1 like this:
select * from table1 a join table2 b on a.c1 = b.c1 where a.c2 < 1000;
and it took quite a long time because it is a SortMergeJoin and the two tables are big.
In fact, the size of the filter result (select * from a where a.c2 < 1000)
Hi,
I am trying Spark 2.0. I downloaded a prebuilt version,
spark-2.0.0-preview-bin-hadoop2.7.tgz, for trial and installed it on my testing
cluster. I had HDFS, YARN and the Hive metastore service in place. When I
started the thrift server, it started as expected. When I tried to connect
There are a number of research papers about tensor factorization and its use in
machine learning.
Is tensor factorization on the roadmap?
Hi,
I have no idea why I get this error:
Py4JJavaError: An error occurred while calling o69143.parquet.
: org.apache.spark.SparkException: Job aborted.
at
Thank you very much.
On Mon, Jun 20, 2016 at 3:38 PM, Jörn Franke wrote:
> If you insert the data sorted then there is no need to bucket the data.
> You can even create an index in Spark. Simply set the outputformat
> configuration orc.create.index = true
>
>
> On 20 Jun
Thank you very much.
On Mon, Jun 20, 2016 at 3:10 PM, Mich Talebzadeh
wrote:
> Right, your concern is that you expect the store index in the ORC file to help the
> optimizer.
>
> Frankly I do not know what
> write().mode(SaveMode.Overwrite).orc("orcFileToRead" does actually
I'm not sure that's the conclusion. It's not trivial to tune and
configure YARN and Spark to match your app's memory needs and profile,
but it's also just a matter of setting them properly. I'm not clear
whether you've set the executor memory, for example, and in particular
spark.yarn.executor.memoryOverhead
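For example, these can be set explicitly when building the context (a sketch; the values are placeholders, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.yarn.executor.memoryOverhead", "1024")   // in MB
val sc = new SparkContext(conf)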
Hi Alexander,
Thanks a lot for your comments.
Spark seems not that stable when it comes to running big jobs: too much data or too
much time. Yes, the problem is gone when reducing the scale. Sometimes resetting
some job parameters (such as --driver-memory, which may help with GC issues),
sometimes may
I agree, it was by mistake.
I just updated it so that the next person looking for torrent broadcast issues
will have a hint :)
Thank you.
Daniel
On Sun, Jun 19, 2016 at 5:26 PM, Ted Yu wrote:
> I think good practice is not to hold on to SparkContext in mapFunction.
>
> On
If you insert the data sorted then there is no need to bucket the data.
You can even create an index in Spark. Simply set the output format
configuration orc.create.index = true
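One way this might be wired up from Spark (an assumption about where the setting goes, not something verified in this thread) is to put it on the Hadoop configuration before writing the ORC file:

sc.hadoopConfiguration.set("orc.create.index", "true")   // assumed knob, per the suggestion above
df.write.mode("overwrite").orc("orcFileToRead")          // df and the path stand in for your own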
> On 20 Jun 2016, at 09:10, Mich Talebzadeh wrote:
>
> Right, your concern is that you
It's actually necessary to retire keys that become "Zero" or "Empty", so to
speak. In your case, the key is "imageURL" and the values are a dictionary, one
of whose fields is the "count" that you are maintaining. For simplicity and
illustration's sake I will assume imageURL to be a string like "abc". Your
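The message is cut off here; in code, retiring such keys usually means returning None from the state update function so Spark drops the key from the state. A simplified sketch that tracks only the count (imageUrlStream is a hypothetical DStream of (imageURL, delta) pairs; the full dictionary value would work the same way):

val counts = imageUrlStream.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
  val total = state.getOrElse(0L) + newValues.sum
  if (total <= 0L) None else Some(total)   // returning None removes the key from the state
}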
Hi Sir,
Please unsubscribe me
--
Regards,
Ram Krishna KT
Hi Mich,
Thank you for your reply.
Let me explain more clearly.
A file with 100 records needs to be joined with a big lookup file created in ORC
format (500 million records). The Spark process I wrote is returning
the matching records and is working fine. My concern is that it loads the
entire