Re: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread 梅西0247
Hi, I think it is related to this issue [Adaptive execution in Spark] https://issues.apache.org/jira/browse/SPARK-9850. I will learn more about it. -- From: 梅西0247 Sent: Tuesday, June 21, 2016, 10:31 To: Mich Talebzadeh

Re: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread 梅西0247
To Yong Zhang: Yes, a broadcast join hint works, but it is not what I want. Sometimes the result is simply too big to broadcast. What I want is a more adaptive implementation. -- From: Yong Zhang

Re: Build Spark 2.0 succeeded but could not run it on YARN

2016-06-20 Thread Ted Yu
What operations did you run in the Spark shell? It would be easier for other people to reproduce using your code snippet. Thanks On Mon, Jun 20, 2016 at 6:20 PM, Jeff Zhang wrote: > Could you check the yarn app logs for details? Run the command "yarn logs > -applicationId <application id>" to

Notebook(s) for Spark 2.0 ?

2016-06-20 Thread Stephen Boesch
Having looked closely at Jupyter, Zeppelin, and Spark-Notebook: only the latter seems to be close to having support for Spark 2.X. While I am interested in using Spark Notebook as soon as that support is available, are there alternatives that work *now*? For example, some unmerged-yet-working

Re: Build Spark 2.0 succeeded but could not run it on YARN

2016-06-20 Thread Jeff Zhang
Could you check the yarn app logs for details? Run the command "yarn logs -applicationId <application id>" to get the yarn log. On Tue, Jun 21, 2016 at 9:18 AM, wgtmac wrote: > I ran into problems building Spark 2.0. The build process actually > succeeded, but when I uploaded it to the cluster and

Build Spark 2.0 succeeded but could not run it on YARN

2016-06-20 Thread wgtmac
I ran into problems building Spark 2.0. The build process actually succeeded, but when I uploaded it to the cluster and launched the Spark shell on YARN, it reported the following exceptions again and again: 16/06/17 03:32:00 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as

Re: Verifying if DStream is empty

2016-06-20 Thread nguyen duc tuan
You have to create input1Pair inside the foreachRDD function. Starting from your original code, one solution would look like the following: val input2Pair = input2.map(x => (x._1, x)) input2Pair.cache() streamData.foreachRDD{rdd => if(!rdd.isEmpty()){ val input1Pair = rdd.map(x => (x._1, x)) val joinData
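For reference, a minimal self-contained sketch of that structure; queueStream stands in for the real streaming source, and the sample keys and values are hypothetical:

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamBatchJoinSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stream-batch-join").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // Static "batch" side, keyed by its first field and cached once.
    val input2 = sc.parallelize(Seq(("k1", "batch-1"), ("k2", "batch-2")))
    val input2Pair = input2.map(x => (x._1, x)).cache()

    // queueStream stands in for the real streaming source (e.g. Kafka).
    val queue = mutable.Queue(sc.parallelize(Seq(("k1", "stream-1"), ("k3", "stream-3"))))
    val streamData = ssc.queueStream(queue)

    streamData.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Key the micro-batch and join it against the cached batch RDD.
        val input1Pair = rdd.map(x => (x._1, x))
        val joinData = input1Pair.leftOuterJoin(input2Pair)
        joinData.collect().foreach(println)
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}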

javax.net.ssl.SSLHandshakeException: unable to find valid certification path to requested target

2016-06-20 Thread Utkarsh Sengar
We are intermittently getting this error when Spark tries to load data from S3: Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target.

Re: SparkSQL issue: Spark 1.3.1 + hadoop 2.6 on CDH5.3 with parquet

2016-06-20 Thread Satya
Hello, We are also experiencing the same error. Can you please provide the steps that resolved the issue? Thanks, Satya -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-issue-Spark-1-3-1-hadoop-2-6-on-CDH5-3-with-parquet-tp22808p27197.html Sent from

Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
OK, JIRA created: https://issues.apache.org/jira/browse/SPARK-16080 Also, after looking at the code a bit I think I see the reason. If I'm correct, it may actually be a very easy fix. On Mon, Jun 20, 2016 at 1:21 PM Marcelo Vanzin wrote: > It doesn't hurt to have a bug

Saving data using tempTable versus save() method

2016-06-20 Thread Mich Talebzadeh
Hi, I have a DF based on a table, sorted and shown below. This is fine, and when I register it as a tempTable I can populate the underlying table sales2 in Hive. That sales2 is an ORC table. val s = HiveContext.table("sales_staging") val sorted =
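For context, a hedged sketch of the two paths being compared, assuming a SparkContext named sc, a Hive table sales_staging to read from, an existing ORC table sales2, and a hypothetical sort column id:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val s = hiveContext.table("sales_staging")
val sorted = s.orderBy("id")                     // hypothetical sort column

// Path 1: register a temp table and populate sales2 with HiveQL.
sorted.registerTempTable("tmp")
hiveContext.sql("INSERT INTO TABLE sales2 SELECT * FROM tmp")

// Path 2: write the DataFrame out directly with save().
sorted.write.mode(SaveMode.Overwrite).format("orc").save("/tmp/sales2_orc")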

Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Marcelo Vanzin
It doesn't hurt to have a bug tracking it, in case anyone else has time to look at it before I do. On Mon, Jun 20, 2016 at 1:20 PM, Jonathan Kelly wrote: > Thanks for the confirmation! Shall I cut a JIRA issue? > > On Mon, Jun 20, 2016 at 10:42 AM Marcelo Vanzin

Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
Thanks for the confirmation! Shall I cut a JIRA issue? On Mon, Jun 20, 2016 at 10:42 AM Marcelo Vanzin wrote: > I just tried this locally and can see the wrong behavior you mention. > I'm running a somewhat old build of 2.0, but I'll take a look. > > On Mon, Jun 20, 2016 at

Re: Improving performance of a kafka spark streaming app

2016-06-20 Thread Colin Kincaid Williams
I'll try dropping maxRatePerPartition=400, or maybe even lower. However, even at application startup I have this large scheduling delay. I will report my progress later on. On Mon, Jun 20, 2016 at 2:12 PM, Cody Koeninger wrote: > If your batch time is 1 second and your

Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Marcelo Vanzin
I just tried this locally and can see the wrong behavior you mention. I'm running a somewhat old build of 2.0, but I'll take a look. On Mon, Jun 20, 2016 at 7:04 AM, Jonathan Kelly wrote: > Does anybody have any thoughts on this? > > On Fri, Jun 17, 2016 at 6:36 PM

Re: Verifying if DStream is empty

2016-06-20 Thread Praseetha
Thanks a lot for the response. input1Pair is a DStream. I tried with the code snippet below, result.foreachRDD{externalRDD => if(!externalRDD.isEmpty()){ val ss = input1Pair.transform{ rdd => input2Pair.leftOuterJoin(rdd)} }else{ val ss = input1Pair.transform{

Data Generators mllib -> ml

2016-06-20 Thread Stephen Boesch
There are around twenty data generators in mllib, none of which have been migrated to ml yet. Here is an example: /** * :: DeveloperApi :: * Generate sample data used for SVM. This class generates uniform random values * for the features and adds Gaussian noise with weight 0.1 to generate
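For illustration, a re-implementation sketch (not the library code) of what such an SVM-style generator does — uniform random features, a fixed random weight vector, and Gaussian noise with weight 0.1 on the margin; all names are hypothetical:

import scala.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Generates labeled points whose label is the sign of a noisy linear margin.
def generateSvmLikeData(sc: SparkContext, numExamples: Int, numFeatures: Int, seed: Long) = {
  val globalRnd = new Random(seed)
  val weights = Array.fill(numFeatures)(globalRnd.nextGaussian())
  sc.parallelize(0 until numExamples).map { i =>
    val rnd = new Random(seed + i)
    val x = Array.fill(numFeatures)(rnd.nextDouble() * 2.0 - 1.0)      // uniform in [-1, 1)
    val margin = x.zip(weights).map { case (xi, wi) => xi * wi }.sum + 0.1 * rnd.nextGaussian()
    LabeledPoint(if (margin > 0) 1.0 else 0.0, Vectors.dense(x))
  }
}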

Re: Update Batch DF with Streaming

2016-06-20 Thread Jacek Laskowski
Hi, How would you do that without/outside streaming? Jacek On 17 Jun 2016 12:12 a.m., "Amit Assudani" wrote: > Hi All, > > > Can I update batch data frames loaded in memory with Streaming data, > > > For eg, > > > I have employee DF is registered as temporary table, it

Underutilized Cluster

2016-06-20 Thread Chadha Pooja
Hi, I am using Amazon EMR for Spark and we have a 12-node cluster with 1 master and 11 core nodes. However, my cluster isn’t scaling up to capacity. Would you have suggestions for altering the Spark cluster settings to utilize the cluster to its maximum capacity? Resource Manager:
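A hedged illustration of the knobs that usually determine utilization on YARN/EMR; the numbers are placeholders and would have to be sized from the actual node memory and vcores:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "22")             // e.g. 2 executors per core node
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "10g")
  .set("spark.yarn.executor.memoryOverhead", "1024") // MB
// Alternatively, let YARN scale the executor count up and down:
// conf.set("spark.dynamicAllocation.enabled", "true")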

RE: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread Yong Zhang
If you are using Spark > 1.5, the best way is to use the DataFrame API directly, instead of SQL. In DataFrame, you can specify the broadcast join hint in the DataFrame API, which will force the broadcast join. Yong From: mich.talebza...@gmail.com Date: Mon, 20 Jun 2016 13:09:17 +0100 Subject: Re:
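For reference, a minimal sketch of the hint being described, assuming an existing sqlContext and that the filtered side fits in memory on every executor; table and column names follow the original question:

import org.apache.spark.sql.functions.{broadcast, col}

val small = sqlContext.table("table1").filter(col("c2") < 1000)   // the filtered, small side
val big   = sqlContext.table("table2")

// broadcast() marks the small side for a BroadcastHashJoin, overriding the
// size estimate the planner would otherwise use.
val joined = big.join(broadcast(small), big("c1") === small("c1"))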

Re: Improving performance of a kafka spark streaming app

2016-06-20 Thread Cody Koeninger
If your batch time is 1 second and your average processing time is 1.16 seconds, you're always going to be falling behind. That would explain why you've built up an hour of scheduling delay after eight hours of running. On Sat, Jun 18, 2016 at 4:40 PM, Colin Kincaid Williams
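Falling behind by about 0.16 s per 1 s batch works out to roughly 0.16 × 3600 × 8 ≈ 4,600 s of accumulated delay over eight hours, consistent with the report. A hedged sketch of the rate-limiting settings usually involved (the values are placeholders, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Cap the records pulled per Kafka partition per second so each batch
  // can finish inside the batch interval.
  .set("spark.streaming.kafka.maxRatePerPartition", "200")
  // Let Spark adapt the ingestion rate to the observed processing time.
  .set("spark.streaming.backpressure.enabled", "true")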

Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
Does anybody have any thoughts on this? On Fri, Jun 17, 2016 at 6:36 PM Jonathan Kelly wrote: > I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT > (commit bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's > log4j.properties is not getting picked up in the

dense_rank skips ranks on cube

2016-06-20 Thread talgr
I have a dataframe with 7 dimensions and built a cube on them: val cube = df.cube('d1,'d2,'d3,'d4,'d5,'d6,'d7) val cc = cube.agg(sum('p1).as("p1"),sum('p2).as("p2")).cache and then defined a rank function on a window: val rankSpec = Window.partitionBy('d1,'d2,'d3,'d4,'d5,'d6).orderBy('p1.desc)
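A hedged sketch of how the rank is typically applied on top of that cube, continuing the names above and assuming an existing SQLContext named sqlContext and a DataFrame df:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, sum}
import sqlContext.implicits._   // for the 'symbol column syntax used above

val cube = df.cube('d1, 'd2, 'd3, 'd4, 'd5, 'd6, 'd7)
val cc = cube.agg(sum('p1).as("p1"), sum('p2).as("p2")).cache()

// Rank within each (d1..d6) partition by descending p1. Note that cube()
// also emits rollup rows with null dimension values, and those rows land in
// the same window partitions as the detail rows.
val rankSpec = Window.partitionBy('d1, 'd2, 'd3, 'd4, 'd5, 'd6).orderBy('p1.desc)
val ranked = cc.withColumn("rank", dense_rank().over(rankSpec))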

Re: Verifying if DStream is empty

2016-06-20 Thread nguyen duc tuan
Hi Praseetha, In order to check whether a DStream is empty or not, using the isEmpty method is correct. I think the problem here is calling input1Pair.leftOuterJoin(input2Pair). I guess the input1Pair RDD comes from the transformation above. You should do it on the DStream instead. In this case, do any transformation
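A small sketch of the transform-based variant being suggested; the element types and names are assumed since the full code is not shown:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// input1Pair: keyed stream; input2Pair: cached batch side keyed the same way.
// transform() runs the join on every micro-batch RDD of the stream.
def joinPerBatch(input1Pair: DStream[(String, String)],
                 input2Pair: RDD[(String, String)]): DStream[(String, (String, Option[String]))] = {
  input1Pair.transform { rdd =>
    if (rdd.isEmpty()) rdd.sparkContext.emptyRDD[(String, (String, Option[String]))]
    else rdd.leftOuterJoin(input2Pair)
  }
}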

Verifying if DStream is empty

2016-06-20 Thread Praseetha
Hi Experts, I have 2 inputs, where the first input is a stream (say input1) and the second one is batch (say input2). I want to figure out if the keys in the first input match a single row or more than one row in the second input. The further transformations/logic depend on the number of rows matching,

Re: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread Mich Talebzadeh
What sort of tables are these? Can you register the result set as a temp table and do a join on that, assuming the RS is going to be small? s.filter(($"c2" < 1000)).registerTempTable("tmp") and then do a join between tmp and Table2. HTH Dr Mich Talebzadeh LinkedIn *

Re: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread Takeshi Yamamuro
It seems hard to predict the output size of filters because the current Spark has limited statistics about the input data. A few hours ago, Reynold created a ticket for a cost-based optimizer framework: https://issues.apache.org/jira/browse/SPARK-16026. If you have ideas, questions, or suggestions,

Re: JDBC load into tempTable

2016-06-20 Thread Mich Talebzadeh
Try this This is for Oracle but should work for MSSQL. If you want ordering then do it on DF val d = HiveContext.load("jdbc", Map("url" -> _ORACLEserver, "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS
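For reference, a hedged sketch of the same pattern using the DataFrameReader API equivalent to the HiveContext.load("jdbc", ...) call above; the URL, credentials, table and subquery are placeholders, and the alias "t" on the subquery is required by most JDBC dialects:

val d = sqlContext.read
  .format("jdbc")
  .options(Map(
    "url"      -> "jdbc:oracle:thin:@//dbhost:1521/service",   // or a SQL Server JDBC URL
    "dbtable"  -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED FROM some_table) t",
    "user"     -> "scott",
    "password" -> "tiger"))
  .load()

// Ordering, if required, is applied on the DataFrame afterwards.
val ordered = d.orderBy("ID")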

Re: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread Takeshi Yamamuro
Hi, How about caching the result of `select * from a where a.c2 < 1000`, then joining them? You probably need to tune `spark.sql.autoBroadcastJoinThreshold` to enable broadcast joins for the result table. // maropu On Mon, Jun 20, 2016 at 8:06 PM, 梅西0247 wrote: > Hi
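A hedged sketch of that suggestion: cache the filtered side first so the planner has a size for it, then raise the threshold (in bytes) above that size; 100 MB and the table names are placeholders:

sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100L * 1024 * 1024).toString)

val filtered = sqlContext.sql("SELECT * FROM table1 WHERE c2 < 1000").cache()
filtered.count()                          // materialize the cache so size statistics exist
filtered.registerTempTable("filtered_a")

val joined = sqlContext.sql(
  "SELECT * FROM filtered_a a JOIN table2 b ON a.c1 = b.c1")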

Re: JDBC load into tempTable

2016-06-20 Thread Takeshi Yamamuro
Hi, Currently, no. Spark cannot preserve the order of input data from JDBC. If you want the ordered IDs, you need to sort them explicitly. // maropu On Mon, Jun 20, 2016 at 7:41 PM, Ashok Kumar wrote: > Hi, > > I have a SQL server table with 500,000,000
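A minimal sketch of that explicit sort after loading over JDBC; the URL, table, column and credentials are placeholders:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "sa")
props.setProperty("password", "secret")

// The JDBC source gives no ordering guarantee, so sort the IDs afterwards.
val df = sqlContext.read.jdbc("jdbc:sqlserver://dbhost;databaseName=mydb", "dbo.mytable", props)
val orderedIds = df.select("ID").orderBy("ID")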

Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread 梅西0247
Hi everyone, I ran a SQL join statement on Spark 1.6.1 like this: select * from table1 a join table2 b on a.c1 = b.c1 where a.c2 < 1000; and it took quite a long time because it is a SortMergeJoin and the two tables are big. In fact, the size of the filter result (select * from a where a.c2 < 1000)

Beeline exception when connecting to Spark 2.0 ThriftServer running on yarn

2016-06-20 Thread Lei Lei2 Gu
Hi, I am trying Spark 2.0. I downloaded a prebuilt version, spark-2.0.0-preview-bin-hadoop2.7.tgz, for trial and installed it on my testing cluster. I had HDFS, YARN and the Hive metastore service in place. When I started the thrift server, it started as expected. When I tried to connect

tensor factorization FR

2016-06-20 Thread Roberto Pagliari
There are a number of research papers about tensor factorization and its use in machine learning. Is tensor factorization on the roadmap?

Unable to acquire bytes of memory

2016-06-20 Thread pseudo oduesp
Hi, I have no idea why I get this error: Py4JJavaError: An error occurred while calling o69143.parquet. : org.apache.spark.SparkException: Job aborted. at

Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-20 Thread Mohanraj Ragupathiraj
Thank you very much. On Mon, Jun 20, 2016 at 3:38 PM, Jörn Franke wrote: > If you insert the data sorted then there is no need to bucket the data. > You can even create an index in Spark. Simply set the outputformat > configuration orc.create.index = true > > > On 20 Jun

Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-20 Thread Mohanraj Ragupathiraj
Thank you very much. On Mon, Jun 20, 2016 at 3:10 PM, Mich Talebzadeh wrote: > Right, your concern is that you expect the store index in the ORC file to help the > optimizer. > > Frankly I do not know what > write().mode(SaveMode.Overwrite).orc("orcFileToRead" does actually

Re: spark job automatically killed without rhyme or reason

2016-06-20 Thread Sean Owen
I'm not sure that's the conclusion. It's not trivial to tune and configure YARN and Spark to match your app's memory needs and profile, but it's also just a matter of setting them properly. I'm not clear you've set the executor memory, for example, in particular spark.yarn.executor.memoryOverhead
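For illustration only, a hedged sketch of setting those values explicitly; the numbers are placeholders, and the point is that executor memory plus the YARN overhead must fit inside the container size YARN is willing to grant:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.yarn.executor.memoryOverhead", "1024")   // MB, on top of executor memory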

Re: spark job automatically killed without rhyme or reason

2016-06-20 Thread Zhiliang Zhu
Hi Alexander, Thanks a lot for your comments. Spark seems not that stable when it comes to running big jobs, with too much data or too much time; yes, the problem goes away when reducing the scale. Sometimes resetting a job parameter (such as --driver-memory, which may help with a GC issue), sometimes may

Re: Switching broadcast mechanism from torrrent

2016-06-20 Thread Daniel Haviv
I agree, it was by mistake. I just updated so that the next person looking for torrent broadcast issues will have a hint :) Thank you. Daniel On Sun, Jun 19, 2016 at 5:26 PM, Ted Yu wrote: > I think good practice is not to hold on to SparkContext in mapFunction. > > On

Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-20 Thread Jörn Franke
If you insert the data sorted then there is no need to bucket the data. You can even create an index in Spark. Simply set the outputformat configuration orc.create.index = true > On 20 Jun 2016, at 09:10, Mich Talebzadeh wrote: > > Right, your concern is that you
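A hedged sketch of that idea: sort on the lookup key before writing so the ORC min/max statistics become selective, and ask the writer to build the row index. Whether the orc.create.index option is honoured depends on the ORC writer in use, so treat it as an assumption; df, the key column and the path are placeholders:

import org.apache.spark.sql.SaveMode

sqlContext.setConf("orc.create.index", "true")

df.sortWithinPartitions("key")
  .write
  .mode(SaveMode.Overwrite)
  .orc("/data/lookup_orc")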

Re: Spark Kafka stream processing time increasing gradually

2016-06-20 Thread N B
It's actually necessary to retire keys that become "zero" or "empty", so to speak. In your case, the key is "imageURL" and the values are a dictionary, one of whose fields is the "count" you are maintaining. For simplicity and illustration's sake I will assume imageURL to be a string like "abc". Your
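A hedged sketch of that retirement pattern with updateStateByKey; the thread's real state is a dictionary, so a plain running count stands in here:

// Keep a running count per imageURL and drop the key by returning None once
// it reaches zero, so the state table does not grow without bound.
def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] = {
  val updated = state.getOrElse(0) + newValues.sum
  if (updated <= 0) None   // returning None removes the key from the state
  else Some(updated)
}

// usage, assuming counts is a DStream[(String, Int)] keyed by imageURL:
// val state = counts.updateStateByKey(updateCount _)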

Unsubscribe

2016-06-20 Thread Ram Krishna
Hi Sir, Please unsubscribe me -- Regards, Ram Krishna KT

Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-20 Thread Mohanraj Ragupathiraj
Hi Mich, Thank you for your reply. Let me explain more clearly. A file with 100 records needs to be joined with a big lookup file created in ORC format (500 million records). The Spark process I wrote is returning the matching records and is working fine. My concern is that it loads the entire