State management in spark-streaming

2015-12-06 Thread Hafiz Mujadid
Hi, I have Spark Streaming with MQTT as my source. There is a continuous stream of flame-sensor events, i.e. Fire and No-Fire. I want to generate a Fire event when a new event is Fire and ignore all subsequent events until a No-Fire event happens. Similarly, if I get a No-Fire event I will
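
One way to express this edge detection is with updateStateByKey; below is a minimal, hedged sketch assuming the MQTT payload can be parsed into (sensorId, isFire) pairs. The socket source, parsing and paths are stand-ins, not the actual MQTTUtils setup.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("FireEdgeDetector").setMaster("local[2]") // local master for the sketch
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/fire-checkpoint") // updateStateByKey needs a checkpoint directory

    // Stand-in source: lines like "sensor42,FIRE" or "sensor42,NOFIRE"; with MQTT this
    // would come from MQTTUtils.createStream(...) instead (assumption about the payload).
    val readings = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(parts => (parts(0), parts(1) == "FIRE"))

    // State per sensor = (current status, did it change in this batch). Only the last
    // reading of each batch is considered -- a simplification for the sketch.
    val state = readings.updateStateByKey[(Boolean, Boolean)] {
      (batch: Seq[Boolean], prev: Option[(Boolean, Boolean)]) =>
        batch.lastOption match {
          case None      => prev.map { case (status, _) => (status, false) }
          case Some(now) => Some((now, !prev.exists(_._1 == now)))
        }
    }

    // Emit an event only on a transition (Fire -> No-Fire or No-Fire -> Fire).
    state.filter { case (_, (_, changed)) => changed }
         .map { case (sensor, (isFire, _)) => s"$sensor: ${if (isFire) "FIRE" else "NO-FIRE"}" }
         .print()

    ssc.start()
    ssc.awaitTermination()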

Spark sql data frames do they run in parallel by default?

2015-12-06 Thread kali.tumm...@gmail.com
Hi all, I wrote the Spark code below to extract data from SQL Server using SQLContext.read.format with several different options. Question: does the sqlContext.read load function run in parallel by default, and does it use all the available cores? When I am saving the output to a file it is

Re: Dataset and lambas

2015-12-06 Thread Koert Kuipers
That's good news about plans to avoid unnecessary conversions and allow access to more efficient internal types. Could you point me to the JIRAs, if they exist already? I just tried to find them but had little luck. Best, Koert On Sat, Dec 5, 2015 at 4:09 PM, Michael Armbrust

Implementing fail-fast upon critical spark streaming tasks errors

2015-12-06 Thread yam
When a Spark Streaming task fails (after exceeding spark.task.maxFailures), the related batch job is considered failed and the driver continues to the next batch in the pipeline after updating the checkpoint to the next checkpoint positions (the new offsets when using Kafka direct streaming). I'm

fail-fast or retry failed spark streaming jobs

2015-12-06 Thread yam
There are cases where Spark Streaming job tasks fail (one, several or all tasks) and there's not much sense in progressing to the next job while discarding the failed one. For example, when failing to connect to a remote target DB, I would like to either fail fast and relaunch the application from

Support for custom serializers in Checkpoint

2015-12-06 Thread Sela, Amit
Why does Spark allow only Java serialization in checkpointing? I see in Checkpoint.serialize() that it doesn't even try to load a serializer from the configuration and uses Java's ObjectOutputStream. This means that I can't use Avro (for example) in updateStateByKey, right? Is there a reason

Re: Experiences about NoSQL databases with Spark

2015-12-06 Thread ayan guha
Hi, I have a general question. I want to do a real-time aggregation using Spark. I have Kinesis as the source and am planning ES as the data store. There might be close to 2000 distinct events possible. I want to keep a running count of how many times each event occurs. Currently, upon receiving an

Spark GraphX default Storage Level

2015-12-06 Thread prasad223
Hi All, I was trying to generate a subset of graphs using the GraphGenerators provided in the GraphX package. My code is shown below: def generateGraph(config: Config, sparkContext: SparkContext) = { if (config.graphType == "LogNormal") {
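
Related hedged sketch: the graph loaders accept explicit storage levels, while the generator APIs (to my knowledge) do not, so a generated graph may need to be rebuilt via Graph.fromEdges/Graph.apply with the desired levels. The edge-list path below is hypothetical.

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{Graph, GraphLoader}
    import org.apache.spark.storage.StorageLevel

    // The two storage-level arguments override the MEMORY_ONLY default
    // used for the graph's vertex and edge RDDs (hypothetical path).
    def loadGraph(sc: SparkContext): Graph[Int, Int] =
      GraphLoader.edgeListFile(
        sc,
        "hdfs:///data/edges.txt",
        edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
        vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)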

Re: Spark on YARN: java.lang.ClassCastException SerializedLambda to org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1

2015-12-06 Thread Mohamed Nadjib Mami
Your jars are not delivered to the workers. Have a look at this: http://stackoverflow.com/questions/24052899/how-to-make-it-easier-to-deploy-my-jar-to-spark-cluster-in-standalone-mode
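
For reference, a hedged sketch of the usual ways to ship dependency jars to the executors; the paths and class name are hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}

    // Option 1: list dependency jars in the configuration; Spark distributes them
    // to the executors (hypothetical paths).
    val conf = new SparkConf()
      .setAppName("MyApp")
      .setJars(Seq("/path/to/my-app.jar", "/path/to/dependency.jar"))
    val sc = new SparkContext(conf)

    // Option 2: add a jar after the context exists.
    sc.addJar("/path/to/another-dependency.jar")

    // Option 3 (outside the code): pass a comma-separated list to spark-submit, e.g.
    //   spark-submit --class com.example.MyApp --jars /path/to/dependency.jar my-app.jar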

Spark SQL 1.3 not finding attribute in DF

2015-12-06 Thread YaoPau
When I run df.printSchema() I get: root |-- durable_key: string (nullable = true) |-- code: string (nullable = true) |-- desc: string (nullable = true) |-- city: string (nullable = true) |-- state_code: string (nullable = true) |-- zip_code: string (nullable = true) |-- county: string

Re: Spark SQL 1.3 not finding attribute in DF

2015-12-06 Thread YaoPau
If anyone runs into the same issue, I found a workaround: >>> df.where('state_code = "NY"') works for me. >>> df.where(df.state_code == "NY").collect() fails with the error from the first post.

No support to save DataFrame in existing database table using DataFrameWriter.jdbc()

2015-12-06 Thread unk1102
Hi, I would like to store/save a DataFrame into a database table which is already created, and always insert into it without creating the table every time. Unfortunately the Spark API forces me to create the table every time. I have seen in the Spark source code that the following calls use the same method beneath; if you

Re: No support to save DataFrame in existing database table using DataFrameWriter.jdbc()

2015-12-06 Thread Ted Yu
Have you tried SaveMode.Append ? Cheers > On Dec 6, 2015, at 2:54 AM, unk1102 wrote: > > Hi I would like to store/save DataFrame in a database table which is created > already and want to insert into always without creating table every time. > Unfortunately Spark API
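
A minimal sketch of that suggestion, with hypothetical connection details and an assumed existing SQLContext; with SaveMode.Append the writer inserts into the already-created table instead of recreating it.

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    // Hypothetical credentials; `sqlContext` is an existing SQLContext and
    // "my_table" already exists in the target database.
    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "secret")

    val df = sqlContext.read.parquet("/tmp/staging") // stand-in for the DataFrame to save

    // Append inserts into the existing table instead of trying to recreate it.
    df.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:postgresql://dbhost:5432/mydb", "my_table", props)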

Re: the way to compare any two adjacent elements in one rdd

2015-12-06 Thread Zhiliang Zhu
On Saturday, December 5, 2015 3:00 PM, DB Tsai wrote: This is tricky. You need to shuffle the ending and beginning elements using mapPartitionsWithIndex. Does this mean that I need to shuffle all the elements in different partitions into one partition, then

String Manipulation/Agregation

2015-12-06 Thread Shige Song
Dear All, I am trying to do an operation similar to the one described here using SparkR. None of the solutions that work under plain R works with SparkR. Any suggestions are greatly appreciated. Best, Shige

Re: Could not load shims in class org.apache.hadoop.hive.schshim.FairSchedulerShim

2015-12-06 Thread Shige Song
Hard to tell. On Mon, Dec 7, 2015 at 11:35 AM, zhangjp <592426...@qq.com> wrote: > Hi all, > > I'm using the Spark prebuilt version 1.5.2+hadoop2.6 and the Hadoop version is > 2.6.2; when I use a Java JDBC client to execute SQL, there are some issues. > > java.lang.RuntimeException: Could not load shims

Re: Possible bug in Spark 1.5.0 onwards while loading Postgres JDBC driver

2015-12-06 Thread manasdebashiskar
My apologies for making this problem sound bigger than it actually is. After many more coffee breaks I discovered that scalikejdbc ConnectionPool.singleton(NOT_HOST_BUT_JDBC_URL, user, password) takes a URL and not a host (at least for version 2.3+). Hence it throws a very legitimate-looking

Re: the way to compare any two adjacent elements in one rdd

2015-12-06 Thread Zhiliang Zhu
On Monday, December 7, 2015 10:37 AM, DB Tsai wrote: Only the beginning and ending part of the data. The rest in the partition can be compared without a shuffle. Would you help write some pseudo-code about it? It seems that there is no shuffle-related API, or

Re: MLlib training time question

2015-12-06 Thread Haoyue Wang
Thanks Yanbo! I checked the Spark UI and found that in Exp 1) there are 52 jobs and 99 stages, while in Exp 2) there are 105 jobs and 206 stages. The time spent on each job is 3-4s, and on each stage 1-2s. That's why Exp 2) takes 2x as long as Exp 1). And I also found that in Exp 2), the

Re: Effective ways monitor and identify that a Streaming job has been failing for the last 5 minutes

2015-12-06 Thread swetha kasireddy
Any documentation/sample code on how to use Ganglia with Spark? On Sat, Dec 5, 2015 at 10:29 PM, manasdebashiskar wrote: > Spark has the capability to report to Ganglia, Graphite or JMX. > If none of those works for you, you can register your own Spark extra > listener > that

Re: the way to compare any two adjacent elements in one rdd

2015-12-06 Thread DB Tsai
Only the beginning and ending part of the data. The rest in the partition can be compared without a shuffle. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Sun, Dec 6, 2015 at 6:27 PM, Zhiliang Zhu
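
A hedged sketch of this approach, with illustrative names and assuming the RDD's partition order matches the element order (e.g. after sortBy): only the first element of each partition is collected and broadcast, and adjacent pairs are formed locally without a full shuffle.

    import org.apache.spark.rdd.RDD

    def adjacentPairs(rdd: RDD[Double]): RDD[(Double, Double)] = {
      val numPartitions = rdd.partitions.length

      // First element of every non-empty partition, keyed by partition index.
      val heads: Map[Int, Double] = rdd.mapPartitionsWithIndex { (idx, it) =>
        if (it.hasNext) Iterator((idx, it.next())) else Iterator.empty
      }.collect().toMap

      val bc = rdd.sparkContext.broadcast(heads)

      rdd.mapPartitionsWithIndex { (idx, it) =>
        val boundaries = bc.value
        // Head of the next non-empty partition, so the pair spanning the boundary is kept.
        val nextHead = ((idx + 1) until numPartitions).find(boundaries.contains).map(boundaries).iterator
        (it ++ nextHead).sliding(2).collect { case Seq(a, b) => (a, b) }
      }
    }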

how often you use Tachyon to accelerate Spark

2015-12-06 Thread Arvin
Hi all, I have some questions about Tachyon and Spark. I found that the interaction between Spark and Tachyon is caching RDDs off-heap. I wonder if you use Tachyon frequently, for example caching RDDs with Tachyon? Does this (caching RDDs with Tachyon) have a profound effect on accelerating

Could not load shims in class org.apache.hadoop.hive.schshim.FairSchedulerShim

2015-12-06 Thread zhangjp
Hi all, I'm using the Spark prebuilt version 1.5.2+hadoop2.6 and the Hadoop version is 2.6.2; when I use a Java JDBC client to execute SQL, there are some issues. java.lang.RuntimeException: Could not load shims in class org.apache.hadoop.hive.schshim.FairSchedulerShim at

mllib.recommendations.als recommendForAll not ported to ml?

2015-12-06 Thread guillaume
I have experienced very low performance with the ALSModel.transform method when feeding it with even a small cartesian product of users x items. The former mllib implementation has a recommendForAll method to return the top-n items per user in an efficient way (using the blockify method to distribute
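
For reference, a hedged sketch of the spark.mllib route that still offers this, assuming an RDD of Ratings built elsewhere; the hyperparameters are illustrative.

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    // ratings: RDD[Rating(user, product, rating)] built elsewhere (assumption).
    def topNForAllUsers(ratings: RDD[Rating], n: Int): RDD[(Int, Array[Rating])] = {
      val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)
      // Top-n products per user, computed in blocks rather than via a full
      // user x item cartesian product.
      model.recommendProductsForUsers(n)
    }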

Re: Spark sql data frames do they run in parallel by default?

2015-12-06 Thread kali.tumm...@gmail.com
Hi All, I rewrote my code to use sqlContext.read.jdbc, which lets me specify upperBound, lowerBound, numPartitions etc., which might run in parallel. I need to try it on a cluster, which I will do when I have time. But please confirm: does read.jdbc do parallel reads? Spark code: package
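
For reference, a hedged sketch of the partitioned form of read.jdbc, with hypothetical SQL Server connection details and an assumed existing SQLContext; Spark issues one range query per partition, so the reads run in parallel across the executors.

    import java.util.Properties

    // Hypothetical connection details; `sqlContext` is an existing SQLContext.
    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "secret")
    props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

    // One range query per partition over the numeric column "id",
    // so up to numPartitions reads run concurrently.
    val df = sqlContext.read.jdbc(
      url = "jdbc:sqlserver://dbhost:1433;databaseName=mydb",
      table = "dbo.sales",
      columnName = "id",
      lowerBound = 1L,
      upperBound = 1000000L,
      numPartitions = 8,
      connectionProperties = props)

    println(df.rdd.partitions.length) // expect 8, one task per range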

PySpark RDD with NumpyArray Structure

2015-12-06 Thread Mustafa Elbehery
Hi All, I would like to parallelize a Python NumPy array to apply a scikit-learn algorithm on top of Spark. When I call sc.parallelize() I receive an RDD with a different structure. To be more precise, I am trying to have the following: X = [[ 0.49426097 1.45106697] [-1.42808099 -0.83706377] [

spark-shell launch not clean

2015-12-06 Thread Navdeep Kainth
Hello, I am trying to set up Spark with Cassandra. I am able to run some one-liner scripts just fine. Could you please point me to the documentation on getting a clean spark-shell launch, and which configuration file needs to be tweaked? -bash-4.1$ ./bin/spark-shell jars

Inconsistent data in Cassandra

2015-12-06 Thread Priya Ch
Hi All, I have the following scenario in writing rows to Cassandra from Spark Streaming - in a 1 sec batch, I have 3 tickets with the same ticket number (primary key) but with different envelope numbers (i.e. envelope 1, envelope 2, envelope 3). I am writing these messages to Cassandra using

Re: parquet file doubts

2015-12-06 Thread Cheng Lian
cc parquet-dev list (it would be nice to always do so for these general questions.) Cheng On 12/6/15 3:10 PM, Shushant Arora wrote: Hi, I have a few doubts on the Parquet file format. 1. Does Parquet keep min/max statistics like ORC? How can I see the Parquet version (whether it's 1.1, 1.2 or 1.3) for

Intersection of two sets by key - join vs filter + join

2015-12-06 Thread Z Z
I have two RDDs, one really large in size and the other much smaller. I'd like to find all unique tuples in the large RDD with keys from the small RDD. There are duplicate tuples as well and I only care about the distinct tuples. For example, large_rdd = sc.parallelize([('abcdefghij'[i%10], i) for i in

Re: Intersection of two sets by key - join vs filter + join

2015-12-06 Thread Fengdong Yu
Don't do the join first. Broadcast your small RDD: val bc = sc.broadcast(small_rdd) then large_rdd.filter(x.key in bc.value).map( x => { bc.value.other_fields + x }).distinct.groupByKey > On Dec 7, 2015, at 1:41 PM, Z Z wrote: > > I have two RDDs, one really
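
A runnable version of this idea, hedged and with illustrative types, assuming the small RDD's distinct keys fit in driver memory; only the key set is broadcast and the large RDD is filtered and deduplicated on the executors.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // large: RDD[(String, Int)], small: RDD[(String, Int)] (assumption).
    def keepTuplesWithSmallKeys(sc: SparkContext,
                                large: RDD[(String, Int)],
                                small: RDD[(String, Int)]): RDD[(String, Int)] = {
      // Broadcast only the distinct keys of the small RDD, not the whole RDD.
      val keys = sc.broadcast(small.keys.distinct().collect().toSet)
      large.filter { case (k, _) => keys.value.contains(k) }
           .distinct() // only unique tuples are wanted
    }

If even the distinct key set is too big for the driver (as noted later in this thread), the usual fallback is to deduplicate both sides and join on the keys instead.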

Re: parquet file doubts

2015-12-06 Thread Cheng Lian
Oh sorry... At first I meant to cc the spark-user list since Shushant and I had discussed some Spark-related issues before. Then I realized that this is a pure Parquet issue, but forgot to change the cc list. Thanks for pointing this out! Please ignore this thread. Cheng On 12/7/15 12:43

Re: Intersection of two sets by key - join vs filter + join

2015-12-06 Thread Z Z
Thanks Fengdong. small_rdd is still big enough that we can't broadcast it (to each node). It has to be kept distributed. On Sun, Dec 6, 2015 at 9:52 PM, Fengdong Yu wrote: > Don't do the join first. > > Broadcast your small RDD: > > val bc = sc.broadcast(small_rdd) > >

Find all simple paths of a maximum specific length between two nodes of a graph.

2015-12-06 Thread kauarba
Hi Experts, I am new to GraphX. I am trying to find all simple paths of a maximum specific length (say <= r) between two nodes of an undirected graph having a million nodes. Can you please advise me on the complexity of doing such a thing in GraphX? Also, can you please let me know what APIs

Task Time is too high in a single executor in Streaming

2015-12-06 Thread SRK
Hi, In my Streaming job, most of the time seems to be taken by one executor. The shuffle read record count is 713758 in that one particular executor but 0 in the others. I have a groupBy followed by updateStateByKey, flatMap, map, reduceByKey and updateStateByKey operations in that stage. I am suspecting

Re: parquet file doubts

2015-12-06 Thread Ted Yu
Cheng: I only see user@spark in the CC. FYI On Sun, Dec 6, 2015 at 8:01 PM, Cheng Lian wrote: > cc parquet-dev list (it would be nice to always do so for these general > questions.) > > Cheng > > On 12/6/15 3:10 PM, Shushant Arora wrote: > >> Hi >> >> I have a few doubts on

How to get the list of running applications and Cores/Memory in use?

2015-12-06 Thread Haopu Wang
Hi, I have a Spark 1.5.2 standalone cluster running. I want to get all of the running applications and the Cores/Memory in use. Besides the Master UI, is there any other way to do that? I tried to send an HTTP request using a URL like this: "http://node1:6066/v1/applications". The
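
For what it's worth, a hedged sketch of polling the standalone Master's JSON endpoint (the web UI port, typically 8080, rather than the 6066 submission port); the host is hypothetical and the JSON fields can vary between versions.

    import scala.io.Source

    // The standalone Master web UI serves a JSON view of the cluster state,
    // including running applications and used cores/memory, at /json.
    val json = Source.fromURL("http://node1:8080/json").mkString
    println(json)

Each running application additionally exposes the monitoring REST API under http://<driver>:4040/api/v1/applications.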

Re: Sharing object/state accross transformations

2015-12-06 Thread Julian Keppel
Yes, but what they do is only add new elements to a state which is passed as a parameter. But my problem is that my "counter" (the HyperLogLog object) comes from outside and is not passed to the function. So I have to track the state of this "external" HLL object across the whole lifecycle of