streaming+sql with "block has been removed" error

2015-11-05 Thread ZhuGe
Hi all: I am trying to implement the "spark streaming + sql and dataframe" case described in this post: https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html I use RabbitMQ as the data source. My code sample is like this: countByValueAndWindow(Seconds(5),
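
A minimal sketch of the pattern from that blog post, assuming windowedCounts is the DStream[(String, Long)] returned by countByValueAndWindow (the column and table names are illustrative):

  import org.apache.spark.sql.SQLContext

  windowedCounts.foreachRDD { rdd =>
    // obtain a lazily instantiated singleton SQLContext (available since 1.5)
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    val df = rdd.toDF("value", "cnt")
    df.registerTempTable("counts")
    sqlContext.sql("SELECT value, cnt FROM counts ORDER BY cnt DESC").show()
  }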

DataFrame equality does not work in 1.5.1

2015-11-05 Thread 千成徳
Hi All, I have a data frame like this. The equality expression is not working in 1.5.1 but works as expected in 1.4.0. What is the difference? scala> eventDF.printSchema() root |-- id: string (nullable = true) |-- event: string (nullable = true) |-- entityType: string (nullable = true) |--

Re: Scheduling Spark process

2015-11-05 Thread Adrian Tanase
You should also specify how you’re planning to query or “publish” the data. I would consider a combination of: - a spark streaming job that ingests the raw events in real time, validates, pre-processes and saves to stable storage - stable storage could be HDFS/parquet or a database optimized for

converting categorical values in csv file to numerical values

2015-11-05 Thread Balachandar R.A.
Hi, I am new to Spark MLlib and machine learning. I have a csv file that consists of around 100 thousand rows and 20 columns. Of these 20 columns, 10 contain string values. Each value in these columns is not necessarily unique. They are kind of categorical, that is, the values could be one

Re: SparkSQL JDBC to PostGIS

2015-11-05 Thread Mustafa Elbehery
Hi Stefano, Thanks for the prompt reply. Actually I am using *Magellan*, a geospatial library on top of Spark. I know that I can load the data into RDDs or DFs and use them directly. However, for requirement purposes, I am trying to query the data from PostGIS directly. So, as I have mentioned
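
A minimal sketch of reading from PostGIS over JDBC (database, table and column names are made up; assumes the PostgreSQL JDBC driver is on the classpath):

  val gisDF = sqlContext.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/gisdb")
    .option("driver", "org.postgresql.Driver")
    // push the geometry conversion down to PostGIS by wrapping a query as the "table"
    .option("dbtable", "(SELECT id, ST_AsText(geom) AS wkt FROM parcels) AS t")
    .load()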

Scheduling Spark process

2015-11-05 Thread danilo
Hi All, I'm quite new to this topic and to Spark in general. I have a sensor that is pushing data in real time and I need to calculate some KPIs based on the data I have received. Given that some of the KPIs are related to very old data (e.g. the average number of events in the last 3

How to run parallel on each DataFrame group

2015-11-05 Thread patcharee
Hi, I need suggestions on my code. I would like to split a DataFrame (rowDF) by a column (depth) into groups, then sort each group, repartition, and save the output of each group into one file. See the code below: val rowDF = sqlContext.createDataFrame(rowRDD, schema).cache() for (i <- 0 to 16) {
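
A rough sketch of one way to write one file per depth group (the sort column is hypothetical):

  val rowDF = sqlContext.createDataFrame(rowRDD, schema).cache()
  for (depth <- 0 to 16) {
    rowDF.filter(rowDF("depth") === depth)
      .sort("someKey")        // hypothetical column to sort each group by
      .coalesce(1)            // one partition => one output file per group
      .write.parquet(s"/output/depth=$depth")
  }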

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-11-05 Thread Sjoerd Mulder
Hi Reynold, I had version 2.6.1 in my project which was provided by the fine folks from spring-boot-dependencies. Now have overridden it to 2.7.8 :) Sjoerd 2015-11-01 8:22 GMT+01:00 Reynold Xin : > Thanks for reporting it, Sjoerd. You might have a different version of >

Spark task hangs infinitely when accessing S3 from AWS

2015-11-05 Thread aecc
Hi guys, when reading data from S3 on AWS with Spark 1.5.1, one of the tasks hangs in a way that cannot be reproduced reliably. Sometimes it hangs, sometimes it doesn't. This is the thread dump from the hung task: "Executor task launch worker-3" daemon prio=10 tid=0x7f419c023000

Re: Protobuff 3.0 for Spark

2015-11-05 Thread Steve Loughran
> On 5 Nov 2015, at 00:12, Lan Jiang wrote: > > I have used protobuf 3 successfully with Spark on CDH 5.4, even though Hadoop > itself comes with protobuf 2.5. I think the steps apply to HDP too. You need > to do the following Protobuf.jar has been so brittle in the past

Re: Spark standalone: zookeeper timeout configuration

2015-11-05 Thread yueqianzhu
Spark uses Curator (a ZooKeeper client) for high availability. You can configure Curator; it has several retry policies to choose from. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-zookeeper-timeout-configuration-tp25224p25290.html Sent from

Re: converting categorical values in csv file to numerical values

2015-11-05 Thread tog
Hi Bala, Can't you do a simple dictionary and map those values to numbers? Cheers Guillaume On 5 November 2015 at 09:54, Balachandar R.A. wrote: > HI > > I am new to spark MLlib and machine learning. I have a csv file that > consists of around 100 thousand rows and
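
A minimal sketch of the dictionary approach for one column, assuming the CSV rows are parsed into an RDD[Array[String]] named data and column 3 is categorical (both are illustrative assumptions):

  val values = data.map(_(3)).distinct().collect()    // the distinct categorical values
  val dict = sc.broadcast(values.zipWithIndex.toMap)  // value -> numeric index
  val encoded = data.map(row => row.updated(3, dict.value(row(3)).toString))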

Re: JMX with Spark

2015-11-05 Thread Yogesh Vyas
Hi, Please let me elaborate my question so that you will know exactly what I want. I am running a Spark Streaming job. This job counts the number of occurrences of events. Right now I am using a key/value pair RDD which tells me the count of an event, where the key is the event and the value is

very slow parquet file write

2015-11-05 Thread Rok Roskar
I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a parquet file on HDFS. I've got a few hundred nodes in the cluster, so for the size of file this is way over-provisioned (I've tried it with fewer partitions and fewer nodes, no obvious effect). I was expecting the dump to

JMX with Spark

2015-11-05 Thread Yogesh Vyas
Hi, How can we use JMX and JConsole to monitor our Spark applications?

Re: Scheduling Spark process

2015-11-05 Thread Danilo Rizzo
Hi Adrian, yes, your assumption is correct. I'm using HBase for storing the partial calculations. Thank you for the feedback - it is exactly what I had in mind. Thx D On Thu, Nov 5, 2015 at 10:43 AM, Adrian Tanase wrote: > You should also specify how you’re planning to

Re: Spark reading from S3 getting very slow

2015-11-05 Thread Steve Loughran
On 5 Nov 2015, at 02:03, Younes Naguib wrote: Hi all, I'm reading large text files from S3, with sizes between 30GB and 40GB. Every stage runs in 8-9s, except the last 32, which jump to 1-2 min for some reason! Here is my

Fwd: UnresolvedException - lag, window

2015-11-05 Thread Jiří Syrový
Hi, I'm getting the following exception with Spark 1.5.2-rc2 (haven't tried 1.6.0 yet though) when using window function lag: [2015-11-05 10:58:50,806] ERROR xo.builder.jobs.CompareJob [] [akka://JobServer/user/context-supervisor/MYCONTEXT] - Comparison has failed
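
For reference, a minimal lag-over-window sketch (column names are examples; in 1.5.x window functions require a HiveContext):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.lag

  val w = Window.partitionBy("id").orderBy("ts")
  val withPrev = df.withColumn("prev_value", lag("value", 1).over(w))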

Re: JMX with Spark

2015-11-05 Thread Romi Kuntsman
Have you read this? https://spark.apache.org/docs/latest/monitoring.html *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Thu, Nov 5, 2015 at 2:08 PM, Yogesh Vyas wrote: > Hi, > How we can use JMX and JConsole to monitor our Spark applications? > >
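
For JMX/JConsole specifically, the metrics system described on that page can be pointed at the JmxSink; a minimal conf/metrics.properties entry (assuming the default setup) would be roughly:

  *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

After that, JConsole can attach to the driver and executor JVMs and browse the Spark metrics MBeans.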

Re: Spark sql jdbc fails for Oracle NUMBER type columns

2015-11-05 Thread Richard Hillegas
Or you may be referring to https://issues.apache.org/jira/browse/SPARK-10648. That issue has a couple of pull requests, but I think that the limited bandwidth of the committers still applies. Thanks, Rick Richard Hillegas/San Francisco/IBM@IBMUS wrote on 11/05/2015 09:16:42 AM: > From: Richard

Re: Rule Engine for Spark

2015-11-05 Thread Buntu Dev
You may want to read this post regarding Spark with Drools: http://blog.cloudera.com/blog/2015/11/how-to-build-a-complex-event-processing-app-on-apache-spark-and-drools/ On Wed, Nov 4, 2015 at 8:05 PM, Daniel Mahler wrote: > I am not familiar with any rule engines on Spark

Spark EC2 script on Large clusters

2015-11-05 Thread Christian
For starters, thanks for the awesome product! When creating ec2 clusters of 20-40 nodes, things work great. When we create larger clusters with the provided spark-ec2 script, it takes hours. Creating a 200 node cluster takes 2 1/2 hours, and a 500 node cluster takes over 5 hours. One

101 question on external metastore

2015-11-05 Thread Yana Kadiyska
Hi folks, trying to experiment with a minimal external metastore. I am following the instructions here: https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode I grabbed Derby 10.12.1.1 and started an instance, verified I can connect via the ij tool and that the process is listening on 1527

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Shivaram Venkataraman
It is a known limitation that spark-ec2 is very slow for large clusters and as you mention most of this is due to the use of rsync to transfer things from the master to all the slaves. Nick cc'd has been working on an alternative approach at https://github.com/nchammas/flintrock that is more

Re: How to unpersist a DStream in Spark Streaming

2015-11-05 Thread swetha kasireddy
It's just that in the same thread, for a particular RDD, I need to uncache it every 2 minutes to clear out the data that is present in a Map inside it. On Wed, Nov 4, 2015 at 11:54 PM, Saisai Shao wrote: > Hi Swetha, > > Would you mind elaborating your usage scenario of

Re: Subtract on rdd2 is throwing below exception

2015-11-05 Thread Yana Kadiyska
subtract is not the issue. Spark is lazy, so a lot of the time you'd have many, many lines of code which do not in fact run until you perform some action (in your case, subtract). As you can see from the stacktrace, the NPE is from joda, which is used in the partitioner (I'm suspecting in Cassandra). But

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
I am not looking for Spark Sql specifically. My usecase is that I need to save an RDD as a parquet file in hdfs at the end of a batch and load it back and convert it into an RDD in the next batch. The RDD has a String and a Long as the key/value pairs. On Wed, Nov 4, 2015 at 11:52 PM, Stefano

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Nicholas Chammas
Yeah, as Shivaram mentioned, this issue is well-known. It's documented in SPARK-5189 and a bunch of related issues. Unfortunately, it's hard to resolve this issue in spark-ec2 without rewriting large parts of the project. But if you take a crack

Spark using Yarn timelineserver - High CPU usage

2015-11-05 Thread Krzysztof Zarzycki
Hi there, I have a serious problem in my Hadoop cluster, that YARN Timeline server generates very high load, 800% CPU when there are 8 Spark Streaming jobs running in parallel. I discuss this problem on Hadoop group in parallel:

Re: Spark sql jdbc fails for Oracle NUMBER type columns

2015-11-05 Thread Richard Hillegas
Hi Rajesh, I think that you may be referring to https://issues.apache.org/jira/browse/SPARK-10909. A pull request on that issue was submitted more than a month ago but it has not been committed. I think that the committers are busy working on issues which were targeted for 1.6 and I doubt that

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Christian
Let me rephrase. Emr cost is about twice as much as the spot price, making it almost 2/3 of the overall cost. On Thu, Nov 5, 2015 at 11:50 AM Christian wrote: > Hi Johnathan, > > We are using EMR now and it's costing way too much. We do spot pricing and > the emr addon cost is

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
How to convert a parquet file that is saved in hdfs to an RDD after reading the file from hdfs? On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman wrote: > Hi, > we are using avro with compression(snappy). As soon as you have enough > partitions, the saving won't be a problem

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
Hi, we are using Avro with compression (snappy). As soon as you have enough partitions, the saving won't be a problem imho. In general HDFS is pretty fast, S3 less so. The issue with storing data is that you will lose your partitioner (even though the RDD has it) at loading time. There is a PR that

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Christian
Hi Johnathan, We are using EMR now and it's costing way too much. We do spot pricing and the emr addon cost is about 2/3 the price of the actual spot instance. On Thu, Nov 5, 2015 at 11:31 AM Jonathan Kelly wrote: > Christian, > > Is there anything preventing you from

Re: Building scaladoc using "build/sbt unidoc" failure

2015-11-05 Thread vectorijk
Justin, I encountered the same failure. Anyone have any idea? An alternative is to generate the Scaladoc via IntelliJ: Tools > Generate Scaladoc. Best, Kai. -- View this message in context:

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Jonathan Kelly
Christian, Is there anything preventing you from using EMR, which will manage your cluster for you? Creating large clusters would take mins on EMR instead of hours. Also, EMR supports growing your cluster easily and recently added support for shrinking your cluster gracefully (even while jobs are

Re: Guava ClassLoading Issue When Using Different Hive Metastore Version

2015-11-05 Thread Marcelo Vanzin
On Thu, Nov 5, 2015 at 3:41 PM, Joey Paskhay wrote: > We verified the Guava libraries are in the huge list of the included jars, > but we saw that in the > org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method > it seems to assume that *all*

Spark Analytics

2015-11-05 Thread Andrés Ivaldi
Hello, I'm a newbie in the Spark world. My team and I are analyzing Spark as an integration framework between different sources. So far so good, but it becomes slow when aggregations and calculations are applied to the RDD. I'm using Spark standalone under Windows. I'm running this example: -

Re: Spark SQL "SELECT ... LIMIT" scans the entire Hive table?

2015-11-05 Thread Jon Gregg
Here's my code: my_data = sqlCtx.sql("SELECT * FROM raw.site_activity_data LIMIT 2") my_data.collect() raw.site_activity_data is a Hive external table atop daily-partitioned .gzip data. When I execute the command I start seeing many of these pop up in the logs (below is a small subset)

Spark Slave always fails to connect to master

2015-11-05 Thread أنس الليثي
I am trying to install a standalone Spark cluster. I prepared 3 virtual machines, each with Ubuntu installed. The three machines form a cluster with one master and two slaves. I followed the steps in the documentation of Apache Spark. I started the master script from the master node and it

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
OK. I found the following code that does that. def readParquetRDD[T <% SpecificRecord](sc: SparkContext, parquetFile: String)(implicit tag: ClassTag[T]): RDD[T] = { val jobConf = new JobConf(sc.hadoopConfiguration) ParquetInputFormat.setReadSupportClass(jobConf, classOf[AvroReadSupport[T]])
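
The remainder of that helper typically looks roughly like the sketch below (hedged: package names follow parquet-mr 1.7+; older releases use the parquet.* packages instead):

  import org.apache.avro.specific.SpecificRecord
  import org.apache.hadoop.mapred.JobConf
  import org.apache.parquet.avro.AvroReadSupport
  import org.apache.parquet.hadoop.ParquetInputFormat
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  def readParquetRDD[T <% SpecificRecord](sc: SparkContext, parquetFile: String)(implicit tag: ClassTag[T]): RDD[T] = {
    val jobConf = new JobConf(sc.hadoopConfiguration)
    // tell Parquet to materialize records through the Avro read support
    ParquetInputFormat.setReadSupportClass(jobConf, classOf[AvroReadSupport[T]])
    sc.newAPIHadoopFile(
      parquetFile,
      classOf[ParquetInputFormat[T]],
      classOf[Void],
      tag.runtimeClass.asInstanceOf[Class[T]],
      jobConf
    ).map(_._2)   // keep only the values; the keys are Void
  }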

very slow parquet file write

2015-11-05 Thread rok
Apologies if this appears a second time! I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a parquet file on HDFS. I've got a few hundred nodes in the cluster, so for the size of file this is way over-provisioned (I've tried it with fewer partitions and fewer nodes, no

Guava ClassLoading Issue When Using Different Hive Metastore Version

2015-11-05 Thread Joey Paskhay
Hey there, We believe we may have run into a class loading bug w/ Guava libraries when trying to configure a different version of the Hive metastore. Saw another user ran into this too -- email on this list with subject "Troubles interacting with different version of Hive metastore". We figured

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
Java/Scala? I think everything is in the DataFrames tutorial, e.g. if you have a DataFrame and are working from Java - toJavaRDD() On 5 November 2015 at 21:13, swetha kasireddy
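
In Scala the same idea is a short sketch (the path and column types are examples): read the Parquet file back as a DataFrame, then drop down to an RDD:

  val df = sqlContext.read.parquet("hdfs:///path/to/output.parquet")
  val pairs = df.map(r => (r.getString(0), r.getLong(1)))   // RDD[(String, Long)] in Spark 1.x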

Re: Spark SQL supports operating on a thrift data sources

2015-11-05 Thread Michael Armbrust
This would make an awesome spark-package. I'd suggest looking at spark-avro as an example: https://github.com/databricks/spark-avro On Thu, Nov 5, 2015 at 11:21 AM, Jaydeep Vishwakarma < jaydeep.vishwaka...@inmobi.com> wrote: > Hi, > > I want to load thrift serialised data through sqlcontext and

Kinesis connection timeout setting on Spark Streaming Kinesis ASL

2015-11-05 Thread Hster Geguri
Is there any way to set the underlying AWS client connection socket timeout for the kinesis requests made in the spark-streaming-kinesis-asl? Currently we get socket timeouts which appear to default to about 120 seconds on driver restarts causing all kinds of backup. We'd like to shorten it to 10

Spark Dynamic Partitioning Bug

2015-11-05 Thread Bryan Jeffrey
Hello. I have come across some odd behavior when writing to persistent Hive tables in Spark using dynamic partitioning. Basically, I create a table. Using Spark Streaming, I create counts of events by ID and source. For each RDD I create a temporary table. I then select from the temporary table

Re: DataFrame equality does not work in 1.5.1

2015-11-05 Thread Yin Huai
Can you attach the result of eventDF.filter($"entityType" === "user").select("entityId").distinct.explain(true)? Thanks, Yin On Thu, Nov 5, 2015 at 1:12 AM, 千成徳 wrote: > Hi All, > > I have data frame like this. > > Equality expression is not working in 1.5.1 but, works as

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
No, Scala. Suppose I read the Parquet file as shown in the following. How would that be converted to an RDD to use in my Spark batch? I use core Spark; I don't use Spark SQL. ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[AminoAcid]]) val file =

"Master: got disassociated, removing it."

2015-11-05 Thread Khaled Ammar
Hi, I am using GraphX in standalone Spark 1.5.1 on a medium size cluster (64+1). I could execute PageRank with a large number of iterations on this cluster. However, when I run SSSP, it always fails at iteration 23 or 24, always after about 11 minutes. Note that PageRank takes more than

Re: Failed to save RDD as text file to local file system

2015-11-05 Thread Hitoshi Ozawa
I'm a little bit late but posting in case somebody googles this. It seems saveAsTextFile requires chmod 777, but the local directory won't by default give write permission to other users. I've tried saving to a mounted drive and was able to save without an error. Without the "file" scheme, it won't save to the

Re: Spark RDD cache persistence

2015-11-05 Thread Christian
I've never had this need and I've never done it. There are options that allow this. For example, I know there are web apps out there that work like the Spark REPL. One of these, I think, is called Zeppelin. I've never used them, but I've seen them demoed. There is also Tachyon, which Spark supports.

Unable to register UDF with StructType

2015-11-05 Thread Rishabh Bhardwaj
Hi all, I am unable to register a UDF with return type as StructType: scala> def test(r: StructType): StructType = { r } test: (r: org.apache.spark.sql.types.StructType)org.apache.spark.sql.types.StructType scala> sqlContext.udf.register("test", test _) scala.MatchError:

Re: Unable to register UDF with StructType

2015-11-05 Thread Michael Armbrust
You are returning the type StructType, not an instance of a struct (i.e. StringType instead of "string"). If you'd like to return a struct you should return a case class. case class StringInfo(numChars: Int, firstLetter: String) udf((s: String) => StringInfo(s.size, s.head.toString)) If you'd like to
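
A self-contained sketch of that suggestion (the UDF name, table and column are illustrative):

  case class StringInfo(numChars: Int, firstLetter: String)

  sqlContext.udf.register("stringInfo", (s: String) => StringInfo(s.length, s.take(1)))
  sqlContext.sql("SELECT stringInfo(name) AS info FROM people").printSchema()
  // info comes back as a struct<numChars:int,firstLetter:string>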

Re: converting categorical values in csv file to numerical values

2015-11-05 Thread tog
If your corpus is large (NLP) this is indeed the best solution; otherwise (few words, i.e. categories) I guess you will end up with the same result. On Friday, 6 November 2015, Balachandar R.A. wrote: > Hi Guillaume, > > > This is always an option. However, I read about

Spark RDD cache persistence

2015-11-05 Thread Deepak Sharma
Hi All, I am confused about RDD persistence in cache. If I cache an RDD, is it going to stay in memory even after the Spark program that created it completes execution? If not, how can I guarantee that the RDD is persisted in cache even after the program finishes execution? Thanks Deepak

Re: Re: Spark RDD cache persistence

2015-11-05 Thread r7raul1...@163.com
You can try http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Archival_Storage_SSD__Memory . Hive tmp tables use this feature to speed up jobs: https://issues.apache.org/jira/browse/HIVE-7313 r7raul1...@163.com From: Christian Date: 2015-11-06 13:50

Re: Re: Spark RDD cache persistence

2015-11-05 Thread Deenar Toraskar
You can have a long running Spark context in several fashions. This will ensure your data will be cached in memory. Clients will access the RDD through a REST API that you can expose. See the Spark Job Server, it does something similar. It has something called Named RDDs Using Named RDDs Named

Re: How to use data from Database and reload every hour

2015-11-05 Thread Sabarish Sasidharan
Theoretically the executor is a long-lived container. So you could use some simple caching library or a simple singleton to cache the data in your executors, once they load it from MySQL. But note that with lots of executors you might choke your MySQL. Regards Sab On 05-Nov-2015 7:03 pm, "Kay-Uwe
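
A rough sketch of the singleton idea (table, column and URL names are made up; assumes the MySQL JDBC driver is on the executor classpath):

  import java.sql.DriverManager

  object ValidationRules {
    @volatile private var cache: Map[String, String] = null

    def get(jdbcUrl: String): Map[String, String] = {
      if (cache == null) synchronized {
        if (cache == null) {
          val conn = DriverManager.getConnection(jdbcUrl)
          try {
            val rs = conn.createStatement().executeQuery(
              "SELECT rule_key, rule_value FROM validation_rules")
            val buf = scala.collection.mutable.Map[String, String]()
            while (rs.next()) buf += rs.getString(1) -> rs.getString(2)
            cache = buf.toMap   // loaded once per executor JVM
          } finally conn.close()
        }
      }
      cache
    }
  }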

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Jerry Lam
Does Qubole use YARN or Mesos for resource management? Sent from my iPhone > On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan wrote: > > Qubole

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Sabarish Sasidharan
Qubole uses yarn. Regards Sab On 06-Nov-2015 8:31 am, "Jerry Lam" wrote: > Does Qubole use Yarn or Mesos for resource management? > > Sent from my iPhone > > > On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan < > sabarish.sasidha...@manthan.com> wrote: > > > > Qubole >

Re: How to lookup by a key in an RDD

2015-11-05 Thread swetha kasireddy
I read about IndexedRDD. Is joining an IndexedRDD with another RDD that is not an IndexedRDD efficient? On Mon, Nov 2, 2015 at 9:56 PM, Deenar Toraskar wrote: > Swetha > > Currently IndexedRDD is an external library and not part of Spark Core. > You can use it by

Re: Spark sql jdbc fails for Oracle NUMBER type columns

2015-11-05 Thread Madabhattula Rajesh Kumar
Hi Richard, Thank you for the updates. Do you know the tentative timeline for the 1.6 release? Meanwhile, is there any workaround for this issue? Regards, Rajesh On Thu, Nov 5, 2015 at 10:57 PM, Richard Hillegas wrote: > Or you may be referring to >

Re: Spark RDD cache persistence

2015-11-05 Thread Christian
The cache gets cleared out when the job finishes. I am not aware of a way to keep the cache around between jobs. You could save it as an object file to disk and load it as an object file on your next job for speed. On Thu, Nov 5, 2015 at 6:17 PM Deepak Sharma wrote: > Hi
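
A minimal sketch of the object-file approach (the path and element type are examples):

  rdd.saveAsObjectFile("hdfs:///tmp/cached-rdd")                         // end of the first job
  val reloaded = sc.objectFile[(String, Long)]("hdfs:///tmp/cached-rdd") // start of the next job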

Re: Spark Streaming data checkpoint performance

2015-11-05 Thread Thúy Hằng Lê
Thanks all, it would be great to have this feature soon. Do you know the release plan for 1.6? In addition to this, I still have a checkpoint performance problem. My code is as simple as this: JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

Re: Spark RDD cache persistence

2015-11-05 Thread Deepak Sharma
Thanks Christian. So is there any inbuilt mechanism in Spark, or API integration with other in-memory cache products such as Redis, to load the RDD into these systems upon program exit? What's the best approach to have a long-lived RDD cache? Thanks Deepak On 6 Nov 2015 8:34 am, "Christian"

Re: converting categorical values in csv file to numerical values

2015-11-05 Thread Balachandar R.A.
Hi Guillaume, This is always an option. However, I read about HashingTF, which does exactly this quite efficiently and can scale too. Hence, I am looking for a solution using this technique. regards Bala On 5 November 2015 at 18:50, tog wrote: > Hi Bala > > Can't you
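
A minimal HashingTF sketch, assuming rows is an RDD[Array[String]] holding the 10 categorical columns (an illustrative assumption):

  import org.apache.spark.mllib.feature.HashingTF

  val tf = new HashingTF(numFeatures = 1 << 10)
  val numeric = rows.map(cols => cols.map(v => tf.indexOf(v).toDouble))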

Re: Guava ClassLoading Issue When Using Different Hive Metastore Version

2015-11-05 Thread Michael Armbrust
I would be in favor of limiting the scope here. The problem you might run into is that FinalizableReferenceQueue uses the

Re: DataFrame equality does not work in 1.5.1

2015-11-05 Thread Seongduk Cheon
Hi, Yin Thanks for your time. This is the result. -- scala> eventDF.filter($"entityType" === "user").select("entityId").distinct.explain(true) == Parsed Logical Plan == Aggregate [entityId#16], [entityId#16] Project [entityId#16] Filter (entityType#15 = user) Project [_1#0

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Sabarish Sasidharan
Qubole is one option where you can use spots and get a couple other benefits. We use Qubole at Manthan for our Spark workloads. For ensuring all the nodes are ready, you could use yarn.minregisteredresourcesratio config property to ensure the execution doesn't start till the requisite containers

Re: How to unpersist a DStream in Spark Streaming

2015-11-05 Thread Tathagata Das
Spark Streaming automatically takes care of unpersisting any RDDs generated by a DStream. You can call StreamingContext.remember() to set the minimum persistence duration; any persisted RDD older than that will be automatically unpersisted. On Thu, Nov 5, 2015 at 9:12 AM, swetha kasireddy
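
A minimal sketch, assuming ssc is the StreamingContext:

  import org.apache.spark.streaming.Minutes
  ssc.remember(Minutes(2))   // keep generated RDDs for at least 2 minutes before unpersisting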

Re: Dump table into file

2015-11-05 Thread Hitoshi Ozawa
Have you tried using spark-csv? https://github.com/databricks/spark-csv e.g. hiveSQLContext.sql("FROM employee SELECT name, city, state").write.format("com.databricks.spark.csv").save("employee.csv") -- View this message in context:

cartesian in the loop, runtime grows

2015-11-05 Thread Faerman Evgeniy
Hi All, I have a problem with the cartesian product. I build the cartesian product of RDDs in a loop and update one of the variables in each iteration. At the end of the iteration the variable is squeezed back to its original size. Therefore, I expect the same running time for each iteration, because the result of the cartesian

Re: How to use data from Database and reload every hour

2015-11-05 Thread Adrian Tanase
You should look at .transform – it’s a powerful transformation (sic) that allows you to dynamically load resources and it gets executed in every micro batch. Re-broadcasting something should be possible from inside transform as that code is executed on the driver but it’s still a controversial
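
A rough sketch of the transform approach (loadRules and extractKey are hypothetical helpers; events is the input DStream):

  val enriched = events.transform { rdd =>
    // transform's closure runs on the driver once per micro-batch, so the rules can be re-read here
    val rules = loadRules()   // e.g. a small Map read from MySQL; shipped to executors with the closure
    rdd.filter(e => rules.contains(extractKey(e)))
  }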

Subtract on rdd2 is throwing below exception

2015-11-05 Thread Priya Ch
Hi All, I am seeing an exception when trying to subtract 2 RDDs. Let's say rdd1 has messages like - * pnr, bookingId, BookingObject* 101, 1, BookingObject1 // - event number is 0 102, 1, BookingObject2 // - event number is 0 103, 2,

RE: JMX with Spark

2015-11-05 Thread Liu shen
Hi, This article may help you: expose your counter through an Akka actor. https://tersesystems.com/2014/08/19/exposing-akka-actor-state-with-jmx/ Sent from Mail for Windows 10 From: Yogesh Vyas Sent: 5 November 2015, 21:21 To: Romi Kuntsman Cc: user@spark.apache.org Subject: Re: JMX with Spark Hi,

Spark sql jdbc fails for Oracle NUMBER type columns

2015-11-05 Thread Madabhattula Rajesh Kumar
Hi, Is this issue fixed in 1.5.1 version? Regards, Rajesh

How to use data from Database and reload every hour

2015-11-05 Thread Kay-Uwe Moosheimer
I have the following problem. We have MySQL and a Spark cluster. We need to load 5 different sets of validation instructions (several thousand entries each) and use this information on the executors to decide whether content from Kafka streaming is for process A or B. The streaming data from Kafka are JSON