spark streaming application - deployment best practices

2016-06-15 Thread vimal dinakaran
Hi All, I am using spark-submit cluster mode deployment for my application to run it in production. But this requires having the jars at the same path on all the nodes, and also the config file which is passed as an argument at the same path. I am running spark in standalone mode and I

Re: Spark SQL driver memory keeps rising

2016-06-15 Thread Mich Talebzadeh
You will need to be more specific about how you are using these parameters. Have you looked at the Spark web UI (default port 4040) to see the jobs and stages? The amount of shuffle will also be shown there. It also helps if you run jps on the OS and send the output of ps aux|grep <PID> as well. What sort of

How to deal with tasks running too long?

2016-06-15 Thread Utkarsh Sengar
This SO question was asked about 1yr ago. http://stackoverflow.com/questions/31799755/how-to-deal-with-tasks-running-too-long-comparing-to-others-in-job-in-yarn-cli I answered this question with a suggestion to try speculation but it doesn't quite do what the OP expects. I have been running into

Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-15 Thread Cassa L
Hi, I did set --driver-memory 4G. I still run into this issue after 1 hour of data load. I also tried version 1.6 in test environment. I hit this issue much faster than in 1.5.1 setup. LCassa On Tue, Jun 14, 2016 at 3:57 PM, Gaurav Bhatnagar wrote: > try setting the

Re: Is that possible to feed web request via spark application directly?

2016-06-15 Thread Peyman Mohajerian
There are a variety of REST API services you can use, but you must consider carefully whether it makes sense to start a Spark job based on individual requests, unless you mean based on some triggering event you want to start a Spark job, in which case it makes sense to use the RESTful service.

Re: Adding h5 files in a zip to use with PySpark

2016-06-15 Thread Ashwin Raaghav
Thanks! That worked! :) And to read the files, I used pyspark.SparkFiles module. On Thu, Jun 16, 2016 at 7:12 AM, Sun Rui wrote: > have you tried > --files ? > > On Jun 15, 2016, at 18:50, ar7 wrote: > > > > I am using PySpark 1.6.1 for my spark
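
For reference, the SparkFiles lookup mentioned above has a direct Scala counterpart; a minimal sketch, assuming the job was submitted with --files my_model.h5 (file name hypothetical):

    import org.apache.spark.SparkFiles

    // Files shipped with --files are copied to each node; SparkFiles.get resolves
    // the local path of that copy at runtime (file name is hypothetical).
    val localPath = SparkFiles.get("my_model.h5")
    // localPath can now be handed to any library that expects a plain filesystem path.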

Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Chanh Le
Hi everyone, I added more logs for my use case: when I cached all my data (500 million records) and ran count, I received this. 16/06/16 10:09:25 ERROR TaskSetManager: Total size of serialized results of 27 tasks (1876.7 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) >>> that's weird because I

Is that possible to feed web request via spark application directly?

2016-06-15 Thread Yu Wei
Hi, I'm learning spark recently. I have one question about spark. Is it possible to feed web requests via spark application directly? Is there any library to be used? Or do I need to write the results from spark to HDFS/HBase? Is one spark application only to be designed to implement one

Re: HIVE Query 25x faster than SPARK Query

2016-06-15 Thread Gourav Sengupta
Hi Mahender, please ensure that you are enabling the broadcast method for dimension tables. You should be able to see surprising gains, around 12x. Overall I think that SPARK cannot figure out whether to scan all the columns in a table or just the ones that are being used, which causes this issue. When you
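
As a rough illustration of the broadcast suggestion above (not code from the thread itself), a small dimension table can be broadcast explicitly in Spark SQL; table and column names here are hypothetical:

    import org.apache.spark.sql.functions.broadcast

    // Map-side (broadcast) join: the small dimension table is shipped to every
    // executor instead of being shuffled along with the large fact table.
    val joined = factDF.join(broadcast(dimDF), Seq("dim_key"))

Spark also broadcasts automatically for tables below spark.sql.autoBroadcastJoinThreshold (10 MB by default).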

RE: Spark SQL driver memory keeps rising

2016-06-15 Thread Mohammed Guller
It would be hard to guess what could be going on without looking at the code. It looks like the driver program goes into a long stop-the-world GC pause. This should not happen on the machine running the driver program if all that you are doing is reading data from HDFS, performing a bunch of

Re: streaming example has error

2016-06-15 Thread Lee Ho Yeung
got another error StreamingContext: Error starting the context, marking it as stopped /home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 import org.apache.spark._ import org.apache.spark.streaming._ import

Re: GraphX performance and settings

2016-06-15 Thread Deepak Goel
I am not an expert but some thoughts inline On Jun 16, 2016 6:31 AM, "Maja Kabiljo" wrote: > > Hi, > > We are running some experiments with GraphX in order to compare it with other systems. There are multiple settings which significantly affect performance, and we

Re: java server error - spark

2016-06-15 Thread spR
hey, Thanks. Now it worked.. :) On Wed, Jun 15, 2016 at 6:59 PM, Jeff Zhang wrote: > Then the only solution is to increase your driver memory but still > restricted by your machine's memory. "--driver-memory" > > On Thu, Jun 16, 2016 at 9:53 AM, spR

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Then the only solution is to increase your driver memory but still restricted by your machine's memory. "--driver-memory" On Thu, Jun 16, 2016 at 9:53 AM, spR wrote: > Hey, > > But I just have one machine. I am running everything on my laptop. Won't I > be able to do

Re: java server error - spark

2016-06-15 Thread spR
Hey, But I just have one machine. I am running everything on my laptop. Won't I be able to do this processing in local mode then? Regards, Tejaswini On Wed, Jun 15, 2016 at 6:32 PM, Jeff Zhang wrote: > You are using local mode, --executor-memory won't take effect for local

Re: Adding h5 files in a zip to use with PySpark

2016-06-15 Thread Sun Rui
have you tried --files ? > On Jun 15, 2016, at 18:50, ar7 wrote: > > I am using PySpark 1.6.1 for my spark application. I have additional modules > which I am loading using the argument --py-files. I also have a h5 file > which I need to access from one of the modules for

RE: concat spark dataframes

2016-06-15 Thread Mohammed Guller
Top of my head, I can think of the zip operation that RDD provides. So for example, if you have two DataFrames df1 and df2, you could do something like this: val newDF = df1.rdd.zip(df2.rdd).map { case(rowFromDf1, rowFromDf2) => ()}.toDF(...) Couple of things to keep in mind: 1)
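Filling in the sketch above (one possible reading of it, not the author's exact code), assuming df1, df2 and a sqlContext already exist:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.StructType

    // zip requires both RDDs to have the same number of partitions and the same
    // number of elements per partition.
    val mergedSchema = StructType(df1.schema.fields ++ df2.schema.fields)
    val mergedRows = df1.rdd.zip(df2.rdd).map {
      case (rowFromDf1, rowFromDf2) => Row.fromSeq(rowFromDf1.toSeq ++ rowFromDf2.toSeq)
    }
    val newDF = sqlContext.createDataFrame(mergedRows, mergedSchema)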

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
You are using local mode, --executor-memory won't take effect for local mode, please use other cluster mode. On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang wrote: > Specify --executor-memory in your spark-submit command. > > > > On Thu, Jun 16, 2016 at 9:01 AM, spR

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Specify --executor-memory in your spark-submit command. On Thu, Jun 16, 2016 at 9:01 AM, spR wrote: > Thank you. Can you pls tell How to increase the executor memory? > > > > On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang wrote: > >> >>> Caused by:

Re: java server error - spark

2016-06-15 Thread spR
hey, I did this in my notebook. But still I get the same error. Is this the right way to do it? from pyspark import SparkConf conf = (SparkConf() .setMaster("local[4]") .setAppName("My app") .set("spark.executor.memory", "12g")) sc.conf = conf On Wed, Jun 15, 2016 at
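
For what it's worth, a hedged sketch (in Scala) of why the snippet above has no effect: the configuration is read once, when the context is created, so reassigning a conf on a live context does nothing.

    import org.apache.spark.{SparkConf, SparkContext}

    // The conf must exist before the context. In local mode there is no separate
    // executor JVM, so the relevant knob is the driver heap, which normally has to
    // be passed to spark-submit as --driver-memory before the JVM starts rather
    // than set here in code.
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("My app")
    val sc = new SparkContext(conf)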

Re: java server error - spark

2016-06-15 Thread spR
Thank you. Can you pls tell How to increase the executor memory? On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang wrote: > >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded > > > It is OOM on the executor. Please try to increase executor memory. >

GraphX performance and settings

2016-06-15 Thread Maja Kabiljo
Hi, We are running some experiments with GraphX in order to compare it with other systems. There are multiple settings which significantly affect performance, and we experimented a lot in order to tune them well. I'll share here what are the best we found so far and which results we got with

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
>>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded It is OOM on the executor. Please try to increase executor memory. "--executor-memory" On Thu, Jun 16, 2016 at 8:54 AM, spR wrote: > Hey, > > error trace - > > hey, > > > error trace - > > >

Re: java server error - spark

2016-06-15 Thread spR
Hey, error trace - Py4JJavaError Traceback (most recent call last) in () ----> 1 temp.take(2)

Re: Hive 1.0.0 not able to read Spark 1.6.1 parquet output files on EMR 4.7.0

2016-06-15 Thread Cheng Lian
Spark 1.6.1 is also using 1.7.0. Could you please share the schema of your Parquet file as well as the exact exception stack trace reported by Hive? Cheng On 6/13/16 12:56 AM, mayankshete wrote: Hello Team , I am facing an issue where output files generated by Spark 1.6.1 are not read by

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Could you paste the full stacktrace ? On Thu, Jun 16, 2016 at 7:24 AM, spR wrote: > Hi, > I am getting this error while executing a query using sqlcontext.sql > > The table has around 2.5 gb of data to be scanned. > > First I get out of memory exception. But I have 16 gb

Re: ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks

2016-06-15 Thread VG
Any suggestions on this please On Wed, Jun 15, 2016 at 10:42 PM, VG wrote: > I have a very simple driver which loads a textFile and filters a >> sub-string from each line in the textfile. >> When the collect action is executed , I am getting an exception. (The >> file is

Error Running SparkPi.scala Example

2016-06-15 Thread Krishna Kalyan
Hello, I am faced with problems when I try to run SparkPi.scala. I took the following steps below: a) git pull https://github.com/apache/spark b) Import the project in Intellij as a maven project c) Run 'SparkPi' Error Below: Information:16/06/16 01:34 - Compilation completed with 10 errors and 5

Re: Limit pyspark.daemon threads

2016-06-15 Thread Jeff Zhang
>>> I am seeing this issue too with pyspark (Using Spark 1.6.1). I have set spark.executor.cores to 1, but I see that whenever streaming batch starts processing data, see python -m pyspark.daemon processes increase gradually to about 5, (increasing CPU% on a box about 4-5 times, each

java server error - spark

2016-06-15 Thread spR
Hi, I am getting this error while executing a query using sqlcontext.sql. The table has around 2.5 GB of data to be scanned. First I get an out of memory exception, but I have 16 GB of RAM. Then my notebook dies and I get the error below: Py4JNetworkError: An error occurred while trying to connect to

unsubscribe

2016-06-15 Thread Sanjeev Sagar
unsubscribe

[ANNOUNCE] Apache SystemML 0.10.0-incubating released

2016-06-15 Thread Luciano Resende
The Apache SystemML team is pleased to announce the release of Apache SystemML version 0.10.0-incubating. Apache SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from

Re: concat spark dataframes

2016-06-15 Thread spR
Hey, There are quite a lot of fields. But, there are no common fields between the 2 dataframes. Can I not concatenate the 2 frames like we can do in pandas such that the resulting dataframe has columns from both the dataframes? Thank you. Regards, Misha On Wed, Jun 15, 2016 at 3:44 PM,

RE: Spark 2.0 release date

2016-06-15 Thread Mohammed Guller
Andy – instead of Naïve Bayes, you should have used the Multi-layer Perceptron classifier ☺ Mohammed Author: Big Data Analytics with Spark From: andy petrella [mailto:andy.petre...@gmail.com] Sent: Wednesday, June 15,

Re: Limit pyspark.daemon threads

2016-06-15 Thread Sudhir Babu Pothineni
Hi Ken, It may be also related to Grid Engine job scheduling? If it is 16 core (virtual cores?), grid engine allocates 16 slots, If you use 'max' scheduling, it will send 16 processes sequentially to same machine, on the top of it each spark job has its own executors. Limit the number of jobs

RE: concat spark dataframes

2016-06-15 Thread Mohammed Guller
Hi Misha, What is the schema for both the DataFrames? And what is the expected schema of the resulting DataFrame? Mohammed Author: Big Data Analytics with Spark From: Natu Lauchande [mailto:nlaucha...@gmail.com] Sent:

Re: HIVE Query 25x faster than SPARK Query

2016-06-15 Thread Mahender Sarangam
+1. We also see performance degradation when comparing Spark SQL with Hive. We have a table of 260 columns and have executed the query in both Hive and Spark. In Hive it takes 66 sec for 1 GB of data, whereas in Spark it takes 4 minutes. On 6/9/2016 3:19 PM, Gavin Yue wrote: Could you print out the

Re: Limit pyspark.daemon threads

2016-06-15 Thread agateaaa
Yes have set spark.cores.max to 3. I have three worker nodes in my spark cluster (standalone mode), and spark.executor.cores is set to 1. On each worker node whenever the streaming application runs, I see 4 pyspark.daemon processes get spawned. Each pyspark.daemon process takes up approx 1 CPU

Re: processing 50 gb data using just one machine

2016-06-15 Thread Mich Talebzadeh
50 GB of data is not much. Besides master local[4], what other parameters do you have? ${SPARK_HOME}/bin/spark-submit \ --driver-memory 4G \ --num-executors 1 \ --executor-memory 4G \ --master local[4] \ Try running it

Re: What is the interpretation of Cores in Spark doc

2016-06-15 Thread Mich Talebzadeh
I think it is slightly more than that. These days software is licensed by core (generally speaking). That is the physical processor. * A core may have one or more threads - or logical processors*. Virtualization adds some fun to the mix. Generally what they present is ‘virtual processors’.

Re: choice of RDD function

2016-06-15 Thread Cody Koeninger
Doesn't that result in consuming each RDD twice, in order to infer the json schema? On Wed, Jun 15, 2016 at 11:19 AM, Sivakumaran S wrote: > Of course :) > > object sparkStreaming { > def main(args: Array[String]) { > StreamingExamples.setStreamingLogLevels() //Set
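
One common way around the double pass Cody is pointing at is to hand read.json an explicit schema, so it never has to scan the RDD to infer one. A sketch, assuming messages is the Kafka direct stream from the code quoted in this thread; the field and table names are hypothetical:

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("symbol", StringType),
      StructField("price", DoubleType)))

    messages.map(_._2).foreachRDD { rdd =>
      // With a schema supplied up front, the JSON is parsed in a single pass.
      val df = sqlContext.read.schema(schema).json(rdd)
      df.registerTempTable("ticks")
      // ... run the per-batch SQL query here
    }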

Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Cody, Are you referring to the val lines = messages.map(_._2)? Regards, Siva > On 15-Jun-2016, at 10:32 PM, Cody Koeninger wrote: > > Doesn't that result in consuming each RDD twice, in order to infer the > json schema? > > On Wed, Jun 15, 2016 at 11:19 AM,

RE: Limit pyspark.daemon threads

2016-06-15 Thread David Newberger
Have you tried setting spark.cores.max? “When running on a standalone deploy cluster or a Mesos cluster in "coarse-grained" sharing mode, the maximum amount of

Re: Limit pyspark.daemon threads

2016-06-15 Thread agateaaa
Thx Gene! But my concern is with CPU usage, not memory. I want to see if there is any way to control the number of pyspark.daemon processes that get spawned. We have some restriction on the number of CPUs we can use on a node, and the number of pyspark.daemon processes that get created doesn't seem to honor

RE: restarting of spark streaming

2016-06-15 Thread Chen, Yan I
Could anyone answer my question? _ From: Chen, Yan I Sent: 2016, June, 14 1:34 PM To: 'user@spark.apache.org' Subject: restarting of spark streaming Hi, I notice that in the process of restarting, spark streaming will try to recover/replay all the

Re: concat spark dataframes

2016-06-15 Thread Natu Lauchande
Hi, You can select the common columns and use DataFrame.unionAll. Regards, Natu On Wed, Jun 15, 2016 at 8:57 PM, spR wrote: > hi, > > how to concatenate spark dataframes? I have 2 frames with certain columns. > I want to get a dataframe with columns from both the
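
A minimal sketch of that suggestion (column names hypothetical); note this stacks rows, whereas the zip-based approach discussed elsewhere in the thread concatenates columns:

    import org.apache.spark.sql.functions.col

    // Keep only the shared columns, in the same order, then append the rows of df2
    // to df1 (unionAll in Spark 1.x; renamed union in 2.x).
    val common = Seq("id", "value")
    val combined = df1.select(common.map(col): _*)
      .unionAll(df2.select(common.map(col): _*))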

Re: Reporting warnings from workers

2016-06-15 Thread Ted Yu
Have you looked at: https://spark.apache.org/docs/latest/programming-guide.html#accumulators On Wed, Jun 15, 2016 at 1:24 PM, Mathieu Longtin wrote: > Is there a way to report warnings from the workers back to the driver > process? > > Let's say I have an RDD and do
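
A minimal accumulator sketch along those lines (Spark 1.x API); isValid is a hypothetical predicate and somefunction is the mapping function from the original question:

    // Counted on the workers, readable on the driver once an action has run.
    // Note that updates made inside transformations can be over-counted if tasks
    // are retried.
    val badRecords = sc.accumulator(0L, "bad records")

    val newrdd = rdd.map { value =>
      if (!isValid(value)) badRecords += 1L   // isValid is a hypothetical check
      somefunction(value)
    }

    newrdd.count()   // accumulators only reflect completed actions
    println(s"warnings seen on workers: ${badRecords.value}")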

ERROR TaskResultGetter: Exception while getting task result java.io.IOException: java.lang.ClassNotFoundException: scala.Some

2016-06-15 Thread S Sarkar
Hello, I built package for a spark application with the following sbt file: name := "Simple Project" version := "1.0" scalaVersion := "2.10.3" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.4.0" % "provided", "org.apache.spark" %% "spark-mllib"

Reporting warnings from workers

2016-06-15 Thread Mathieu Longtin
Is there a way to report warnings from the workers back to the driver process? Let's say I have an RDD and do this: newrdd = rdd.map(somefunction) In *somefunction*, I want to catch when there are invalid values in *rdd *and either put them in another RDD or send some sort of message back. Is

data too long

2016-06-15 Thread spR
I am trying to save a spark dataframe in the mysql database by using: df.write(sql_url, table='db.table'). The first column in the dataframe seems too long and I get this error: Data too long for column 'custid' at row 1. What should I do? Thanks
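
The error itself comes from MySQL (a custid value is wider than the target column allows), so the column type has to be widened on the MySQL side. For reference, an explicit JDBC write looks roughly like this (URL, credentials and table name are hypothetical):

    val props = new java.util.Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "dbpass")

    // Appends the dataframe to an existing MySQL table over JDBC.
    df.write
      .mode("append")
      .jdbc("jdbc:mysql://localhost:3306/db", "mytable", props)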

Re: IllegalArgumentException UnsatisfiedLinkError snappy-1.1.2 spark-shell error

2016-06-15 Thread Arul Ramachandran
Hi Paolo, Were you able to get this resolved? I am hitting this issue; can you please share what your solution was? Thanks On Mon, Feb 15, 2016 at 7:49 PM, Paolo Villaflores wrote: > > Yes, I have seen that. But java.io.tmpdir has a default definition in > linux--it is

concat spark dataframes

2016-06-15 Thread spR
hi, how to concatenate spark dataframes? I have 2 frames with certain columns. I want to get a dataframe with columns from both the other frames. Regards, Misha

Re: spark standalone High availibilty issues

2016-06-15 Thread dhruve ashar
NoMethodFound suggests that you are using incompatible versions of jars. Check your dependencies; they might be outdated. Updating the version or getting the right ones usually solves this issue. On Wed, Jun 15, 2016 at 9:04 AM, Jacek Laskowski wrote: > Can you post the error? >

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-15 Thread swetha kasireddy
Hi Mich, No I have not tried that. My requirement is to insert that from an hourly Spark Batch job. How is it different by trying to insert with Hive CLI or beeline? Thanks, Swetha On Tue, Jun 14, 2016 at 10:44 AM, Mich Talebzadeh wrote: > Hi Swetha, > > Have you

Re: processing 50 gb data using just one machine

2016-06-15 Thread spR
Thanks! got that. I was worried about the time itself. On Wed, Jun 15, 2016 at 10:10 AM, Sergio Fernández wrote: > In theory yes... the common sense say that: > > volume / resources = time > > So more volume on the same processing resources would just take more time. > On Jun

Re: processing 50 gb data using just one machine

2016-06-15 Thread spR
I meant that local mode is generally for testing purposes. But I have to use the entire 50 GB of data. On Wed, Jun 15, 2016 at 10:14 AM, Deepak Goel wrote: > If it is just for test purpose, why not use a smaller size of data and > test it on your notebook. When you go for the cluster, you

Re: processing 50 gb data using just one machine

2016-06-15 Thread Deepak Goel
If it is just for test purpose, why not use a smaller size of data and test it on your notebook. When you go for the cluster, you can go for 50GB (I am a newbie so my thought would be very naive) Hey Namaskara~Nalama~Guten Tag~Bonjour -- Keigu Deepak 73500 12833 www.simtree.net,

Fwd: ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks

2016-06-15 Thread VG
> > I have a very simple driver which loads a textFile and filters a > sub-string from each line in the textfile. > When the collect action is executed , I am getting an exception. (The > file is only 90 MB - so I am confused what is going on..) I am running on a > local standalone cluster > >

Re: processing 50 gb data using just one machine

2016-06-15 Thread Sergio Fernández
In theory yes... common sense says that: volume / resources = time. So more volume on the same processing resources would just take more time. On Jun 15, 2016 6:43 PM, "spR" wrote: > I have 16 gb ram, i7 > > Will this config be able to handle the processing without my

Re: processing 50 gb data using just one machine

2016-06-15 Thread spR
I have 16 GB RAM, i7. Will this config be able to handle the processing without my IPython notebook dying? The local mode is for testing purposes. But I do not have any cluster at my disposal, so can I make this work with the configuration that I have? Thank you. On Jun 15, 2016 9:40 AM, "Deepak

Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Of course :) object sparkStreaming { def main(args: Array[String]) { StreamingExamples.setStreamingLogLevels() //Set reasonable logging levels for streaming if the user has not configured log4j. val topics = "test" val brokers = "localhost:9092" val topicsSet =

Re: update mysql in spark

2016-06-15 Thread Cheng Lian
Spark SQL doesn't support the UPDATE command yet. On Wed, Jun 15, 2016, 9:08 AM spR wrote: > hi, > > can we write an update query using sqlcontext? > > sqlContext.sql("update act1 set loc = round(loc,4)") > > what is wrong in this? I get the following error. > >
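
Since UPDATE isn't available, a common workaround is read-transform-write over JDBC; a hedged sketch with hypothetical connection details and target table name:

    import org.apache.spark.sql.functions.{col, round}

    val url = "jdbc:mysql://localhost:3306/mydb"
    val props = new java.util.Properties()

    // Read the table, apply the rounding as a transformation, write to a new table.
    val act1 = sqlContext.read.jdbc(url, "act1", props)
    val rounded = act1.withColumn("loc", round(col("loc"), 4))
    rounded.write.mode("overwrite").jdbc(url, "act1_rounded", props)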

update mysql in spark

2016-06-15 Thread spR
hi, can we write an update query using sqlcontext? sqlContext.sql("update act1 set loc = round(loc,4)") what is wrong in this? I get the following error. Py4JJavaError: An error occurred while calling o20.sql. : java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier update

Re: choice of RDD function

2016-06-15 Thread Jacek Laskowski
Hi, Good to hear so! Mind sharing a few snippets of your solution? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Wed, Jun 15, 2016 at 5:03 PM, Sivakumaran S

processing 50 gb data using just one machine

2016-06-15 Thread spR
Hi, can I use spark in local mode using 4 cores to process 50 GB of data efficiently? Thank you misha

Re: Is that normal spark performance?

2016-06-15 Thread Deepak Goel
I am not an expert, but it seems all your processing is done on node1 while node2 is lying idle Hey Namaskara~Nalama~Guten Tag~Bonjour -- Keigu Deepak 73500 12833 www.simtree.net, dee...@simtree.net deic...@gmail.com LinkedIn: www.linkedin.com/in/deicool Skype: thumsupdeicool Google talk:

vecotors inside columns

2016-06-15 Thread pseudo oduesp
Hi, I want to ask a question about dense or sparse vectors: imagine I have a dataframe and one of its columns contains vectors. My question: can I give this column to machine learning algorithms as one value? df.col1 | df.col2 | 1 | (1,[2],[3] ,[] ...[6]) 2 | (1,[5],[3]
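
For what it's worth, spark.ml estimators do take a whole vector-typed column as a single input per row; a hedged sketch (the estimator choice and the "label" column are assumptions, not from the thread):

    import org.apache.spark.ml.classification.LogisticRegression

    // The existing vector column (col2 above) is passed as one value per row via
    // setFeaturesCol; a numeric "label" column is assumed to exist in df.
    val lr = new LogisticRegression()
      .setFeaturesCol("col2")
      .setLabelCol("label")
    val model = lr.fit(df)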

Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Thanks Jacek, Job completed!! :) Just used data frames and sql query. Very clean and functional code. Siva > On 15-Jun-2016, at 3:10 PM, Jacek Laskowski wrote: > > mapWithState

Re: Spark 2.0 release date

2016-06-15 Thread andy petrella
Yeah well... the prior was high... but don't have enough data on Mich to have an accurate likelihood :-) But ok, my bad, I continue with the preview stuff and leave this thread in peace ^^ tx ted cheers On Wed, Jun 15, 2016 at 4:47 PM Ted Yu wrote: > Andy: > You should

Re: Spark 2.0 release date

2016-06-15 Thread Ted Yu
Andy: You should sense the tone in Mich's response. To my knowledge, there hasn't been an RC for the 2.0 release yet. Once we have an RC, it goes through the normal voting process. FYI On Wed, Jun 15, 2016 at 7:38 AM, andy petrella wrote: > > tomorrow lunch time >

Re: Spark 2.0 release date

2016-06-15 Thread andy petrella
> tomorrow lunch time Which TZ :-) → I'm working on the update of some materials that Dean Wampler and myself will give tomorrow at Scala Days (well tomorrow CEST). Hence, I'm upgrading the materials on spark

Get both feature importance and ROC curve from a random forest classifier

2016-06-15 Thread matd
Hi ml folks ! I'm using a Random Forest for a binary classification. I'm interested in getting both the ROC *curve* and the feature importance from the trained model. If I'm not missing something obvious, the ROC curve is only available in the old mllib world, via BinaryClassificationMetrics. In

Re: choice of RDD function

2016-06-15 Thread Jacek Laskowski
Hi, Ad Q1, yes. See stateful operators like mapWithState and windows. Ad Q2, RDDs should be fine (and available out of the box), but I'd give Datasets a try too since they're .toDF away. Jacek On 14 Jun 2016 10:29 p.m., "Sivakumaran S" wrote: Dear friends, I have set up
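
A minimal mapWithState sketch (Spark 1.6 API) for reference; the key/value types, the running-sum logic and pairDStream are made up for illustration:

    import org.apache.spark.streaming.{State, StateSpec}

    // Keeps a running sum per key across batches.
    val spec = StateSpec.function { (key: String, value: Option[Int], state: State[Int]) =>
      val sum = state.getOption.getOrElse(0) + value.getOrElse(0)
      state.update(sum)
      (key, sum)
    }
    val runningTotals = pairDStream.mapWithState(spec)   // pairDStream: DStream[(String, Int)]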

Re: Is that normal spark performance?

2016-06-15 Thread Jörn Franke
What volume do you have? Why don't you use the corresponding Cassandra functionality directly? If you do it once, and not iteratively in memory, you cannot expect so much improvement > On 15 Jun 2016, at 16:01, nikita.dobryukha wrote: > > We use Cassandra 3.5 + Spark

Re: spark standalone High availibilty issues

2016-06-15 Thread Jacek Laskowski
Can you post the error? Jacek On 14 Jun 2016 10:56 p.m., "Darshan Singh" wrote: > Hi, > > I am using standalone spark cluster and using zookeeper cluster for the > high availbilty. I am getting sometimes error when I start the master. The > error is related to Leader

Re: can not show all data for this table

2016-06-15 Thread Mich Talebzadeh
at last some progress :) Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 15 June 2016 at 10:52, Lee

Is that normal spark performance?

2016-06-15 Thread nikita.dobryukha
We use Cassandra 3.5 + Spark 1.6.1 in a 2-node cluster (8 cores and 1g memory per node). There is the following Cassandra table, and I want to calculate the percentage of volume: the sum of all volume from trades in the relevant security during the time period, grouped by exchange and time bar (1 or 5

Re: Spark 2.0 release date

2016-06-15 Thread Mich Talebzadeh
Tomorrow lunchtime. Btw can you stop spamming every big data forum about good interview questions book for big data! I have seen your mails on this big data book in spark, hive and tez forums and I am sure there are many others. That seems to be the only mail you send around. This forum is for

RE: Handle empty kafka in Spark Streaming

2016-06-15 Thread David Newberger
Hi Yogesh, I'm not sure if this is possible or not. I'd be interested in knowing. My gut thinks it would be an anti-pattern if it's possible to do something like this and that's why I handle it in either the foreachRDD or foreachPartition. The way I look at spark streaming is as an application

Re: Limit pyspark.daemon threads

2016-06-15 Thread Gene Pang
As Sven mentioned, you can use Alluxio to store RDDs in off-heap memory, and you can then share that RDD across different jobs. If you would like to run Spark on Alluxio, this documentation can help: http://www.alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html Thanks, Gene On
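
A rough sketch of the sharing pattern being described, assuming an Alluxio master reachable at alluxio-master:19998 (address hypothetical; the linked documentation covers the client setup):

    // Job A persists the RDD into Alluxio...
    rdd.saveAsTextFile("alluxio://alluxio-master:19998/shared/my_rdd")

    // ...and a separate job (even a separate SparkContext) reads it back.
    val shared = sc.textFile("alluxio://alluxio-master:19998/shared/my_rdd")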

Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Chanh Le
Hi Gene, I am using Alluxio 1.1.0 and the Spark 2.0 preview version. I load from Alluxio, then cache and query; when I query for the 2nd time, Spark gets stuck. > On Jun 15, 2016, at 8:42 PM, Gene Pang wrote: > > Hi, > > Which version of Alluxio are you using? > > Thanks, > Gene > > On Tue, Jun

Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Gene Pang
Hi, Which version of Alluxio are you using? Thanks, Gene On Tue, Jun 14, 2016 at 3:45 AM, Chanh Le wrote: > I am testing Spark 2.0 > I load data from alluxio and cached then I query but the first query is ok > because it kick off cache action. But after that I run the

Substract two DStreams

2016-06-15 Thread Matthias Niehoff
Hi, I want to subtract 2 DStreams (based on the same input stream) to get all elements that exist in the original stream but not in the modified stream (the modified stream is changed using joinWithCassandraTable, which does an inner join and because of this might remove entries). Subtract is
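
Subtract isn't defined on DStreams directly, but transformWith drops down to the per-batch RDDs, where RDD.subtract is available; a sketch, with MyEvent standing in for whatever element type the streams actually carry:

    import org.apache.spark.rdd.RDD

    // For each batch interval, keep the elements of the original stream that are
    // not present in the modified (joined) stream. MyEvent is a hypothetical type;
    // subtract relies on the elements' equals/hashCode.
    val onlyInOriginal = originalStream.transformWith(
      modifiedStream,
      (orig: RDD[MyEvent], modified: RDD[MyEvent]) => orig.subtract(modified))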

RE: Handle empty kafka in Spark Streaming

2016-06-15 Thread David Newberger
If you're asking how to handle no messages in a batch window then I would add an isEmpty check like: dStream.foreachRDD(rdd => { if (!rdd.isEmpty()) ... } Or something like that. David Newberger -Original Message- From: Yogesh Vyas [mailto:informy...@gmail.com] Sent: Wednesday,
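
Spelled out a little more (process() is a hypothetical per-record function):

    dStream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {           // skip batches that carried no Kafka messages
        rdd.foreachPartition { partition =>
          partition.foreach(record => process(record))
        }
      }
    }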

RE: streaming example has error

2016-06-15 Thread David Newberger
Have you tried to “set spark.driver.allowMultipleContexts = true”? David Newberger From: Lee Ho Yeung [mailto:jobmatt...@gmail.com] Sent: Tuesday, June 14, 2016 8:34 PM To: user@spark.apache.org Subject: streaming example has error when simulate streaming with nc -lk got error below, then

Spark 2.0 release date

2016-06-15 Thread Chaturvedi Chola
When is the Spark 2.0 release planned?

Handle empty kafka in Spark Streaming

2016-06-15 Thread Yogesh Vyas
Hi, Does anyone knows how to handle empty Kafka while Spark Streaming job is running ? Regards, Yogesh - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Adding h5 files in a zip to use with PySpark

2016-06-15 Thread ar7
I am using PySpark 1.6.1 for my spark application. I have additional modules which I am loading using the argument --py-files. I also have a h5 file which I need to access from one of the modules for initializing the ApolloNet. Is there any way I could access those files from the modules if I put

can spark help to prevent memory error for itertools.combinations(initlist, 2) in python script

2016-06-15 Thread Lee Ho Yeung
I wrote a Python script which uses itertools.combinations(initlist, 2), but it gets a memory error when the number of elements in initlist is over 14,000. Is it possible to use Spark to do this work? I have seen that yatel can do this; do Spark and yatel use the hard disk as memory? If so, what needs to change in
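
One way to express this in Spark is a self-cartesian product filtered down to each unordered pair once; Spark can spill shuffle data to disk when it no longer fits in memory. A sketch in Scala, with initlist assumed to be a collection already sitting in driver memory:

    // Index the elements so each unordered pair is produced exactly once.
    val items = sc.parallelize(initlist.zipWithIndex)
    val pairs = items.cartesian(items)
      .filter { case ((_, i), (_, j)) => i < j }
      .map { case ((a, _), (b, _)) => (a, b) }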

Re: can not show all data for this table

2016-06-15 Thread Lee Ho Yeung
Hi Mich, I found the cause of my problem now: I missed setting the delimiter, which is tab, but it still got an error. I also notice that only LibreOffice can open and read it well; even Excel on Windows cannot separate it into the right format. scala> val df =
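
For reference, the delimiter is passed to the spark-csv package as an option; a sketch with a hypothetical path:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")       // first line contains the column names
      .option("delimiter", "\t")      // tab-separated input
      .load("/home/martin/data.tsv")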

Spark SQL NoSuchMethodException...DriverWrapper.()

2016-06-15 Thread Mirko
Hi All, I’m using Spark 1.6.1 and I’m getting the error below. This also appears with the current branch 1.6. The code that is generating the error is loading a table from MsSql server: sqlContext.read.format("jdbc").options(options).load(). I’ve also checked whether the Microsoft JDBC driver is loaded

how do I set TBLPROPERTIES in dataFrame.saveAsTable()?

2016-06-15 Thread Yang
I tried df.options(MAP(prop_name->prop_value)).saveAsTable(tb_name); it doesn't seem to work. Thanks a lot!

Re: can not show all data for this table

2016-06-15 Thread Lee Ho Yeung
Hi Mich, https://drive.google.com/file/d/0Bxs_ao6uuBDUQ2NfYnhvUl9EZXM/view?usp=sharing https://drive.google.com/file/d/0Bxs_ao6uuBDUS1UzTWd1Q2VJdEk/view?usp=sharing This time I ensured the headers cover all the data; only some columns which have headers do not have data, but it still cannot show all the data