Re: Is this likely to cause any problems?

2016-02-19 Thread Sabarish Sasidharan
EMR does cost more than vanilla EC2. Using spark-ec2 can result in savings with large clusters, though that is not everybody's cup of tea. Regards Sab On 19-Feb-2016 7:55 pm, "Daniel Siegmann" wrote: > With EMR supporting Spark, I don't see much reason to use the

Re: Communication between two spark streaming Job

2016-02-19 Thread Chris Fregly
if you need update notifications, you could introduce ZooKeeper (eek!) or a Kafka queue between the jobs. I've seen internal Kafka queues (relative to external spark streaming queues) used for this type of incremental update use case. think of the updates as transaction logs. > On Feb 19,

Re: Communication between two spark streaming Job

2016-02-19 Thread Ted Yu
Have you considered using a Key Value store which is accessible to both jobs ? The communication would take place through this store. Cheers On Fri, Feb 19, 2016 at 11:48 AM, Ashish Soni wrote: > Hi , > > Is there any way we can communicate across two different spark

Re: Submitting Jobs Programmatically

2016-02-19 Thread Arko Provo Mukherjee
Hello, Thanks much. I could start the service. When I run my program, the launcher is not being able to find the app class: java.lang.ClassNotFoundException: SparkSubmitter at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at

Re: Submitting Jobs Programmatically

2016-02-19 Thread Ted Yu
Cycling old bits: http://search-hadoop.com/m/q3RTtHrxMj2abwOk2 On Fri, Feb 19, 2016 at 6:40 PM, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Hi, > > Thanks for your response. Is there a similar link for Windows? I am > not sure the .sh scripts would run on windows. > > My

Re: Submitting Jobs Programmatically

2016-02-19 Thread Arko Provo Mukherjee
Hi, Thanks for your response. Is there a similar link for Windows? I am not sure the .sh scripts would run on Windows. By default the start-all.sh doesn't work and I don't see anything in localhost:8080 I will do some more investigation and come back. Thanks again for all your help! Thanks &

Re: Submitting Jobs Programmatically

2016-02-19 Thread Ted Yu
Please see https://spark.apache.org/docs/latest/spark-standalone.html On Fri, Feb 19, 2016 at 6:27 PM, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Hi, > > Thanks for your response, that really helped. > > However, I don't believe the job is being submitted. When I run spark >

Re: Submitting Jobs Programmatically

2016-02-19 Thread Arko Provo Mukherjee
Hi, Thanks for your response, that really helped. However, I don't believe the job is being submitted. When I run spark from the shell, I don't need to start it up explicitly. Do I need to start up Spark on my machine before running this program? I see the following in the SPARK_HOME\bin

Re: Submitting Jobs Programmatically

2016-02-19 Thread Holden Karau
How are you trying to launch your application? Do you have the Spark jars on your class path? On Friday, February 19, 2016, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Hello, > > I am trying to submit a spark job via a program. > > When I run it, I receive the following error:

Submitting Jobs Programmatically

2016-02-19 Thread Arko Provo Mukherjee
Hello, I am trying to submit a spark job via a program. When I run it, I receive the following error: Exception in thread "Thread-1" java.lang.NoClassDefFoundError: org/apache/spark/launcher/SparkLauncher at Spark.SparkConnector.run(MySpark.scala:33) at
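
A NoClassDefFoundError for org/apache/spark/launcher/SparkLauncher usually means the spark-launcher artifact is missing from the program's classpath. A minimal sketch of the launcher pattern, with hypothetical paths and names:

    import org.apache.spark.launcher.SparkLauncher

    // assumes "org.apache.spark" %% "spark-launcher" % "1.6.0" (version matching
    // your cluster) is on the compile classpath
    val app = new SparkLauncher()
      .setSparkHome("""C:\spark""")                    // hypothetical SPARK_HOME
      .setAppResource("""C:\myapp\target\myapp.jar""") // hypothetical application jar
      .setMainClass("SparkSubmitter")                  // main class inside that jar
      .setMaster("local[*]")
      .launch()                                        // returns a java.lang.Process
    app.waitFor()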

Re: Access to broadcasted variable

2016-02-19 Thread Shixiong(Ryan) Zhu
The broadcasted object is serialized in driver and sent to the executors. And in the executor, it will deserialize the bytes to get the broadcasted object. On Fri, Feb 19, 2016 at 5:54 AM, jeff saremi wrote: > could someone please comment on this? thanks > >
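
A minimal sketch of that lifecycle (names hypothetical):

    val lookup = Map("a" -> 1, "b" -> 2)     // built once on the driver
    val bc = sc.broadcast(lookup)            // serialized on the driver
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(k => bc.value.getOrElse(k, 0))    // bc.value deserializes it on first access per executor
      .collect()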

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread Bryan Cutler
This simple example works for me; it prints out the updated model centers. I'm running from the master branch. val sc = new SparkContext("local[2]", "test") val ssc = new StreamingContext(sc, Seconds(1)) val kMeans = new StreamingKMeans() .setK(2)
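
For reference, a fuller sketch along the same lines, assuming training vectors arrive as text such as [1.0,2.0] in a hypothetical /tmp/train directory:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext("local[2]", "test")
    val ssc = new StreamingContext(sc, Seconds(1))
    val kMeans = new StreamingKMeans()
      .setK(2)
      .setDecayFactor(1.0)
      .setRandomCenters(2, 0.0)   // 2-dimensional data, zero initial weight
    val training = ssc.textFileStream("/tmp/train").map(Vectors.parse)
    kMeans.trainOn(training)
    training.foreachRDD { (_, _) =>
      kMeans.latestModel().clusterCenters.foreach(println)  // should move batch to batch
    }
    ssc.start()
    ssc.awaitTermination()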

Re: UDAF support for DataFrames in Spark 1.5.0?

2016-02-19 Thread Richard Cobbe
On Thu, Feb 18, 2016 at 11:18:44PM +, Kabeer Ahmed wrote: > I use Spark 1.5 with CDH5.5 distribution and I see that support is > present for UDAF. From the link: > https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html,

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
Also the cluster centroids I get in streaming mode (some with negative values) do not make sense - if I use the same data and run in batch KMeans.train(sc.parallelize(parsedData), numClusters, numIterations) the cluster centers are what you would expect. Krishna On Fri, Feb 19, 2016 at 12:49 PM,

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
ok, I will share a simple example soon. Meantime you will be able to see this behavior using the example here, https://github.com/apache/spark/blob/branch-1.2/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala slightly modify it to include

Re: cannot coerce class "data.frame" to a DataFrame - with spark R

2016-02-19 Thread roni
Thanks Felix. I tried that but I still get the same error. Am I missing something? dds <- DESeqDataSetFromMatrix(countData = collect(countMat), colData = collect(colData), design = design) Error in DataFrame(colData, row.names = rownames(colData)) : cannot coerce class "data.frame" to a

Re: How to train and predict in parallel via Spark MLlib?

2016-02-19 Thread Xiangrui Meng
I put a simple example here: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/3877825096667927/588180/d9d264e39a.html On Thu, Feb 18, 2016 at 6:47 AM Игорь Ляхов wrote: > Xiangrui, thnx for your answer! > Could you

Re: equalTo isin not working as expected with a constructed column with DataFrames

2016-02-19 Thread Michael Armbrust
Can you include the output of explain(true) on the DataFrame in question? It would also be really helpful to see a small code fragment that reproduces the issue. On Thu, Feb 18, 2016 at 9:10 AM, Mehdi Ben Haj Abbes wrote: > Hi, > I forgot to mention that I'm using the

Re: Spark Job Hanging on Join

2016-02-19 Thread Michael Armbrust
Please include the output of running explain() when reporting performance issues with DataFrames. On Fri, Feb 19, 2016 at 9:31 AM, Tamara Mendt wrote: > Hi all, > > I am running a Spark job that gets stuck attempting to join two > dataframes. The dataframes are not very
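
For anyone following the thread, a minimal sketch of what is being asked for (df1 and df2 are hypothetical):

    val joined = df1.join(df2, "id")   // the join that hangs
    joined.explain(true)               // prints the logical and physical plans without running the job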

Re: Is this likely to cause any problems?

2016-02-19 Thread Nicholas Chammas
The docs mention spark-ec2 because it is part of the Spark project. There are many, many alternatives to spark-ec2 out there like EMR, but it's probably not the place of the official docs to promote any one of those third-party solutions. On Fri, Feb 19, 2016 at 11:05 AM James Hammerton

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread Bryan Cutler
Can you share more of your code to reproduce this issue? The model should be updated with each batch, but I can't tell what is happening from what you posted so far. On Fri, Feb 19, 2016 at 10:40 AM, krishna ramachandran wrote: > Hi Bryan > Agreed. It is a single statement to

RE: Spark JDBC connection - data writing success or failure cases

2016-02-19 Thread Mich Talebzadeh
agreed Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread krishna ramachandran
Hi Bryan Agreed. It is a single statement to print the centers once for *every* streaming batch (4 secs) - remember this is in streaming mode and the receiver has fresh data every batch. That is, the model is trained continuously, so I expect the centroids to change with incoming streams (at

Re: StreamingKMeans does not update cluster centroid locations

2016-02-19 Thread Bryan Cutler
Could you elaborate on where the issue is? You say calling model.latestModel.clusterCenters.foreach(println) doesn't show an updated model, but that is just a single statement to print the centers once. Also, is there any reason you don't predict on the test data like this?
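
The prediction pattern from the bundled StreamingKMeansExample looks roughly like this (testData here is a hypothetical DStream of LabeledPoints):

    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()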

Spark Job Hanging on Join

2016-02-19 Thread Tamara Mendt
Hi all, I am running a Spark job that gets stuck attempting to join two dataframes. The dataframes are not very large, one is about 2 M rows, and the other a couple of thousand rows and the resulting joined dataframe should be about the same size as the smaller dataframe. I have tried triggering

Re: Meetup in Rome

2016-02-19 Thread Denny Lee
Hey Domenico, Glad to hear that you love Spark and would like to organize a meetup in Rome. We created a Meetup-in-a-box to help with that - check out the post https://databricks.com/blog/2015/11/19/meetup-in-a-box.html. HTH! Denny On Fri, Feb 19, 2016 at 02:38 Domenico Pontari

Re: Spark JDBC connection - data writing success or failure cases

2016-02-19 Thread Russell Jurney
Oracle is a perfectly reasonable endpoint for publishing data processed in Spark. I've got to assume he's using it that way and not as a stand-in for HDFS? On Friday, February 19, 2016, Jörn Franke wrote: > Generally an Oracle DB should not be used as a storage layer for

Re: Streaming with broadcast joins

2016-02-19 Thread Srikanth
Sure. These may be unrelated. On Fri, Feb 19, 2016 at 10:39 AM, Jerry Lam wrote: > Hi guys, > > I also encounter broadcast dataframe issue not for streaming jobs but > regular dataframe join. In my case, the executors died probably due to OOM > which I don't think it should

Re: Streaming with broadcast joins

2016-02-19 Thread Srikanth
Hmmm..OK. Srikanth On Fri, Feb 19, 2016 at 10:20 AM, Sebastian Piu wrote: > I don't have the code with me now, and I ended moving everything to RDD in > the end and using map operations to do some lookups, i.e. instead of > broadcasting a Dataframe I ended broadcasting

Spark Random Forest Memory issues

2016-02-19 Thread Ewan Higgs
Hi all, Back in September there were a bunch of machine learning benchmark results published here: https://github.com/szilard/benchm-ml/ Spark's Random Forest seemed to fall down with memory issues at about 10m entries: https://github.com/szilard/benchm-ml/blob/master/2-rf/5c-spark-crash.txt

Re: install databricks csv package for spark

2016-02-19 Thread Ashok Kumar
great thank you On Friday, 19 February 2016, 15:33, Holden Karau wrote: So with --packages to spark-shell and spark-submit Spark will automatically fetch the requirements from maven. If you want to use an explicit local jar you can do that with the --jars

Re: Is this likely to cause any problems?

2016-02-19 Thread James Hammerton
Hi, Having looked at how easy it is to use EMR, I reckon you may be right, especially if using Java 8 is no more difficult with that than with spark-ec2 (where I had to install it on the master and slaves and edit the spark-env.sh). I'm now curious as to why the Spark documentation (

RE: Hive REGEXP_REPLACE use or equivalent in Spark

2016-02-19 Thread Mich Talebzadeh
Thanks, very helpful indeed. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com

Re: Streaming with broadcast joins

2016-02-19 Thread Jerry Lam
Hi guys, I also encounter broadcast dataframe issue not for streaming jobs but regular dataframe join. In my case, the executors died probably due to OOM which I don't think it should use that much memory. Anyway, I'm going to craft an example and send it here to see if it is a bug or something

Re: Spark streaming job is taking up /TMP at 100%

2016-02-19 Thread Holden Karau
That's a good question; you can find most of what you are looking for in the configuration guide at http://spark.apache.org/docs/latest/configuration.html - you probably want to change spark.local.dir to point to your scratch directory. Out of interest, what problems have you been seeing with
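
A minimal sketch of the two usual ways to set it (the scratch path is hypothetical):

    // programmatically, before the SparkContext is created
    val conf = new SparkConf().setAppName("streaming-job").set("spark.local.dir", "/data/scratch")

    // or in conf/spark-defaults.conf:
    // spark.local.dir  /data/scratch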

Re: install databricks csv package for spark

2016-02-19 Thread Holden Karau
So with --packages to spark-shell and spark-submit Spark will automatically fetch the requirements from maven. If you want to use an explicit local jar you can do that with the --jars syntax. You might find http://spark.apache.org/docs/latest/submitting-applications.html useful. On Fri, Feb 19,
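
For example (the version coordinates are illustrative; match them to your Spark and Scala build):

    spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
    spark-shell --jars /path/to/spark-csv_2.10-1.3.0.jar,/path/to/commons-csv-1.1.jar   # local jars: dependency jars must be listed too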

install databricks csv package for spark

2016-02-19 Thread Ashok Kumar
Hi, I downloaded the zipped csv libraries from databricks/spark-csv (spark-csv - CSV data source for Spark SQL and DataFrames, on github.com). Now I have a directory created called

Re: Streaming with broadcast joins

2016-02-19 Thread Sebastian Piu
I don't have the code with me now, and I ended up moving everything to RDDs in the end and using map operations to do some lookups, i.e. instead of broadcasting a Dataframe I ended up broadcasting a Map On Fri, Feb 19, 2016 at 11:39 AM Srikanth wrote: > It didn't fail. It
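
A sketch of that Map-broadcast pattern, with hypothetical names throughout:

    val metaMap = metaDF.rdd.map(r => (r.getString(0), r.getString(1))).collectAsMap()
    val bcMeta = sc.broadcast(metaMap)    // a plain Map broadcasts without the DataFrame machinery
    val enriched = eventStream.map { case (k, v) => (k, v, bcMeta.value.getOrElse(k, "unknown")) }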

Spark streaming job is taking up /TMP at 100%

2016-02-19 Thread Sutanu Das
We have a Spark streaming job and when running in LOCAL mode, it takes up /TMP at 100% and fails with the error below. This doesn't happen in YARN mode, but in YARN we have performance issues. How can I re-direct the Spark local shuffle from /TMP to /other_filesystem_location (where we have lots of

Re: How to get the code for class in spark

2016-02-19 Thread Ashok Kumar
Hi, the class body, thanks. On Friday, 19 February 2016, 11:23, Ted Yu wrote: Can you clarify your question ? Did you mean the body of your class ? On Feb 19, 2016, at 4:43 AM, Ashok Kumar wrote: Hi, If I define a class in Scala like case

Re: Is this likely to cause any problems?

2016-02-19 Thread Daniel Siegmann
With EMR supporting Spark, I don't see much reason to use the spark-ec2 script unless it is important for you to be able to launch clusters using the bleeding edge version of Spark. EMR does seem to do a pretty decent job of keeping up to date - the latest version (4.3.0) supports the latest Spark

RE: Access to broadcasted variable

2016-02-19 Thread jeff saremi
could someone please comment on this? thanks From: jeffsar...@hotmail.com To: user@spark.apache.org Subject: Access to broadcasted variable Date: Thu, 18 Feb 2016 14:44:07 -0500 I'd like to know if the broadcasted object gets serialized when accessed by the executor during the execution

Adding vertices to a graph in GraphX is taking more time with each subsequent addition

2016-02-19 Thread Udbhav Agarwal
Hi, I am adding a bunch of vertices to a graph in GraphX using the following method. I am facing a latency problem. The first time, an addition of say 400 vertices to a graph with 100,000 nodes takes around 7 seconds; the next time it's taking 15 seconds. So every subsequent add is taking more

Re: listening to recursive folder structures in s3 using pyspark streaming (textFileStream)

2016-02-19 Thread Srikanth
Apparently you can pass comma-separated folders. Try the suggestion given here --> http://stackoverflow.com/questions/29426246/spark-streaming-textfilestream-not-supporting-wildcards Let me know if this helps Srikanth On Wed, Feb 17, 2016 at 5:47 PM, Shixiong(Ryan) Zhu
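
If comma-separated paths don't work on your Spark version, one alternative (a sketch, with hypothetical folder names) is to union one stream per folder:

    val dirs = Seq("s3n://bucket/folder1", "s3n://bucket/folder2")
    val lines = ssc.union(dirs.map(ssc.textFileStream))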

Re: Accessing Web UI

2016-02-19 Thread Eduardo Costa Alfaia
Hi, try http://OAhtvJ5MCA:8080 BR On 2/19/16, 07:18, "vasbhat" wrote: >OAhtvJ5MCA

an error when I read data from parquet

2016-02-19 Thread AlexModestov
Hello everybody, I use Python API and Scala API. I read data without problem with Python API: "sqlContext = SQLContext(sc) data_full = sqlContext.read.parquet("---")" But when I use Scala: "val sqlContext = new SQLContext(sc) val data_full = sqlContext.read.parquet("---")" I get the error (I

Re: Accessing Web UI

2016-02-19 Thread Gourav Sengupta
can you please try localhost:8080? Regards, Gourav Sengupta On Fri, Feb 19, 2016 at 11:18 AM, vasbhat wrote: > Hi, > >I have installed the spark1.6 and trying to start the master > (start-master.sh) and access the webUI. > > I get the following logs on running the

Re: Streaming with broadcast joins

2016-02-19 Thread Srikanth
It didn't fail. It wasn't broadcasting. I just ran the test again and here are the logs. Every batch is reading the metadata file. 16/02/19 06:27:02 INFO HadoopRDD: Input split: file:/shared/data/test-data.txt:0+27 16/02/19 06:27:02 INFO HadoopRDD: Input split:

Re: How to get the code for class in spark

2016-02-19 Thread Ted Yu
Can you clarify your question ? Did you mean the body of your class ? > On Feb 19, 2016, at 4:43 AM, Ashok Kumar wrote: > > Hi, > > If I define a class in Scala like > > case class(col1: String, col2:Int,...) > > and it is created how would I be able to see its

Re: Accessing Web UI

2016-02-19 Thread vasbhat
Hi, I have installed the spark1.6 and trying to start the master (start-master.sh) and access the webUI. I get the following logs on running the start-master.sh Spark Command: /usr/jdk/instances/jdk1.8.0/jre/bin/java -cp

Re: Read files dynamically having different schema under one parent directory + scala + Spark 1.5.2

2016-02-19 Thread UMESH CHAUDHARY
If I understood correctly, you can have many sub-dirs under hdfs:///TestDirectory and you need to attach a schema to all part files in a sub-dir. 1) I am assuming that you know the sub-dir names: for that, you need to list all sub-dirs inside hdfs:///TestDirectory using Scala,
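
A sketch of that listing step using the Hadoop FileSystem API (parquet part files carry their own schema, so reading each sub-dir separately yields one DataFrame per schema):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val subDirs = fs.listStatus(new Path("hdfs:///TestDirectory"))
      .filter(_.isDirectory)
      .map(_.getPath.toString)
    val bySubDir = subDirs.map(dir => dir -> sqlContext.read.parquet(dir)).toMap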

Re: Submit custom python packages from current project

2016-02-19 Thread Eike von Seggern
Hello, 2016-02-16 11:03 GMT+01:00 Mohannad Ali : > Hello Everyone, > > I have code inside my project organized in packages and modules, however I > keep getting the error "ImportError: No module named " when > I run spark on YARN. > > My directory structure is something like
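
A common fix for this on YARN (a sketch; package and file names hypothetical) is to ship the package alongside the job:

    zip -r mypackage.zip mypackage/
    spark-submit --master yarn --py-files mypackage.zip main.py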

Re: Spark History Server NOT showing Jobs with Hortonworks

2016-02-19 Thread Steve Loughran
this is set up to save history to the timeline service, something which works provided the applications are all set up to publish there too. On 18 Feb 2016, at 22:22, Sutanu Das wrote: Hi Community, Challenged with Spark issues with Hortonworks (HDP

Meetup in Rome

2016-02-19 Thread Domenico Pontari
Hi guys, I spent till September 2015 in the bay area working with Spark and I love it. Now I'm back to Rome and I'd like to organize a meetup about it and Big Data in general. Any idea / suggestions? Can you eventually sponsor beers and pizza for it? Best, Domenico

Re: Hive REGEXP_REPLACE use or equivalent in Spark

2016-02-19 Thread Chandeep Singh
You might be better off using the CSV loader in this case. https://github.com/databricks/spark-csv Input: [csingh ~]$ hadoop fs -cat test.csv 360,10/02/2014,"£2,500.00",£0.00,"£2,500.00" and here is a quick and dirty way to resolve your issue: val df =
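
A sketch of that approach with spark-csv plus regexp_replace (assumes the currency symbol shown above; spark-csv's default column names are C0, C1, ...):

    import org.apache.spark.sql.functions.{col, regexp_replace}

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .load("test.csv")
    val cleaned = df.withColumn("amount",
      regexp_replace(col("C2"), "[£,]", "").cast("double"))  // "£2,500.00" -> 2500.0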

Re: Hive REGEXP_REPLACE use or equivalent in Spark

2016-02-19 Thread UMESH CHAUDHARY
My CSV: name,checked-in,booking_cost AC,true,1200 BK,false,0 DDC,true,1200 I have done: val textFile=sc.textFile("/home/user/sampleCSV.txt") val schemaString="name,checked-in,booking_cost" import org.apache.spark.sql.Row; import

Read files dynamically having different schema under one parent directory + scala + Spark 1.5.2

2016-02-19 Thread Divya Gehlot
Hi, I have a use case where I have one parent directory. The file structure looks like hdfs:///TestDirectory/spark1/part files (created by some spark job) hdfs:///TestDirectory/spark2/part files (created by some spark job). spark1 and spark2 have different schemas, like spark1 part files schema

Re: subtractByKey increases RDD size in memory - any ideas?

2016-02-19 Thread DaPsul
That could be possible, but if you extract the data and create a new RDD the size is still bigger: val data = rdd3.collect() val rdd4 = sc.parallelize(data) On 19/02/16 at 02:32, Andrew Ehrlich wrote: There could be clues in the different RDD subclasses; rdd1 is ParallelCollectionRDD but

How to get the code for class in spark

2016-02-19 Thread Ashok Kumar
Hi, If I define a class in Scala like case class(col1: String, col2: Int, ...) and it is created, how would I be able to see its description at any time? Thanks
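
One way to see the members again from the spark-shell, a minimal sketch using reflection (Account is a hypothetical example):

    case class Account(col1: String, col2: Int)
    classOf[Account].getDeclaredFields
      .foreach(f => println(s"${f.getName}: ${f.getType.getSimpleName}"))
    // prints: col1: String / col2: int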

RE: Hive REGEXP_REPLACE use or equivalent in Spark

2016-02-19 Thread Mich Talebzadeh
Ok I have created a one-liner csv file as follows: cat testme.csv 360,10/02/2014,"£2,500.00",£0.00,"£2,500.00" I use the following in Spark to split it: csv=sc.textFile("/data/incoming/testme.csv") csv.map(_.split(",")).first res159: Array[String] = Array(360, 10/02/2014, "£2,
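
A note on that split: a plain split(",") breaks the quoted "£2,500.00" field apart. A quote-aware split (a sketch; assumes balanced, unescaped quotes) only splits on commas outside double quotes:

    val line = """360,10/02/2014,"£2,500.00",£0.00,"£2,500.00""""
    val fields = line.split(""",(?=([^"]*"[^"]*")*[^"]*$)""")
    // Array(360, 10/02/2014, "£2,500.00", £0.00, "£2,500.00")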

Re: Streaming with broadcast joins

2016-02-19 Thread Sebastian Piu
I don't see anything obviously wrong with your second approach; I've done it like that before and it worked. When you say that it didn't work, what do you mean? Did it fail? It didn't broadcast? On Thu, Feb 18, 2016 at 11:43 PM Srikanth wrote: > Code with SQL broadcast hint.

Re: Logistic Regression using ML Pipeline

2016-02-19 Thread Ajinkya Kale
Please take a look at the example here http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline On Thu, Feb 18, 2016 at 9:27 PM Arunkumar Pillai wrote: > Hi > > I'm trying to build logistic regression using ML Pipeline > > val lr = new LogisticRegression() >
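
A minimal pipeline sketch in that direction (column names hypothetical; trainingDF needs the feature columns plus a "label" column):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val pipeline = new Pipeline().setStages(Array(assembler, lr))
    val model = pipeline.fit(trainingDF)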

Re: Spark JDBC connection - data writing success or failure cases

2016-02-19 Thread Jörn Franke
Generally an Oracle DB should not be used as a storage layer for Spark, due to performance reasons. You should consider HDFS; this will also help you with fault-tolerance. > On 19 Feb 2016, at 03:35, Divya Gehlot wrote: > > Hi, > I have a Spark job which connects to