Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Kartik Mathur
Yes, you are right, I initially started from the master node, but what happened suddenly after 2 days that made the workers die is what I am interested in knowing. Is it possible that the workers got disconnected because of some network issue and then tried restarting themselves but kept failing? On Sun,

Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Prabhu Joseph
Kartik, Spark Workers won't start if SPARK_MASTER_IP is wrong. Maybe you used start-slaves.sh from the Master node to start all worker nodes, where the Workers would have got the correct SPARK_MASTER_IP initially. Later any restart from the slave nodes would have failed because of the wrong

Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Kartik Mathur
Thanks Prabhu, I had wrongly configured spark_master_ip on the worker nodes to `hostname -f`, which is the worker and not the master, but now the question is *why the cluster was up initially for 2 days* and the workers only noticed this invalid configuration after 2 days? And why other workers are still

Re: Spark worker abruptly dying after 2 days

2016-02-14 Thread Prabhu Joseph
Kartik, The exception stack trace *java.util.concurrent.RejectedExecutionException* will happen if SPARK_MASTER_IP on the worker nodes is configured wrongly, for example if SPARK_MASTER_IP is the hostname of the Master Node but the workers try to connect to the IP of the master node. Check whether SPARK_MASTER_IP in

Spark worker abruptly dying after 2 days

2016-02-14 Thread Kartik Mathur
On Spark 1.5.2 I have a Spark standalone cluster with 6 workers. I left the cluster idle for 3 days and after 3 days I saw only 4 workers on the Spark master UI; 2 workers died with the same exception. The strange part is that the cluster was running stable for 2 days, but on the third day 2 workers abruptly

How to join an RDD with a hive table?

2016-02-14 Thread SRK
Hi, How can I join an RDD with a Hive table and retrieve only the records that I am interested in? Suppose I have an RDD that has 1000 records and there is a Hive table with 100,000 records; I should be able to join the RDD with the Hive table by an Id and I should be able to load only those 1000
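A minimal sketch of one way to do this with the 1.5-era API (names such as `users`, `rdd` and the `id` column are illustrative, not from the original post): turn the small RDD into a DataFrame and join it with the Hive table through a HiveContext, so only the matching rows survive the join.

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  import hiveContext.implicits._

  // hypothetical RDD of (id, value) pairs, roughly 1000 records
  val smallDF = rdd.toDF("id", "value")
  // Hive table with ~100,000 records
  val usersDF = hiveContext.table("users")
  // inner join keeps only the rows whose id appears in the small DataFrame
  val joined = usersDF.join(smallDF, "id")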

Re: How to query a hive table from inside a map in Spark

2016-02-14 Thread Alex Kozlov
While this is possible via jdbc calls, it is not the best practice: you should probably use variable broadcasting instead. On Sun, Feb 14, 2016 at 8:40 PM, SRK wrote: > Hi, > > Is it
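A rough sketch of the broadcast approach being suggested, with made-up names (smallRdd, sessionsRdd, userId): collect the small side once on the driver, broadcast it, and look it up locally inside each task instead of issuing a JDBC or Hive query per record.

  // collect the small lookup set on the driver and ship it to every executor once
  val userIds = smallRdd.map(_.id).collect().toSet
  val userIdsBc = sc.broadcast(userIds)

  // each task now filters locally, with no per-record remote calls
  val matched = sessionsRdd.filter(session => userIdsBc.value.contains(session.userId))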

Re: Best practises of share Spark cluster over few applications

2016-02-14 Thread Alex Kozlov
Praveen, the mode in which you run Spark (standalone, yarn, mesos) is determined when you create the SparkContext. You are right that spark-submit and spark-shell create different SparkContexts. In general,
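As a sketch of that point (the app name and master URL below are placeholder values): the master set on the SparkConf, or via --master on spark-submit/spark-shell, is what decides whether the application runs standalone, on YARN, or on Mesos.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("my-app")            // placeholder application name
    .setMaster("yarn-client")        // or "spark://host:7077", "mesos://host:5050", "local[*]"
  val sc = new SparkContext(conf)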

Re: Difference between spark-shell and spark-submit.Which one to use when ?

2016-02-14 Thread Alexander Pivovarov
Consider streaming for real time cases http://zdatainc.com/2014/08/real-time-streaming-apache-spark-streaming/ On Sun, Feb 14, 2016 at 7:28 PM, Divya Gehlot wrote: > Hi, > I would like to know difference between spark-shell and spark-submit in > terms of real time

Unable to insert overwrite table with Spark 1.5.2

2016-02-14 Thread Ramanathan R
Hi All, Spark 1.5.2 does not seem to be backward compatible with functionality that was available in earlier versions, at least in 1.3.1 and 1.4.1. It is not possible to insert overwrite into an existing table that was read as a DataFrame initially. Our existing code base has a few internal Hive

Re: Best practises of share Spark cluster over few applications

2016-02-14 Thread praveen S
I was also trying to launch Spark jobs from a webservice, but I thought you could run Spark jobs in yarn mode only through spark-submit. Is my understanding incorrect? Regards, Praveen On 15 Feb 2016 08:29, "Sabarish Sasidharan" wrote: > Yes you can look at

Re: which master option to view current running job in Spark UI

2016-02-14 Thread Sabarish Sasidharan
When running in YARN, you can use the YARN Resource Manager UI to get to the ApplicationMaster url, irrespective of client or cluster mode. Regards Sab On 15-Feb-2016 10:10 am, "Divya Gehlot" wrote: > Hi, > I have Hortonworks 2.3.4 cluster on EC2 and Have spark jobs as

How to query a hive table from inside a map in Spark

2016-02-14 Thread SRK
Hi, Is it possible to query a hive table which has data stored in the form of a parquet file from inside map/partitions in Spark? My requirement is that I have a User table in Hive/hdfs and for each record inside a sessions RDD, I should be able to query the User table and if the User table

which master option to view current running job in Spark UI

2016-02-14 Thread Divya Gehlot
Hi, I have a Hortonworks 2.3.4 cluster on EC2 and have Spark jobs as Scala files. I am a bit confused between the *master* options; I want to execute this Spark job in YARN. Currently running as spark-shell --properties-file /TestDivya/Spark/Oracle.properties --jars

Re: support vector machine does not classify properly?

2016-02-14 Thread prem09

RE: Running synchronized JRI code

2016-02-14 Thread Sun, Rui
For YARN mode, you can set --executor-cores 1 -Original Message- From: Sun, Rui [mailto:rui@intel.com] Sent: Monday, February 15, 2016 11:35 AM To: Simon Hafner ; user Subject: RE: Running synchronized JRI code Yes, JRI loads an R

Best way to bring up Spark with Cassandra (and Elasticsearch) in production.

2016-02-14 Thread Kevin Burton
Afternoon. About 6 months ago I tried (and failed) to get Spark and Cassandra working together in production due to dependency hell. I'm going to give it another try! Here's my general strategy. I'm going to create a maven module for my code... with spark dependencies. Then I'm going to get

RE: Running synchronized JRI code

2016-02-14 Thread Sun, Rui
Yes, JRI loads an R dynamic library into the executor JVM, which runs into thread-safety issues when there are multiple task threads within the executor. If you are running Spark in Standalone mode, it is possible to run multiple workers per node and, at the same time, limit the cores per worker to be

Difference between spark-shell and spark-submit.Which one to use when ?

2016-02-14 Thread Divya Gehlot
Hi, I would like to know the difference between spark-shell and spark-submit in terms of real-time scenarios. I am using a Hadoop cluster with Spark on EC2. Thanks, Divya

Re: Passing multiple jar files to spark-shell

2016-02-14 Thread Deng Ching-Mallete
Hi Mich, For the --jars parameter, just pass in the jars as a comma-delimited list. As for the --driver-class-path, make it colon-delimited -- similar to how you set multiple paths for an environment variable (e.g. --driver-class-path /home/hduser/jars/jconn4.jar:/home/hduser/jars/ojdbc6.jar). But if

Re: IllegalStateException : When use --executor-cores option in YARN

2016-02-14 Thread Saisai Shao
Hi Divya, Would you please provide the full stack trace of the exception? From my understanding --executor-cores should work; we could know better if you provide the full stack trace. The performance relies on many different aspects; I'd recommend you check the Spark web UI to know the application

Re: coalesce and executor memory

2016-02-14 Thread Sabarish Sasidharan
I believe you will gain more understanding if you look at or use mapPartitions() Regards Sab On 15-Feb-2016 8:38 am, "Christopher Brady" wrote: > I tried it without the cache, but it didn't change anything. The reason > for the cache is that other actions will be
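A small illustrative mapPartitions() sketch (the per-record work is a placeholder): any setup is paid once per partition rather than once per record, and the partition is processed as a single iterator.

  val result = rdd.mapPartitions { iter =>
    // hypothetical per-partition setup would go here (e.g. opening a connection)
    iter.map(record => record.toString.toUpperCase)   // placeholder per-record work
  }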

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-14 Thread Sabarish Sasidharan
Looks like your executors are running out of memory; YARN is not kicking them out. Just increase the executor memory. Also consider increasing the parallelism, i.e. the number of partitions. Regards Sab On 11-Feb-2016 5:46 am, "Nirav Patel" wrote: > In Yarn we have

Re: coalesce and executor memory

2016-02-14 Thread Christopher Brady
I tried it without the cache, but it didn't change anything. The reason for the cache is that other actions will be performed on this RDD, even though it never gets that far. I can make it work by just increasing the number of partitions, but I was hoping to get a better understanding of

Re: newbie unable to write to S3 403 forbidden error

2016-02-14 Thread Sabarish Sasidharan
Make sure you are using an S3 bucket in the same region. Also, I would access the bucket this way: s3n://bucketname/foldername. You can test privileges using the s3 command-line client. Also, if you are using instance profiles you don't need to specify access and secret keys. No harm in specifying them though.

Re: Best practises of share Spark cluster over few applications

2016-02-14 Thread Sabarish Sasidharan
Yes, you can look at using the capacity scheduler or the fair scheduler with YARN. Both allow using the full cluster when it is idle, and both allow considering CPU plus memory when allocating resources, which is sort of necessary with Spark. Regards Sab On 13-Feb-2016 10:11 pm, "Eugene Morozov"

IllegalStateException : When use --executor-cores option in YARN

2016-02-14 Thread Divya Gehlot
Hi, I am starting spark-shell with the following options: spark-shell --properties-file /TestDivya/Spark/Oracle.properties --jars /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --driver-class-path /usr/hdp/2.3.4.0-3485/spark/lib/ojdbc6.jar --packages com.databricks:spark-csv_2.10:1.1.0 --master

Re: Passing multiple jar files to spark-shell

2016-02-14 Thread Sathish Kumaran Vairavelu
--jars takes comma-separated values. On Sun, Feb 14, 2016 at 5:35 PM Mich Talebzadeh wrote: > Hi, > > > > Is there anyway one can pass multiple --driver-class-path and multiple > –jars to spark shell. > > > > For example something as below with two jar files entries for

RE: Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread Mich Talebzadeh
Thanks very much Sab, that did the trick. I can join a FACT table from Hive (ORC, partitioned + bucketed) with dimension tables from Oracle. Sounds like HiveContext is a superset of SQLContext. Dr Mich Talebzadeh LinkedIn

RE: Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread Sabarish Sasidharan
The Hive context can be used instead of the SQL context even when you are accessing data from non-Hive sources like MySQL or Postgres, for example. It has better SQL support than the SQLContext as it uses the HiveQL parser. Regards Sab On 15-Feb-2016 3:07 am, "Mich Talebzadeh" wrote:
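A sketch of that usage, with connection details and table/column names invented for illustration: a single HiveContext can read the Oracle dimension table over JDBC, register it next to the Hive fact table, and join the two in one SQL statement.

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

  // hypothetical Oracle dimension table read over JDBC with the same HiveContext
  val channelsDF = hiveContext.read.format("jdbc").options(Map(
    "url" -> "jdbc:oracle:thin:@//oracle-host:1521/mydb",
    "dbtable" -> "scott.channels",
    "user" -> "scott",
    "password" -> "xxxx")).load()
  channelsDF.registerTempTable("channels")

  // Hive fact table registered alongside it, then joined in plain SQL
  hiveContext.table("sales_fact").registerTempTable("sales")
  val joined = hiveContext.sql(
    "SELECT s.amount, c.channel_desc FROM sales s JOIN channels c ON s.channel_id = c.channel_id")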

Passing multiple jar files to spark-shell

2016-02-14 Thread Mich Talebzadeh
Hi, Is there any way one can pass multiple --driver-class-path and multiple --jars entries to spark-shell? For example something as below with two jar file entries, for Oracle (ojdbc6.jar) and Sybase IQ (jconn4.jar): spark-shell --master spark://50.140.197.217:7077 --driver-class-path

Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread Mich Talebzadeh
I am intending to get a table from Hive and register it as a temporary table in Spark. I have created contexts for both Hive and Spark as below: val sqlContext = new org.apache.spark.sql.SQLContext(sc) val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) // I get the Hive

Re: Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread ayan guha
Why can't you use the jdbc in the hive context? I don't think sharing data across contexts is allowed. On 15 Feb 2016 07:22, "Mich Talebzadeh" wrote: > I am intending to get a table from Hive and register it as temporary table > in Spark. > > > > I have created contexts for

Running synchronized JRI code

2016-02-14 Thread Simon Hafner
Hello, I'm currently running R code in an executor via JRI. Because R is single-threaded, any call to R needs to be wrapped in a `synchronized`. Now I can use a bit more than one core per executor, which is undesirable. Is there a way to tell Spark that this specific application (or even specific

Re: Spark Error: Not enough space to cache partition rdd

2016-02-14 Thread ayan guha
Have you tried repartitioning to a larger number of partitions? Also, I would suggest increasing the number of executors and giving them a smaller amount of memory each. On 15 Feb 2016 06:49, "gustavolacerdas" wrote: > I have a machine with 96GB and 24 cores. I'm trying to run a
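For reference, a one-line sketch of the repartition suggestion (the partition count is an arbitrary example, not a recommendation):

  // spread the data over more, smaller partitions before the memory-heavy stage
  val widened = inputRdd.repartition(200)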

Re: Spark Certification

2016-02-14 Thread ayan guha
Thanks. Do we have any forum or study group for certification aspirants? I would like to join. On 15 Feb 2016 05:53, "Olivier Girardot" wrote: > It does not contain (as of yet) anything > 1.3 (for example in depth > knowledge of the Dataframe API) > but you need

RE: Trying to join a registered Hive table as temporary with two Oracle tables registered as temporary in Spark

2016-02-14 Thread Mich Talebzadeh
Thanks. I tried to access a Hive table via JDBC (it works) through sqlContext: scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc) sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@4f60415b scala> val s = sqlContext.load("jdbc", |

Re: Spark Certification

2016-02-14 Thread Olivier Girardot
It does not contain (as of yet) anything > 1.3 (for example in-depth knowledge of the DataFrame API), but you need to know about all the modules (Core, Streaming, SQL, MLlib, GraphX). Regards, Olivier. 2016-02-11 19:31 GMT+01:00 Prem Sure : > I did recently. it includes

Spark Error: Not enough space to cache partition rdd

2016-02-14 Thread gustavolacerdas
I have a machine with 96GB and 24 cores. I'm trying to run a k-means algorithm with 30GB of input data. My spark-defaults.conf file is configured like this: spark.driver.memory 80g, spark.executor.memory 70g, spark.network.timeout 1200s, spark.rdd.compress

Re: GroupedDataset needs a mapValues

2016-02-14 Thread Koert Kuipers
Great, by adding a little implicit wrapper I can use Algebird's MonoidAggregator, which gives me the equivalent of GroupedDataset.mapValues (by using Aggregator.composePrepare). I am a little surprised you require a monoid and not just a semiring, but it's probably the right choice given possibly empty

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-14 Thread Olivier Girardot
You can also activate detailed GC prints to get more info. 2016-02-11 7:43 GMT+01:00 Shiva Ramagopal : > How are you submitting/running the job - via spark-submit or as a plain > old Java program? > > If you are using spark-submit, you can control the memory setting via the >
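One way to do that, assuming the flags are passed through spark.executor.extraJavaOptions (the exact GC flags depend on your JVM; these are common examples, not taken from the thread):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")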

RE: using udf to convert Oracle number column in Data Frame

2016-02-14 Thread Mich Talebzadeh
Hi Ted, Thanks for this. If generic functions exist, then in my experience they are always faster and more efficient than UDFs. For example, writing a UDF to do standard deviation in Oracle (needed this one for Oracle TimesTen IMDB) turned out not to be any quicker compared to Oracle's own
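For comparison, a minimal sketch of registering and using a Spark SQL UDF (the function, table and column names are invented); when an equivalent built-in function exists it is generally the faster choice, as noted above.

  // hypothetical currency-conversion UDF registered against the SQL context
  sqlContext.udf.register("to_gbp", (usd: Double) => usd * 0.69)
  val converted = sqlContext.sql("SELECT id, to_gbp(amount) AS amount_gbp FROM payments")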

RE: Joining three tables with data frames

2016-02-14 Thread Mich Talebzadeh
Thanks Jeff, I registered the three data frames as temporary tables and performed the SQL query directly on them. I had to convert the Oracle NUMBER and NUMBER(n,m) columns with TO_CHAR() at the query level to avoid the overflows. I think the fact that we can read data from JDBC databases
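A sketch of the pattern described, with illustrative table and column names: each DataFrame is registered as a temporary table and the three are joined in a single SQL statement.

  ordersDF.registerTempTable("orders")
  customersDF.registerTempTable("customers")
  productsDF.registerTempTable("products")

  val report = sqlContext.sql(
    """SELECT c.name, p.title, o.qty
       FROM orders o
       JOIN customers c ON o.cust_id = c.cust_id
       JOIN products p ON o.prod_id = p.prod_id""")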

RE: coalesce and executor memory

2016-02-14 Thread Silvio Fiorito
Actually, rereading your email I see you're caching. But ‘cache’ uses MEMORY_ONLY. Do you see errors about losing partitions as your job is running? Are you sure you need to cache if you're just saving to disk? Can you try the coalesce without cache? From: Christopher
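A sketch of the alternative (the partition count is an arbitrary example): cache() is fixed at MEMORY_ONLY, whereas persist() with MEMORY_AND_DISK spills partitions that do not fit in memory to disk instead of dropping them.

  import org.apache.spark.storage.StorageLevel

  val coalesced = rdd.coalesce(24)              // example partition count
  coalesced.persist(StorageLevel.MEMORY_AND_DISK)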

Re: GroupedDataset needs a mapValues

2016-02-14 Thread Andy Davidson
Hi Michael From: Michael Armbrust Date: Saturday, February 13, 2016 at 9:31 PM To: Koert Kuipers Cc: "user @spark" Subject: Re: GroupedDataset needs a mapValues > Instead of grouping with a lambda function, you can do it

Re: Imported CSV file content isn't identical to the original file

2016-02-14 Thread SLiZn Liu
This error message no longer appears now that I have upgraded to 1.6.0. -- Cheers, Todd Leo On Tue, Feb 9, 2016 at 9:07 AM SLiZn Liu wrote: > At least works for me though, temporarily disabled Kryo serializer until > upgrade to 1.6.0. Appreciate your update. :) > Luciano Resende

Re: Spark jobs run extremely slow on yarn cluster compared to standalone spark

2016-02-14 Thread Yuval.Itzchakov
Your question lacks sufficient information for us to actually provide help. Have you looked at the Spark UI to see which part of the graph is taking the longest? Have you tried logging your methods?

Using explain plan to optimize sql query

2016-02-14 Thread Mr rty ff
Hi, I have some queries that take a long time to execute, so I used df.explain(true) to print the physical and logical plans to see where the bottlenecks are. As the query is very complicated, I got a very unreadable result. How can I parse it into something more readable and analyze it? And another