Re: --packages configuration equivalent item name?

2016-04-04 Thread Saisai Shao
spark.jars.ivy, spark.jars.packages, spark.jars.excludes are the configurations you can use. Thanks Saisai On Sun, Apr 3, 2016 at 1:59 AM, Russell Jurney wrote: > Thanks, Andy! > > On Mon, Mar 28, 2016 at 8:44 AM, Andy Davidson < > a...@santacruzintegration.com> wrote:
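
The first two mirror the --packages and --exclude-packages flags of spark-submit, while spark.jars.ivy points the dependency resolver at an Ivy directory; a minimal spark-defaults.conf sketch (the coordinates and path below are illustrative, not from the thread):

    spark.jars.packages   org.apache.spark:spark-streaming-kafka_2.10:1.6.1
    spark.jars.excludes   org.slf4j:slf4j-log4j12
    spark.jars.ivy        /tmp/.ivy2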

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
Dear all, Thank you for your responses. Michael Slavitch: > Just to be sure: Has spark-env.sh and spark-defaults.conf been correctly > propagated to all nodes? Are they identical? Yes; these files are stored on a shared memory directory accessible to all nodes. Koert Kuipers: > we ran into

Standard Deviation in Hive 2 is still incorrect

2016-04-04 Thread Mich Talebzadeh
Hi, I reported back in April 2015 that what Hive calls the Standard Deviation function STDDEV is a pointer to STDDEV_POP. This is incorrect and has not been rectified in Hive 2. Both Oracle and Sybase point STDDEV to STDDEV_SAMP, not STDDEV_POP. Also I did tests with Spark 1.6 as well and Spark
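
Spark 1.6 exposes both estimators explicitly, so the discrepancy is easy to check; a minimal Scala sketch (the DataFrame and column names are illustrative):

    import org.apache.spark.sql.functions.{col, stddev_pop, stddev_samp}

    // stddev_samp divides by (n - 1), stddev_pop by n; only the sample form
    // matches what Oracle and Sybase return for STDDEV.
    df.agg(stddev_samp(col("amount_sold")), stddev_pop(col("amount_sold"))).show()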

Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
The only way I've found to make it work now is by using the current Spark context and changing its configuration using spark-shell options, which is really different from pyspark, where you can't instantiate a new one, initialize it, etc. > On Apr 4, 2016, at 18:16, Cyril Scetbon

Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-04 Thread Ted Yu
bq. I'm on version 2.10 for spark The above is the Scala version. Can you give us the Spark version? Thanks On Mon, Apr 4, 2016 at 2:36 PM, mpawashe wrote: > Hi all, > > I am using Spark Streaming API (I'm on version 2.10 for spark and > streaming), and I am running into a

Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
It doesn't, as you can see: http://pastebin.com/nKcMCtGb I don't need to set the master as I'm using Yarn and I'm on one of the yarn nodes. When I instantiate the Spark Streaming Context with the Spark conf, it tries to create a new Spark Context, but even with

Spark Streaming and Performance

2016-04-04 Thread Mich Talebzadeh
Hi, Having started using Spark Streaming with Kafka, it seems that it offers a number of good opportunities. I am considering using it for Complex Event Processing (CEP) by building CEP adaptors and Spark transformers. Anyway, before going there I would like to see if anyone has done benchmarks on

Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-04 Thread mpawashe
Hi all, I am using Spark Streaming API (I'm on version 2.10 for spark and streaming), and I am running into a function serialization issue that I do not run into when using Spark in batch (non-streaming) mode. If I wrote code like this: def run(): Unit = { val newStream = stream.map(x => {
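
A frequent cause in streaming jobs is a closure that calls an instance method and therefore drags the whole non-serializable enclosing class into the task; a minimal sketch of the usual fix, keeping the mapping logic in a function value on a serializable object (the source, host/port and logic are illustrative):

    import org.apache.spark.streaming.StreamingContext

    object StreamJob extends Serializable {
      // A function value captures nothing extra, unlike a method reference
      // that pulls in the enclosing (possibly non-serializable) instance.
      val toLength: String => Int = _.length

      def run(ssc: StreamingContext): Unit = {
        val stream = ssc.socketTextStream("localhost", 9999)
        stream.map(toLength).print()
      }
    }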

Re: pyspark unable to convert dataframe column to a vector: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2016-04-04 Thread Andy Davidson
Hi Jeff, Sorry I did not respond sooner; I was out of town. Here is the code I use to initialize the HiveContext # load data set from pyspark.sql import HiveContext #,SQLContext, Row # window functions require HiveContext (spark 2.x will not require hive) #sqlContext = SQLContext(sc)

Re: RDDs caching in typical machine learning use cases

2016-04-04 Thread Eugene Morozov
Hi, Yes, I believe people do that. I also believe that SparkML is able to figure out when to cache some internal RDD; that's definitely true for the random forest algo. It doesn't harm to cache the same RDD twice, either. But it's not clear what you'd like to know... -- Be well! Jean Morozov On

Re: spark-shell failing but pyspark works

2016-04-04 Thread Mich Talebzadeh
Hi Cyril, You can connect to the Spark shell from any node. The connection is made to the master through the --master IP address, like below: spark-shell --master spark://50.140.197.217:7077 Now in the Scala code you can specify something like below: val sparkConf = new SparkConf().
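
A hedged completion of that snippet for a Spark Streaming job (the app name and master URL follow the thread; the batch interval is illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().
      setAppName("StreamTest").
      setMaster("spark://50.140.197.217:7077")   // or rely on spark-shell's --master

    // Build the streaming context on top of that configuration.
    val ssc = new StreamingContext(sparkConf, Seconds(10))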

Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
I suppose it doesn't work using spark-shell either? Can you confirm? Thanks > On Apr 3, 2016, at 03:39, Mich Talebzadeh wrote: > > This works fine for me > > val sparkConf = new SparkConf(). > setAppName("StreamTest"). >

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-04 Thread Mich Talebzadeh
Actually this may not be a bug. It is just that the optimizer decides to do a nested loop join over a hash join when more than two OR joins are involved. With one equality predicate a hash join is chosen: 4> SELECT COUNT(SALES.PROD_ID) from SALES, SALES2 5> WHERE SALES.CUST_ID = SALES2.CUST_ID 6> go QUERY
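
On the Spark side you can verify which strategy the optimizer picks with explain(); a minimal sketch, assuming SALES and SALES2 are registered as tables (the column names follow the example above):

    // Single equality predicate: an equi-join, so a hash-based join is possible.
    sqlContext.sql(
      """SELECT COUNT(SALES.PROD_ID) FROM SALES, SALES2
         WHERE SALES.CUST_ID = SALES2.CUST_ID""").explain()

    // OR'ed predicates: no longer a pure equi-join, so the plan falls back to a
    // nested-loop style join, which is where the slowdown comes from.
    sqlContext.sql(
      """SELECT COUNT(SALES.PROD_ID) FROM SALES, SALES2
         WHERE SALES.CUST_ID = SALES2.CUST_ID
            OR SALES.PROD_ID = SALES2.PROD_ID""").explain()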

Re: Support for time column type?

2016-04-04 Thread Philip Weaver
Hmm, yeah it looks like I could use that to represent time since start of day. I'm porting existing large SQL queries from Postgres to Spark SQL for a quick POC, so I'd prefer not to have to make many changes to it. I'm not sure if the CalendarIntervalType can be used as a drop-in replacement (i.e.

Re: All inclusive uber-jar

2016-04-04 Thread Haroon Rasheed
To add to Mich: I put build.sbt under the MyProject root folder (MyProject/build.sbt), and assembly.sbt is placed in the folder called "project" under the MyProject folder (MyProject/project/assembly.sbt); also the first line in build.sbt imports the assembly keys, as below: import
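
A hedged sketch of that layout (the sbt-assembly version is illustrative; newer releases of the plugin auto-import the keys, so the explicit import is only needed for the older 0.11.x line):

    MyProject/build.sbt              <- first line: import AssemblyKeys._
    MyProject/project/assembly.sbt   <- addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")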

Re: All inclusive uber-jar

2016-04-04 Thread Mich Talebzadeh
Thanks all. Actually I have a generic shell script that, for a given Scala program, creates a jar file using Maven or SBT. I modified that one to create an uber jar file as well, using SBT assembly. The root directory structure for the uber jar build has one more sub-directory. In addition, SBT relies on a single sbt

Spark SQL(Hive query through HiveContext) always creating 31 partitions

2016-04-04 Thread nitinkak001
I am running Hive queries using HiveContext from my Spark code. No matter which query I run and how much data it is, it always generates 31 partitions. Does anybody know the reason? Is there a predefined/configurable setting for it? I essentially need more partitions. I am using this code snippet to
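
Until the cause of the fixed partition count is clear, one workaround is to repartition the result explicitly; a minimal sketch (the query and target partition count are illustrative):

    val df = hiveContext.sql("SELECT * FROM some_table")   // illustrative query
    println(df.rdd.partitions.length)                      // 31 in the case above
    val wider = df.repartition(200)                        // force more partitions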

SparkDriver throwing java.lang.OutOfMemoryError: Java heap space

2016-04-04 Thread Nirav Patel
Hi, We are using Spark 1.5.2 and recently started hitting this issue after our dataset grew from 140GB to 160GB. The error is thrown during shuffle fetch on the reduce side, which should all happen on executors, and the executors should report it! However it gets reported only on the driver. SparkContext gets shutdown

Fwd: All inclusive uber-jar

2016-04-04 Thread vetal king
-- Forwarded message -- From: vetal king Date: Mon, Apr 4, 2016 at 8:59 PM Subject: Re: All inclusive uber-jar To: Mich Talebzadeh Not sure how to create an uber jar using sbt, but this is how you can do it using Maven.

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Ted Yu
bq. the modifications do not touch the scheduler If the changes can be ported over to 1.6.1, do you mind reproducing the issue there? I ask because the master branch changes very fast. It would be good to narrow down where the behavior you observed started showing. On Mon, Apr 4, 2016 at 6:12

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Michael Slavitch
Just to be sure: Has spark-env.sh and spark-defaults.conf been correctly propagated to all nodes? Are they identical? > On Apr 4, 2016, at 9:12 AM, Mike Hynes <91m...@gmail.com> wrote: > > [ CC'ing dev list since nearly identical questions have occurred in > user list recently w/o

Re: All inclusive uber-jar

2016-04-04 Thread Marco Mistroni
Hi, you can use SBT assembly to create the uber jar. You should set the Spark libraries as 'provided' in your SBT build. Hth Marco Ps apologies if by any chance I am telling you something you already know On 4 Apr 2016 2:36 pm, "Mich Talebzadeh" wrote: > Hi, > > > When one builds a project for

All inclusive uber-jar

2016-04-04 Thread Mich Talebzadeh
Hi, When one builds a project for Spark, in this case Spark Streaming with SBT, as usual I add dependencies as follows: libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1" However when I submit
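
Following the 'provided' advice in the replies, a hedged build.sbt sketch (the name and Scala version are illustrative; spark-streaming is already on the cluster classpath, while the Kafka integration is not and so stays in the assembled jar):

    name := "StreamTest"
    scalaVersion := "2.10.6"

    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided"
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"

    // build with: sbt assembly, then spark-submit the resulting fat jar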

RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
[ CC'ing dev list since nearly identical questions have occurred in user list recently w/o resolution; c.f.: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html

Failed to get broadcast_1_piece0 of broadcast_1

2016-04-04 Thread Akhilesh Pathodia
Hi, I am running Spark jobs on YARN in cluster mode. The job gets its messages from a Kafka direct stream. I am using broadcast variables and checkpointing every 30 seconds. When I start the job the first time, it runs fine without any issue. If I kill the job and restart it, it throws the below exception in
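
Broadcast variables are not recreated when a streaming job is restarted from a checkpoint, so the recovered tasks look for broadcast blocks that no longer exist; a commonly used workaround is a lazily re-created singleton, sketched below (the broadcast contents are illustrative):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object LookupTable {
      @volatile private var instance: Broadcast[Map[String, String]] = _

      // Rebuilds the broadcast on first use after a restart instead of relying
      // on the now-missing broadcast_1_piece0 blocks.
      def getInstance(sc: SparkContext): Broadcast[Map[String, String]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.broadcast(Map("key" -> "value"))   // illustrative contents
            }
          }
        }
        instance
      }
    }

    // inside foreachRDD: val table = LookupTable.getInstance(rdd.sparkContext)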

Spark 1.6.1 binary pre-built for Hadoop 2.6 may be broken

2016-04-04 Thread Kousuke Saruta
Hi all, I noticed the binary pre-built for Hadoop 2.6, which we can download from spark.apache.org/downloads.html (Direct Download), may be broken. I couldn't decompress at least the following 4 tgzs with the "tar xfzv" command, and the md5 checksums didn't match. * spark-1.6.1-bin-hadoop2.6.tgz *

Re:

2016-04-04 Thread Akhil Das
1 core with 4 partitions means the partitions are executed one by one, not in parallel. For the Kafka question, if you don't have a higher data volume then you may not need 40 partitions. Thanks Best Regards On Sat, Apr 2, 2016 at 7:35 PM, Hemalatha A < hemalatha.amru...@googlemail.com> wrote: > Hello, > > As

Re: Read Parquet in Java Spark

2016-04-04 Thread Akhil Das
I didn't know you have a parquet file containing JSON data. Thanks Best Regards On Mon, Apr 4, 2016 at 2:44 PM, Ramkumar V wrote: > Hi Akhil, > > Thanks for your help. Why do you put the separator as "," ? > > I have a parquet file which contains only json in each line.

Re: Read Parquet in Java Spark

2016-04-04 Thread Ramkumar V
Hi Akhil, Thanks for your help. Why do you use "," as the separator? I have a parquet file which contains only JSON in each line. *Thanks*, On Mon, Apr 4, 2016 at 2:34 PM, Akhil Das wrote: > Something like this (in scala): >

Re: Read Parquet in Java Spark

2016-04-04 Thread Akhil Das
Something like this (in Scala): val rdd = parquetFile.javaRDD().map(row => row.mkString(",")) You can create a map operation over your javaRDD to convert the org.apache.spark.sql.Row to String (the Row.mkString()

Re: How many threads will be used to read RDBMS after set numPartitions =10 in Spark JDBC

2016-04-04 Thread Mich Talebzadeh
This all depends on whether you provide the driver with information on the underlying RDBMS table; assuming there is a unique ID on the underlying table, you can use it to partition the load. Have a look at this: http://metricbrew.com/get-data-from-databases-with-apache-spark-jdbc/ HTH Dr Mich Talebzadeh
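
A minimal sketch of such a partitioned JDBC read in Spark 1.6 (the URL, table, bounds and credentials are illustrative); each partition is read through its own JDBC connection, so with numPartitions = 10 up to 10 reads can run concurrently, bounded by the available executor cores:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser")            // illustrative credentials
    props.setProperty("password", "dbpass")

    val df = sqlContext.read.jdbc(
      "jdbc:mysql://dbhost:3306/sales",            // illustrative URL
      "transactions",                              // illustrative table
      "id",                                        // numeric, roughly uniform column
      1L, 1000000L,                                // lower/upper bound of that column
      10,                                          // numPartitions
      props)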

How many threads will be used to read RDBMS after set numPartitions =10 in Spark JDBC

2016-04-04 Thread Zhang, Jingyu
Hi All, I want to read MySQL from Spark. Please let me know how many threads will be used to read the RDBMS after setting numPartitions = 10 in Spark JDBC. What is the best practice to read a large dataset from an RDBMS into Spark? Thanks, Jingyu -- This message and its attachments may contain legally

Re: Where to set properties for the retainedJobs/Stages?

2016-04-04 Thread Max Schmidt
Okay, I put the props in spark-defaults, but they are not recognized, as they don't appear in the 'Environment' tab during an application execution; spark.eventLog.enabled, for example. On 01.04.2016 at 21:22, Ted Yu wrote: > Please > read >
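
For reference, a sketch of the relevant lines in conf/spark-defaults.conf (the values are illustrative); the file is read by spark-submit on the machine that launches the application, so it only affects applications started after the change and from a node that has it:

    spark.ui.retainedJobs     200
    spark.ui.retainedStages   500
    spark.eventLog.enabled    true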

Re: Read Parquet in Java Spark

2016-04-04 Thread Ramkumar V
Any idea on this? How do I convert a parquet file into a JavaRDD<String>? *Thanks*, On Thu, Mar 31, 2016 at 4:30 PM, Ramkumar V wrote: > Hi, > > Thanks for the reply. I tried this. It's returning JavaRDD<Row> instead > of JavaRDD<String>. How to get
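
Since each row already holds JSON, a hedged alternative to joining columns with a separator (shown in Scala; from Java, call .toJavaRDD() on either result to get a JavaRDD<String>):

    val df = sqlContext.read.parquet("/path/to/file.parquet")   // illustrative path

    // If the parquet schema is a single string column holding the JSON text:
    val json1 = df.map(row => row.getString(0))

    // If the JSON was written out as typed columns, rebuild the JSON text per row:
    val json2 = df.toJSON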