Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Zoltan Fedor
Hi Deenar, As suggested, I have moved the hive-site.xml from HADOOP_CONF_DIR ($SPARK_HOME/hadoop-conf) to YARN_CONF_DIR ($SPARK_HOME/conf/yarn-conf) and used the command below to start pyspark, but the error is exactly the same as before. $ HADOOP_CONF_DIR=$SPARK_HOME/hadoop-conf

Re: Mock Cassandra DB Connection in Unit Testing

2015-10-29 Thread Priya Ch
One more question: if I have a function which takes an RDD as a parameter, how do we mock an RDD? On Thu, Oct 29, 2015 at 5:20 PM, Priya Ch wrote: > How do we do it for Cassandra..can we use the same Mocking ? > EmbeddedCassandra Server is available with

Re: Packaging a jar for a jdbc connection using sbt assembly and scala.

2015-10-29 Thread Dean Wood
Hi, Actually I'm using Spark 1.5.1. I have it in standalone mode on my laptop for testing purposes at the moment. I have no doubt that if I were on a cluster, I'd have to put the jar on all the workers, although I will point out that it is really irritating to have to do that having built a fat jar

RE: Spark/Kafka Streaming Job Gets Stuck

2015-10-29 Thread Sabarish Sasidharan
If you are writing to S3, also make sure that you are using the direct output committer. I don't have streaming jobs but it helps in my machine learning jobs. Also, though more partitions help in processing faster, they do slow down writes to S3. So you might want to coalesce before writing to S3.

Re: Packaging a jar for a jdbc connection using sbt assembly and scala.

2015-10-29 Thread Kai Wei
Submitting your app in client mode may help with your problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a-jar-for-a-jdbc-connection-using-sbt-assembly-and-scala-tp25225p25228.html Sent from the Apache Spark User List mailing list archive at

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Deenar Toraskar
Are you using Spark built with hive ? # Apache Hadoop 2.4.X with Hive 13 support mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package On 29 October 2015 at 13:08, Zoltan Fedor wrote: > Hi Deenar, > As suggested, I have

Re: spark-1.5.1 application detail ui url

2015-10-29 Thread Jean-Baptiste Onofré
Hi, The running application UI should be available on the worker IP (on 4040 default port), right ? So, basically, the problem is on the link of the master UI, correct ? Regards JB On 10/29/2015 01:45 PM, carlilek wrote: I administer an HPC cluster that runs Spark clusters as jobs. We run

Re: Mock Cassandra DB Connection in Unit Testing

2015-10-29 Thread Priya Ch
How do we do it for Cassandra..can we use the same Mocking ? EmbeddedCassandra Server is available with CassandraUnit. Can this be used in Spark Code as well ? I mean with Scala code ? On Thu, Oct 29, 2015 at 5:03 PM, Василец Дмитрий wrote: > there is example how i

Re: Need more tasks in KafkaDirectStream

2015-10-29 Thread Cody Koeninger
Consuming from kafka is inherently limited to using a number of consumer nodes less than or equal to the number of kafka partitions. If you think about it, you're going to be paying some network cost to repartition that data from a consumer to different processing nodes, regardless of what Spark

Re: nested select is not working in spark sql

2015-10-29 Thread Deenar Toraskar
You can try the following syntax https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries SELECT * FROM A WHERE A.a IN (SELECT foo FROM B); Regards Deenar *Think Reactive Ltd* deenar.toras...@thinkreactive.co.uk 07714140812 On 28 October 2015 at 14:37, Richard Hillegas

Mock Cassandra DB Connection in Unit Testing

2015-10-29 Thread Priya Ch
Hi All, For my Spark Streaming code, which writes the results to a Cassandra DB, I need to write unit test cases. What are the available test frameworks to mock the connection to the Cassandra DB?

[Spark] java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-29 Thread Yifan LI
Hey, I was just trying to scan a large RDD sortedRdd, ~1 billion elements, using the toLocalIterator api, but an exception was returned as it was almost finished: java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at

Pivot Data in Spark and Scala

2015-10-29 Thread Ascot Moss
Hi, I have data as follows: A, 2015, 4 A, 2014, 12 A, 2013, 1 B, 2015, 24 B, 2013 4 I need to convert the data to a new format: A, 4,12,1 B, 24,,4 Any idea how to do this in Spark with Scala? Thanks

Re: Mock Cassandra DB Connection in Unit Testing

2015-10-29 Thread Василец Дмитрий
Here is an example of how I mock MySQL: import org.scalamock.scalatest.MockFactory val connectionMock = mock[java.sql.Connection] val statementMock = mock[PreparedStatement] (connectionMock.prepareStatement(_: String)).expects(sql.toString).returning(statementMock) (statementMock.executeUpdate
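A fuller sketch of that mocking approach, assuming ScalaMock with ScalaTest; the spec class, SQL string and parameter value below are illustrative, not from the original mail:

    import java.sql.{Connection, PreparedStatement}
    import org.scalamock.scalatest.MockFactory
    import org.scalatest.FlatSpec

    class WriterSpec extends FlatSpec with MockFactory {
      "the writer" should "prepare and parameterise one statement" in {
        val sql = "INSERT INTO events VALUES (?)"
        val connectionMock = mock[Connection]
        val statementMock = mock[PreparedStatement]
        // expect one prepareStatement call and hand back the mocked statement
        (connectionMock.prepareStatement(_: String)).expects(sql).returning(statementMock)
        // expect the statement to be parameterised with the test value
        (statementMock.setString _).expects(1, "some-value")
        // the code under test would receive connectionMock instead of a real connection
      }
    }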

Re: [Spark] java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-29 Thread Yifan LI
My guess is that before scanning that RDD, I sorted it and set the partitioning, so the result is not balanced: sortBy[S](f: Function[T, S], ascending: Boolean, numPartitions: Int) I will try to

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-29 Thread Deenar Toraskar
Hi Bryan For your use case you don't need to have multiple metastores. The default metastore uses embedded Derby . This cannot be shared amongst multiple

Re: Mock Cassandra DB Connection in Unit Testing

2015-10-29 Thread Adrian Tanase
Does it need to be a mock? Can you use sc.parallelize(data)? From: Priya Ch Date: Thursday, October 29, 2015 at 2:00 PM To: Василец Дмитрий Cc: "user@spark.apache.org", "spark-connector-u...@lists.datastax.com"
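A minimal sketch of that suggestion, building a real RDD from in-memory test data instead of mocking one; the data and functionUnderTest below are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("unit-test")
    val sc = new SparkContext(conf)
    // a small, deterministic RDD to feed into the function under test
    val testRdd = sc.parallelize(Seq(("user1", 10), ("user2", 20)))
    // e.g. assert(functionUnderTest(testRdd).count() == 2)   // functionUnderTest is hypothetical
    sc.stop()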

Re: Spark/Kafka Streaming Job Gets Stuck

2015-10-29 Thread Adrian Tanase
You can decouple the batch interval and the window sizes. If during processing you’re aggregating data and your operations benefit from an inverse function, then you can optimally process windows of data. E.g. you could set a global batch interval of 10 seconds. You can process the incoming data
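A minimal sketch of that pattern, assuming an existing SparkConf named conf, a DStream of (key, count) pairs named events, and illustrative durations:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(conf, Seconds(10))   // 10s global batch interval
    ssc.checkpoint("hdfs:///tmp/checkpoints")           // required by the inverse-function form
    val windowed = events.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,    // add the batches entering the window
      (a: Long, b: Long) => a - b,    // subtract the batches leaving it (inverse function)
      Seconds(300),                   // window length
      Seconds(10))                    // slide interval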

Re: Spark/Kafka Streaming Job Gets Stuck

2015-10-29 Thread Cody Koeninger
If you're writing to s3, want to avoid small files, and don't actually need 3 minute latency... you may want to consider just running a regular spark job (using KafkaUtils.createRDD) at scheduled intervals rather than a streaming job. On Thu, Oct 29, 2015 at 8:16 AM, Sabarish Sasidharan <
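A rough sketch of that batch alternative, assuming the Spark 1.x Kafka integration; the broker, topic, offsets and output path are placeholders, and tracking offsets between runs is up to the job:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    def runOnce(sc: SparkContext): Unit = {
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      // offsets recorded somewhere by the previous scheduled run
      val offsetRanges = Array(OffsetRange("my-topic", 0, 0L, 100000L))
      val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
        sc, kafkaParams, offsetRanges)
      // fewer, larger files instead of many small streaming outputs; use a unique path per run
      rdd.map(_._2).coalesce(4).saveAsTextFile("s3n://my-bucket/batch-output/run-1/")
    }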

spark-1.5.1 application detail ui url

2015-10-29 Thread carlilek
I administer an HPC cluster that runs Spark clusters as jobs. We run Spark over the backend network (typically used for MPI), which is not accessible outside the cluster. Until we upgraded to 1.5.1 (from 1.3.1), this did not present a problem. Now the Application Detail UI link is returning the IP

Exception while reading from kafka stream

2015-10-29 Thread Ramkumar V
Hi, I'm trying to read from a Kafka stream and print it to a text file. I'm using Java over Spark. I don't know why I'm getting the following exception. Also, the exception message is very abstract. Can anyone please help me? Log Trace : 15/10/29 12:15:09 ERROR scheduler.JobScheduler: Error in job

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Zoltan Fedor
Yes, I am. It was compiled with the following: export SPARK_HADOOP_VERSION=2.5.0-cdh5.3.3 export SPARK_YARN=true export SPARK_HIVE=true export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0-cdh5.3.3 -Phive -Phive-thriftserver

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Zoltan Fedor
Yes, I have the hive-site.xml in $SPARK_HOME/conf, also in yarn-conf, in /etc/hive/conf, etc On Thu, Oct 29, 2015 at 10:46 AM, Kai Wei wrote: > Did you try copy it to spark/conf dir? > On 30 Oct 2015 1:42 am, "Zoltan Fedor" wrote: > >> There is

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Kai Wei
I failed to see a hadoop-2.5 profile in the pom. Maybe that's the problem. On 30 Oct 2015 1:51 am, "Zoltan Fedor" wrote: > The funny thing is, that with Spark 1.2.0 on the same machine (Spark 1.2.0 > is the default shipped with CDH 5.3.3) the same hive-site.xml is being >

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Zoltan Fedor
Possible. Let me try to recompile with export SPARK_HADOOP_VERSION=2.5.0-cdh5.3.3 export SPARK_YARN=true export SPARK_HIVE=true export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.3 -Phive -Phive-thriftserver

Re: SparkSQL: What is the cost of DataFrame.registerTempTable(String)? Can I have multiple tables referencing to the same DataFrame?

2015-10-29 Thread Michael Armbrust
It's super cheap. It's just a hashtable stored on the driver. Yes, you can have more than one name for the same DF. On Wed, Oct 28, 2015 at 6:17 PM, Anfernee Xu wrote: > Hi, > > I just want to understand the cost of DataFrame.registerTempTable(String), > is it just a
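In other words, registering just adds an entry to that name-to-DataFrame map, so the same DataFrame can sit behind several names; a small sketch, assuming df and sqlContext already exist:

    df.registerTempTable("events")
    df.registerTempTable("events_backup")   // second name, same underlying DataFrame
    sqlContext.sql("SELECT COUNT(*) FROM events").show()
    sqlContext.sql("SELECT COUNT(*) FROM events_backup").show()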

[SPARK STREAMING ] Sending data to ElasticSearch

2015-10-29 Thread Nipun Arora
Hi, I am sending data to an elasticsearch deployment. The printing to file seems to work fine, but I keep getting no-node found for ES when I send data to it. I suspect there is some special way to handle the connection object? Can anyone explain what should be changed here? Thanks Nipun The

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Kai Wei
Create /user/biapp in hdfs manually first. On 30 Oct 2015 1:36 am, "Zoltan Fedor" wrote: > Sure, I did it with spark-shell, which seems to be showing the same error > - not using the hive-site.xml > > > $ HADOOP_CONF_DIR=$SPARK_HOME/hadoop-conf >

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Kai Wei
Did you try copy it to spark/conf dir? On 30 Oct 2015 1:42 am, "Zoltan Fedor" wrote: > There is /user/biapp in hdfs. The problem is that the hive-site.xml is > being ignored, so it is looking for it locally. > > On Thu, Oct 29, 2015 at 10:40 AM, Kai Wei

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Deenar Toraskar
I don't know a lot about how pyspark works. Can you possibly try running spark-shell and doing the same? sqlContext.sql("show databases").collect Deenar On 29 October 2015 at 14:18, Zoltan Fedor wrote: > Yes, I am. It was compiled with the following: > > export

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Zoltan Fedor
Sure, I did it with spark-shell, which seems to be showing the same error - not using the hive-site.xml $ HADOOP_CONF_DIR=$SPARK_HOME/hadoop-conf YARN_CONF_DIR=$SPARK_HOME/yarn-conf HADOOP_USER_NAME=biapp MASTER=yarn $SPARK_HOME/bin/pyspark --deploy-mode client --driver-class-path

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Zoltan Fedor
The funny thing is, that with Spark 1.2.0 on the same machine (Spark 1.2.0 is the default shipped with CDH 5.3.3) the same hive-site.xml is being picked up and I have no problem whatsoever. On Thu, Oct 29, 2015 at 10:48 AM, Zoltan Fedor wrote: > Yes, I have the

Re: Collect Column as Array in Grouped DataFrame

2015-10-29 Thread Michael Armbrust
You can use a Hive UDF. import org.apache.spark.sql.functions._ callUDF("collect_set", $"columnName") or just SELECT collect_set(columnName) FROM ... Note that in 1.5 I think this actually does not use tungsten. In 1.6 it should though. I'll add that the experimental Dataset API (preview in
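For a grouped DataFrame the same call works inside agg; a small sketch, assuming a DataFrame df with columns "k" and "columnName":

    import org.apache.spark.sql.functions._

    val grouped = df.groupBy(col("k"))
      .agg(callUDF("collect_set", col("columnName")).as("values"))
    grouped.show()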

Re: Inconsistent Persistence of DataFrames in Spark 1.5

2015-10-29 Thread Michael Armbrust
There were several bugs in Spark 1.5 and we strongly recommend you upgrade to 1.5.1. If the issue persists it would be helpful to see the result of calling explain. On Wed, Oct 28, 2015 at 4:46 PM, wrote: > Hi, just a couple cents. > > > > are your joining columns

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Zoltan Fedor
There is /user/biapp in hdfs. The problem is that the hive-site.xml is being ignored, so it is looking for it locally. On Thu, Oct 29, 2015 at 10:40 AM, Kai Wei wrote: > Create /user/biapp in hdfs manually first. > On 30 Oct 2015 1:36 am, "Zoltan Fedor"

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Zoltan Fedor
I don't have spark-defaults.conf and spark-env.sh, so if you have a working Spark 1.5.1 with Hive metastore access on CDH 5.3, could you please send over the settings you have in your spark-defaults.conf and spark-env.sh? Thanks On Thu, Oct 29, 2015 at 11:14 AM, Deenar Toraskar

submitting custom metrics.properties file

2015-10-29 Thread Radu Brumariu
Hi, I am trying to submit a custom metrics.properties file to enable the collection of spark metrics, but I am having a hard time even starting it in local mode. spark-submit \ ... --files "./metrics.properties" --conf "spark.metrics.conf=metrics.properties" ... However I am

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Deenar Toraskar
Here is what I did, maybe that will help you. 1) Downloaded spark-1.5.1 (with Hadoop 2.6.0) spark-1.5.1-bin-hadoop2.6 and extracted it on the edge node, set SPARK_HOME to this location 2) Copied the existing configuration (spark-defaults.conf and spark-env.sh) from your spark install

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Deenar Toraskar
Zoltan, you should have these in your existing CDH 5.3, that's the best place to get them. Find where spark is running from and it should have them. My versions are here https://gist.github.com/deenar/08fc4ac0da3bdaff10fb Deenar On 29 October 2015 at 15:29, Zoltan Fedor

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Zoltan Fedor
Thanks. I didn't have a spark-defaults.conf, nor a spark-env.sh, so I copied yours and modified the references, so now I am back to where I started. Exact same error as before $ HADOOP_USER_NAME=biapp MASTER=yarn $SPARK_HOME/bin/pyspark --deploy-mode client Error: JAVA_HOME is not set and could

SPARK SQL- Parquet projection pushdown for nested data

2015-10-29 Thread Sadhan Sood
I noticed when querying struct data in spark sql, we are requesting the whole column from parquet files. Is this intended or is there some kind of config to control this behaviour? Wouldn't it be better to request just the struct field?

RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964
Won't you be able to use a case statement to generate a virtual column (like partition_num), then use analytic SQL partitioned by this virtual column? In this case, the full dataset will be scanned just once. Yong Date: Thu, 29 Oct 2015 10:51:53 -0700 Subject: RDD's filter() or using 'where'

How to properly read the first number lines of file into a RDD

2015-10-29 Thread Zhiliang Zhu
Hi All, There is a file with N + M lines, and I need to read the first N lines into one RDD. 1. i) read all the N + M lines as one RDD, ii) select the RDD's top N rows; that may be one solution. 2. if some broadcast variable set to N is introduced, then it is used to decide while mapping the
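One possible sketch of the first approach, indexing the lines and keeping only the first N (the path and the value of N are illustrative):

    val N = 1000L
    val firstN = sc.textFile("hdfs:///data/input.txt")
      .zipWithIndex()                        // pairs each line with its global index
      .filter { case (_, idx) => idx < N }   // keep only the first N lines
      .map { case (line, _) => line }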

Loading dataframes to vertica database

2015-10-29 Thread spakle
http://www.sparkexpert.com/2015/04/17/save-apache-spark-dataframe-to-database/ Hi, I tried to load dataframes (parquet files) using the above link into MySQL and it worked. But when I tried to load them into a Vertica database, this is the error I am facing: Exception in thread “main”

Re: SPARK SQL- Parquet projection pushdown for nested data

2015-10-29 Thread Michael Armbrust
Yeah, this is unfortunate. It would be good to fix this, but it's a non-trivial change. Tracked here if you'd like to vote on the issue: https://issues.apache.org/jira/browse/SPARK-4502 On Thu, Oct 29, 2015 at 6:00 PM, Sadhan Sood wrote: > I noticed when querying struct

RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread Anfernee Xu
Hi, I have a pretty large data set (2M entities) in my RDD. The data has already been partitioned by a specific key, and the key has a range (type long). Now I want to create a bunch of key buckets; for example, if the key has range 1 -> 100, I will break the whole range into the below buckets 1

Re: newbie trouble submitting java app to AWS cluster I created using spark-ec2 script from spark-1.5.1-bin-hadoop2.6 distribution

2015-10-29 Thread Andy Davidson
Hi Robin and Sabarish, I figured out what the problem was. To submit my Java app so that it runs in cluster mode (i.e. I can close my laptop and go home) I need to do the following 1. make sure my jar file is available on all the slaves. Spark-submit will cause my driver to run on a slave; it will not

sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found

2015-10-29 Thread tstewart
I have the following script in a file named test.R: library(SparkR) sc <- sparkR.init(master="yarn-client") sqlContext <- sparkRSQL.init(sc) df <- createDataFrame(sqlContext, faithful) showDF(df) sparkR.stop() q(save="no") If I submit this with "sparkR test.R" or "R CMD BATCH test.R" or

Aster Functions equivalent in spark : cfilter, npath and sessionize

2015-10-29 Thread didier vila
Good morning all, I am interested to know if there are Aster-equivalent functions in Spark. In particular, I would like to sessionize, i.e. create sessions, and I would like to create some npath or sequence analysis based on a specific pattern. https://aster-community.teradata.com/docs/DOC-1544

Re: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread Anfernee Xu
Thanks Yong for your response. Let me see if I can understand what you're suggesting: for the whole data set, when I load it into Spark (I'm using a custom Hadoop InputFormat), I will add an extra field to each element in the RDD, like bucket_id. For example Key: 1 - 10, bucket_id=1 11-20,

Re: Aster Functions equivalent in spark : cfilter, npath and sessionize

2015-10-29 Thread Peyman Mohajerian
Some of the Aster functions you are referring to can be done using Window functions in SparkSQL: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html On Thu, Oct 29, 2015 at 12:16 PM, didier vila wrote: > Good morning all, > > I am

Re: Need more tasks in KafkaDirectStream

2015-10-29 Thread varun sharma
Cody, adding partitions to kafka is there as a last resort, I was wondering if I can decrease the processing time by not touching my Kafka cluster. Adrian, repartition looks like a good option and let me check if I can gain performance. Dibyendu, will surely try out this consumer. Thanks all,

Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required

2015-10-29 Thread Jerry Wong
I used Spark 1.3.1 to populate the event logs to Cassandra. But there is an exception and I could not find any clues about it. Can anybody give me any help? Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required at

RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964
You can do the SQL like the following: select *, case when key >= 1 and key <= 10 then 1 when key >= 11 and key <= 20 then 2 .. else 10 end as bucket_id from your table See the conditional function "case" in Hive. After you have the "bucket_id" column, then you can do whatever analytic function
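Run through Spark SQL, the same idea might look like this sketch (the table and column names are illustrative):

    val bucketed = sqlContext.sql("""
      SELECT *,
             CASE WHEN key BETWEEN 1  AND 10 THEN 1
                  WHEN key BETWEEN 11 AND 20 THEN 2
                  ELSE 10 END AS bucket_id
      FROM my_table
    """)
    bucketed.registerTempTable("bucketed")
    // analytic functions can then partition by the virtual column, e.g.
    // SELECT *, row_number() OVER (PARTITION BY bucket_id ORDER BY key) FROM bucketed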

Running FPGrowth over a JavaPairRDD?

2015-10-29 Thread Fernando Paladini
Hello guys! First of all, if you want a more readable version of this question, take a look at my StackOverflow question (I've asked the same question there). I want to test Spark machine

Maintaining overall cumulative data in Spark Streaming

2015-10-29 Thread Sandeep Giri
Dear All, If a continuous stream of text is coming in and you have to keep publishing the overall word count so far since 0:00 today, what would you do? Publishing the results for a window is easy but if we have to keep aggregating the results, how to go about it? I have tried to keep an

Issue on spark.driver.maxResultSize

2015-10-29 Thread karthik kadiyam
Hi, In a Spark streaming job I had the following setting this.jsc.getConf().set("spark.driver.maxResultSize", “0”); and I got the error in the job as below User class threw exception: Job aborted due to stage failure: Total size of serialized results of 120 tasks (1082.2 MB) is

RE: Maintaining overall cumulative data in Spark Streaming

2015-10-29 Thread Silvio Fiorito
You could use updateStateByKey. There's a stateful word count example on Github. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala From: Sandeep
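A condensed sketch of that stateful word count, assuming an existing StreamingContext ssc and a words DStream[String]:

    ssc.checkpoint("hdfs:///tmp/checkpoints")   // stateful operations require checkpointing
    val updateFunc = (newValues: Seq[Int], running: Option[Int]) =>
      Some(newValues.sum + running.getOrElse(0))
    val totalCounts = words.map(word => (word, 1)).updateStateByKey[Int](updateFunc)
    totalCounts.print()   // cumulative counts since the stream started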

Re: SPARK SQL- Parquet projection pushdown for nested data

2015-10-29 Thread Sadhan Sood
Thanks Michael, I will upvote this. On Thu, Oct 29, 2015 at 10:29 AM, Michael Armbrust wrote: > Yeah, this is unfortunate. It would be good to fix this, but its a > non-trivial change. > > Tracked here if you'd like to vote on the issue: >

Save data to different S3

2015-10-29 Thread William Li
Hi - I have a simple app running fine with Spark; it reads data from S3 and performs calculations. When reading data from S3, I use hadoopConfiguration.set for both fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey so it has permission to load the data from customer sources. However,

Re: Save data to different S3

2015-10-29 Thread Zhang, Jingyu
Try s3://aws_key:aws_secret@bucketName/folderName with your access key and secret to save the data. On 30 October 2015 at 10:55, William Li wrote: > Hi – I have a simple app running fine with Spark, it reads data from S3 > and performs calculation. > > When reading data from
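A minimal sketch of that suggestion; results, the bucket name and the keys are placeholders, and a secret containing "/" would need to be URL-encoded or supplied through configuration instead:

    val destAccessKey = "DEST_ACCESS_KEY"
    val destSecretKey = "DEST_SECRET_KEY"
    // results is the RDD computed earlier in the job
    results.saveAsTextFile(s"s3n://$destAccessKey:$destSecretKey@dest-bucket/output/")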

Need more tasks in KafkaDirectStream

2015-10-29 Thread varun sharma
Right now, there is a one-to-one correspondence between Kafka partitions and Spark partitions. I don't have a requirement of one-to-one semantics. I need more tasks to be generated in the job so that it can be parallelised and the batch can be completed fast. In the previous Receiver based approach

Re: Need more tasks in KafkaDirectStream

2015-10-29 Thread Dibyendu Bhattacharya
If you do not need one-to-one semantics and do not want a strict ordering guarantee, you can very well use the Receiver based approach, and this consumer from Spark-Packages ( https://github.com/dibbhatt/kafka-spark-consumer) can give much better alternatives in terms of performance and

Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-29 Thread Romi Kuntsman
Did you try to cache a DataFrame with just a single row? Do your rows have any columns with null values? Can you post a code snippet here on how you load/generate the dataframe? Does dataframe.rdd.cache work? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Thu, Oct 29, 2015 at 4:33

Re: Spark/Kafka Streaming Job Gets Stuck

2015-10-29 Thread srungarapu vamsi
Other than @Adrian's suggestions, check if the processing time is more than the batch interval. On Thu, Oct 29, 2015 at 2:23 AM, Adrian Tanase wrote: > Does it work as expected with smaller batch or smaller load? Could it be > that it's accumulating too many events over

Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-29 Thread Zhang, Jingyu
Thanks Romi, I resized the dataset to 7MB; however, the code throws the NullPointerException as well. Did you try to cache a DataFrame with just a single row? Yes, I tried. But, same problem. Do your rows have any columns with null values? No, I had filtered out null values before caching the

Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-29 Thread Romi Kuntsman
> > BUT, after change limit(500) to limit(1000). The code report > NullPointerException. > I had a similar situation, and the problem was with a certain record. Try to find which records are returned when you limit to 1000 but not returned when you limit to 500. Could it be a NPE thrown from

Packaging a jar for a jdbc connection using sbt assembly and scala.

2015-10-29 Thread dean.wood
I'm having a problem building a spark jar with scala. It's a really simple thing: I want to programmatically access a mysql server via JDBC and load it in to a spark data frame. I can get this to work in the spark shell but I cannot package a jar that works with spark submit. It will package but

Spark standalone: zookeeper timeout configuration

2015-10-29 Thread zedar
Hi, in my standalone installation I use zookeeper for high availability (2 master nodes). Could you tell me if it is possible to configure zookeeper timeout (for checking if active master node is alive) and retry interval? Thanks, Robert -- View this message in context:

Re: Need more tasks in KafkaDirectStream

2015-10-29 Thread Adrian Tanase
You can call .repartition on the Dstream created by the Kafka direct consumer. You take the one-time hit of a shuffle but gain the ability to scale out processing beyond your number of partitions. We’re doing this to scale up from 36 partitions / topic to 140 partitions (20 cores * 7 nodes)
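A minimal sketch of that call, assuming a direct stream already created with ssc, kafkaParams and a topic set; the partition count of 140 is just the example from this thread:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    // one shuffle per batch, in exchange for using more cores than there are Kafka partitions
    val widened = stream.repartition(140)
    widened.foreachRDD { rdd =>
      // downstream processing now runs with 140 tasks per batch
    }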

Re: Packaging a jar for a jdbc connection using sbt assembly and scala.

2015-10-29 Thread Deenar Toraskar
Hi Dean I guess you are using Spark 1.3. - The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the
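In practice that usually means passing the JDBC driver jar to spark-submit explicitly; a hedged example, with illustrative paths and class name:

    spark-submit \
      --class com.example.MyJdbcApp \
      --driver-class-path /path/to/mysql-connector-java-5.1.36.jar \
      --conf spark.executor.extraClassPath=/path/to/mysql-connector-java-5.1.36.jar \
      --jars /path/to/mysql-connector-java-5.1.36.jar \
      myapp-assembly.jar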

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Deenar Toraskar
*Hi Zoltan* Add hive-site.xml to your YARN_CONF_DIR. i.e. $SPARK_HOME/conf/yarn-conf Deenar *Think Reactive Ltd* deenar.toras...@thinkreactive.co.uk 07714140812 On 28 October 2015 at 14:28, Zoltan Fedor wrote: > Hi, > We have a shared CDH 5.3.3 cluster and trying to

Re: Running FPGrowth over a JavaPairRDD?

2015-10-29 Thread Sabarish Sasidharan
Hi, You cannot use a PairRDD but you can use a JavaRDD. So in your case, to make it work with the least change, you would call run(transactions.values()). Each MLlib implementation typically has its own data structure and you would have to convert from your data structure before you invoke it. For example, if you

Re: SparkLauncher is blocked until main process is killed.

2015-10-29 Thread 陈宇航
Some additional information: the main process shares jar files with the Spark job's driver and executor as their classpaths. It could not be those files' read/write lock, right? ------ Original Message ------ From: "Ted Yu" Date: 2015-10-30 11:46:21 To:

Spark 1.5.1 Dynamic Resource Allocation

2015-10-29 Thread tstewart
I am running the following command on a Hadoop cluster to launch Spark shell with DRA: spark-shell --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=4 --conf spark.dynamicAllocation.maxExecutors=12 --conf

RE: Maintaining overall cumulative data in Spark Streaming

2015-10-29 Thread Sandeep Giri
Yes, update state by key worked. Though there are some more complications. On Oct 30, 2015 8:27 AM, "skaarthik oss" wrote: > Did you consider UpdateStateByKey operation? > > > > *From:* Sandeep Giri [mailto:sand...@knowbigdata.com] > *Sent:* Thursday, October 29, 2015

Re: issue with spark.driver.maxResultSize parameter in spark 1.3

2015-10-29 Thread shahid ashraf
Hi, I guess you need to increase the Spark driver memory as well, but that should be set in the conf files. Let me know if that resolves it. On Oct 30, 2015 7:33 AM, "karthik kadiyam" wrote: > Hi, > > In spark streaming job i had the following setting > >

Re: How do I parallize Spark Jobs at Executor Level.

2015-10-29 Thread Vinoth Sankar
Hi Adrian, Yes. I need to load all the files and process them in parallel. The following code doesn't seem to work (here I used map, and even tried foreach); I'm just downloading the files from HDFS to the local system and printing the log count in each file. It's not throwing any exceptions, but it's not working. Files

SparkLauncher is blocked until mail process is killed.

2015-10-29 Thread 陈宇航
I tried to use SparkLauncher (org.apache.spark.launcher.SparkLauncher) to submit a Spark Streaming job; however, in my test, the SparkSubmit process got stuck in the "addJar" procedure. Only when the main process (the caller of SparkLauncher) is killed does the submit procedure continue to run. I

Re: SparkLauncher is blocked until mail process is killed.

2015-10-29 Thread Jey Kottalam
Could you please provide the jstack output? That would help the devs identify the blocking operation more easily. On Thu, Oct 29, 2015 at 6:54 PM, 陈宇航 wrote: > I tried to use SparkLauncher (org.apache.spark.launcher.SparkLauncher) to > submit a Spark Streaming job,

issue with spark.driver.maxResultSize parameter in spark 1.3

2015-10-29 Thread karthik kadiyam
Hi, In a Spark streaming job I had the following setting this.jsc.getConf().set("spark.driver.maxResultSize", “0”); and I got the error in the job as below User class threw exception: Job aborted due to stage failure: Total size of serialized results of 120 tasks (1082.2 MB) is bigger

Re: [Spark] java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-29 Thread Deng Ching-Mallete
Hi Yifan, This is a known issue, please refer to https://issues.apache.org/jira/browse/SPARK-6235 for more details. In your case, it looks like you are caching to disk a partition > 2G. A workaround would be to increase the number of your RDD partitions in order to make them smaller in size.
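Concretely, that workaround could look like either of these sketches; bigRdd, keyFunc and the partition count of 2000 are illustrative:

    // re-sort with more output partitions so each cached block stays well under 2GB
    val sortedRdd = bigRdd.sortBy(keyFunc, ascending = true, numPartitions = 2000)
    // or split up an existing RDD before caching or iterating over it
    val smallerBlocks = bigRdd.repartition(2000)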

Re: SparkLauncher is blocked until main process is killed.

2015-10-29 Thread Ted Yu
Which Spark release are you using ? Please note the typo in email subject (corrected as of this reply) On Thu, Oct 29, 2015 at 7:00 PM, Jey Kottalam wrote: > Could you please provide the jstack output? That would help the devs > identify the blocking operation more

Re: Pivot Data in Spark and Scala

2015-10-29 Thread Deng Ching-Mallete
Hi, You could transform it into a pair RDD then use the combineByKey function. HTH, Deng On Thu, Oct 29, 2015 at 7:29 PM, Ascot Moss wrote: > Hi, > > I have data as follows: > > A, 2015, 4 > A, 2014, 12 > A, 2013, 1 > B, 2015, 24 > B, 2013 4 > > > I need to convert the
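A minimal sketch of that pair-RDD plus combineByKey idea, assuming well-formed comma-separated input of (key, year, value) like the sample in the question, and a known set of years; the input path is illustrative:

    val years = Seq(2015, 2014, 2013)
    val pivoted = sc.textFile("input.csv")
      .map(_.split(",").map(_.trim))
      .map(a => (a(0), (a(1).toInt, a(2).toInt)))              // (key, (year, value))
      .combineByKey(
        (v: (Int, Int)) => Map(v._1 -> v._2),                  // start a year -> value map per key
        (m: Map[Int, Int], v: (Int, Int)) => m + v,            // fold in another (year, value)
        (m1: Map[Int, Int], m2: Map[Int, Int]) => m1 ++ m2)    // merge partial maps
      .map { case (key, byYear) =>
        (key +: years.map(y => byYear.get(y).map(_.toString).getOrElse(""))).mkString(",")
      }
    // pivoted contains lines like "A,4,12,1" and "B,24,,4"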