Regarding rdd.collect()

2015-08-17 Thread praveen S
When I do an rdd.collect(), does the data move back to the driver, or is it still held in memory across the executors?
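
A minimal sketch (hypothetical names, local master just for illustration) of what this implies: collect() copies every element back into the driver JVM, while any cached partitions stay on the executors until they are unpersisted.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("collect-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000).cache()   // cached blocks live on the executors
    rdd.count()                                   // materializes the cache on the executors
    val local: Array[Int] = rdd.collect()         // copies every element into the driver JVM
    // collect() adds a full extra copy in driver memory (risky for large datasets);
    // the executor-side cache is only released by unpersist().
    rdd.unpersist()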

Exception when S3 path contains colons

2015-08-17 Thread Brian Stempin
Hi, I'm running Spark on Amazon EMR (Spark 1.4.1, Hadoop 2.6.0). I'm seeing the exception below when encountering file names that contain colons. Any idea on how to get around this? scala> val files = sc.textFile("s3a://redactedbucketname/*") 2015-08-18 04:38:34,567 INFO [main] storage.MemoryS

Re: java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString

2015-08-17 Thread stark_summer
approach 1: when submitting the Spark job, add the following: --conf "spark.driver.extraClassPath=/home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar" --conf "spark.executor.extraClassPath=/home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar" for example: /home/dp/spark/spark-1.4/spark-1.4.1/bin/spar

Re: Spark 1.4.1 - Mac OSX Yosemite

2015-08-17 Thread Alun Champion
Yes, both are set. I just recompiled and still had no success; it fails silently. Which versions of Java and Scala are you using? On 17 August 2015 at 19:59, Charlie Hack wrote: > I had success earlier today on OSX Yosemite 10.10.4 building Spark 1.4.1 > using these instructions >

Re: java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString

2015-08-17 Thread Ted Yu
Have you tried adding path to hbase-protocol jar to spark.driver.extraClassPath and spark.executor.extraClassPath ? Cheers On Mon, Aug 17, 2015 at 7:51 PM, stark_summer wrote: > spark vesion:1.4.1 > java version:1.7 > hadoop version: > Hadoop 2.3.0-cdh5.1.0 > > submit spark job to yarn cluster
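
A hedged Scala sketch of the same idea, reusing the jar path from the earlier message (adjust per cluster): the executor-side class path can be set in the application's SparkConf, while the driver-side one has to be supplied at launch time (spark-submit --conf or spark-defaults.conf), since the driver JVM is already running by the time user code executes.

    import org.apache.spark.{SparkConf, SparkContext}

    val hbaseProtocolJar = "/home/cluster/apps/hbase/lib/hbase-protocol-0.98.1-cdh5.1.0.jar"

    val conf = new SparkConf()
      .setAppName("hbase-read-job")
      // read when executors are launched, so it may live in the application conf
      .set("spark.executor.extraClassPath", hbaseProtocolJar)
    // spark.driver.extraClassPath must instead go on the spark-submit command line
    val sc = new SparkContext(conf)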

java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString

2015-08-17 Thread stark_summer
Spark version: 1.4.1, Java version: 1.7, Hadoop version: Hadoop 2.3.0-cdh5.1.0. I submit a Spark job to a YARN cluster that reads HBase data; after the job starts running, it fails with the error below: 15/08/17 19:28:33 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.hadoop.hbase.DoNotRetryIOException: jav

Re: Spark 1.4.1 - Mac OSX Yosemite

2015-08-17 Thread Charlie Hack
I had success earlier today on OSX Yosemite 10.10.4 building Spark 1.4.1 using these instructions (using `$ sbt/sbt clean assembly`, with the additional step of downloading the proper sbt-launch.jar (0.13.7) from

Re: Serializing MLlib MatrixFactorizationModel

2015-08-17 Thread Joseph Bradley
I'd recommend using the built-in save and load, which will be better for cross-version compatibility. You should be able to call myModel.save(path), and load it back with MatrixFactorizationModel.load(path). On Mon, Aug 17, 2015 at 6:31 AM, Madawa Soysa wrote: > Hi All, > > I have an issue when
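
A minimal sketch of that save/load round trip, assuming the Spark 1.3+ MLlib export API (sc, the model, and the path are placeholders):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    def roundTrip(sc: SparkContext, myModel: MatrixFactorizationModel): MatrixFactorizationModel = {
      val path = "hdfs:///models/als-model"      // hypothetical location
      myModel.save(sc, path)                     // writes metadata plus the factor RDDs
      MatrixFactorizationModel.load(sc, path)    // reads it back, also across JVMs
    }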

Spark 1.4.1 - Mac OSX Yosemite

2015-08-17 Thread Alun Champion
Has anyone experienced issues running Spark 1.4.1 on Mac OSX Yosemite? I've been running standalone 1.3.1 fine, but it failed when trying to run 1.4.1 (I also tried 1.4.0). I've tried both the pre-built packages as well as compiling from source, both with the same results (I can successfully co

Re: graph x issue spark 1.3

2015-08-17 Thread David Zeelen
The code below is taken from the Spark website and generates the error detailed. Hi, using Spark 1.3 and trying some sample code: val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")), (5L, ("franklin", "prof")), (2L, ("istoica",
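
For context, the full sample this appears to be drawn from (the property-graph example in the GraphX programming guide) looks roughly like this; sc is an existing SparkContext:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    def buildGraph(sc: SparkContext): Graph[(String, String), String] = {
      val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array(
        (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
        (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
      val relationships: RDD[Edge[String]] = sc.parallelize(Array(
        Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
        Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
      val defaultUser = ("John Doe", "Missing")
      Graph(users, relationships, defaultUser)   // build the property graph
    }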

how do I execute a job on a single worker node in standalone mode

2015-08-17 Thread Axel Dahl
I have a 4-node cluster and have been playing around with the num-executors, executor-memory and executor-cores parameters. I set the following: --executor-memory=10G --num-executors=1 --executor-cores=8 But when I run the job, I see that each worker is running one executor which has 2 cores and
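
A hedged sketch of one way to steer this in standalone mode (where --num-executors is a YARN-only flag): cap the application's total cores and, on the master, turn off spreading so the cores are packed onto as few workers as possible.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("single-worker-job")
      .set("spark.cores.max", "8")   // total cores the application may claim across the cluster
    // In addition, spark.deploy.spreadOut=false (configured on the standalone master) makes the
    // master consolidate those cores onto as few workers as possible instead of one per node.
    val sc = new SparkContext(conf)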

Python's ReduceByKeyAndWindow DStream Keeps Growing

2015-08-17 Thread Asim Jalis
When I use reduceByKeyAndWindow with func and invFunc (in PySpark), the size of the window keeps growing. I am appending the code that reproduces this issue. It prints out the count() of the DStream, which goes up by 10 elements every batch. Is this a bug in the Python API, or is this

Re: Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-17 Thread Stephen Boesch
In 1.4 it is change-scala-version.sh 2.11. But the problem was that it is -Dscala-2.11, not a -P profile; I misread the docs. 2015-08-17 14:17 GMT-07:00 Ted Yu : > You were building against 1.4.x, right ? > > In master branch, switch-to-scala-2.11.sh is gone. There is scala-2.11 > profile. > > FYI > > On

Re: Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-17 Thread Ted Yu
You were building against 1.4.x, right ? In master branch, switch-to-scala-2.11.sh is gone. There is scala-2.11 profile. FYI On Sun, Aug 16, 2015 at 11:12 AM, Stephen Boesch wrote: > > I am building spark with the following options - most notably the > **scala-2.11**: > > . dev/switch-to-scal

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Jerry Lam
Hi Nick, I forgot to mention in the survey that Ganglia is never installed properly, for some reason. I get this exception every time I launch the cluster: Starting httpd: httpd: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so into serv

Embarrassingly parallel computation in SparkR?

2015-08-17 Thread Kristina Rogale Plazonic
Hi, I'm wondering how to achieve, say, a Monte Carlo simulation in SparkR without use of low level RDD functions that were made private in 1.4, such as parallelize and map. Something like parallelize(sc, 1:1000).map ( ### R code that does my computation ) where the code is the same on every n

registering an empty RDD as a temp table in a PySpark SQL context

2015-08-17 Thread Eric Walker
I have an RDD queried from a scan of a data source. Sometimes the RDD has rows and at other times it has none. I would like to register this RDD as a temporary table in a SQL context. I suspect this will work in Scala, but in PySpark some code assumes that the RDD has rows in it, which are used
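
For comparison, a Scala sketch of the analogous case: with an explicit schema, even an empty RDD can be turned into a DataFrame and registered (names here are hypothetical).

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    def registerMaybeEmpty(sc: SparkContext): Unit = {
      val sqlContext = new SQLContext(sc)
      val schema = StructType(Seq(StructField("id", StringType), StructField("value", StringType)))
      val rows = sc.emptyRDD[Row]   // could just as well be the (possibly empty) result of a scan
      sqlContext.createDataFrame(rows, schema).registerTempTable("maybe_empty")
    }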

Re: issue Running Spark Job on Yarn Cluster

2015-08-17 Thread poolis
Did you resolve this issue? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issue-Running-Spark-Job-on-Yarn-Cluster-tp21779p24300.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-17 Thread Steve Loughran
With the right FTP client JAR on your classpath (I forget which), you can use ftp:// as a source for a Hadoop FS operation. You may even be able to use it as an input for a Spark (non-streaming) job directly. On 14 Aug 2015, at 14:11, Varadhan, Jawahar wro
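
A strongly hedged sketch of what that might look like (Hadoop ships an FTPFileSystem backed by commons-net, which would need to be on the classpath; host, credentials and path are placeholders, and sc is an existing SparkContext):

    // Reading an FTP URI like any other Hadoop path -- a sketch, not a tested recipe.
    val lines = sc.textFile("ftp://user:password@ftp.example.com/incoming/bigfile.csv")
    lines.take(5).foreach(println)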

Re: rdd count is throwing null pointer exception

2015-08-17 Thread Priya Ch
Looks like it is because of SPARK-5063: RDD transformations and actions can only be invoked by the driver, not inside other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside the rdd1.map trans
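
A small sketch of the restriction with placeholder RDDs, contrasting the invalid nested use with a driver-side workaround:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def example(sc: SparkContext): Unit = {
      val rdd1: RDD[Int] = sc.parallelize(1 to 10)
      val rdd2: RDD[(String, Int)] = sc.parallelize(Seq("a" -> 1, "b" -> 2))

      // Invalid: rdd2 is referenced inside a transformation of rdd1, i.e. on the executors.
      // val broken = rdd1.map(x => rdd2.values.count() * x)   // fails at runtime (SPARK-5063)

      // Valid: run the action on the driver first, then use the plain value inside the closure.
      val factor = rdd2.values.count()
      rdd1.map(x => factor * x).count()
    }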

Re: Left outer joining big data set with small lookups

2015-08-17 Thread Silvio Fiorito
Try doing a count on both lookups to force the caching to occur before the join. On 8/17/15, 12:39 PM, "VIJAYAKUMAR JAWAHARLAL" wrote: >Thanks for your help > >I tried to cache the lookup tables and left out join with the big table (DF). >Join does not seem to be using broadcast join-still i
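
A sketch of that advice with hypothetical table and column names; whether the broadcast actually happens also depends on the optimizer having size statistics for the lookup table, and the threshold value below is only an example.

    import org.apache.spark.sql.SQLContext

    def lookupJoin(sqlContext: SQLContext): Unit = {
      val bigDf = sqlContext.table("table1")
      val lkup  = sqlContext.table("table2").cache()
      lkup.count()   // materialize the cache before the join, as suggested above

      // tables believed to be smaller than this many bytes are candidates for broadcast
      sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

      val joined = bigDf.join(lkup, bigDf("lkupid") === lkup("id"), "left_outer")
      joined.count()
    }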

Re: Left outer joining big data set with small lookups

2015-08-17 Thread VIJAYAKUMAR JAWAHARLAL
Thanks for your help. I tried to cache the lookup tables and left outer join them with the big table (DF). The join does not seem to be using a broadcast join; it still goes with a hash-partition join and shuffles the big table. Here is the scenario … table1 as big_df left outer join table2 as lkup on big_df.lkup

Calling hiveContext.sql("insert into table xyz...") in multiple threads?

2015-08-17 Thread unk1102
Hi, I have around 2000 Hive source partitions to process, inserting data into the same table but different partitions. For example, I have the following query: hiveContext.sql("insert into table myTable partition(mypartition="someparition") bla bla). If I call the above query in the Spark driver program it runs fine

What's the logic in RangePartitioner.rangeBounds method of Apache Spark

2015-08-17 Thread ihainan
*First of all, sorry for my poor English.* I was reading the source code of Apache Spark 1.4.1 and I got stuck on the logic of the RangePartitioner.rangeBounds method. The code is shown below. Can anyone please explain: 1. What is the "3.0 *" for in the line "val sampleSizePerPa
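
For reference, a sketch of the sampling arithmetic as I read it in the 1.4.x source (excerpt-style, with hypothetical sizes, not a drop-in replacement): the 3.0 factor over-samples each input partition so that skewed partitions still contribute enough samples, and partitions found to be under-represented are re-sampled afterwards.

    val partitions = 8              // hypothetical target number of output partitions
    val numInputPartitions = 100    // hypothetical rdd.partitions.size

    // cap the total sample at ~1e6 items: 20 samples per output partition on average
    val sampleSize = math.min(20.0 * partitions, 1e6)
    // over-sample each input partition by 3x to tolerate imbalance between partitions
    val sampleSizePerPartition = math.ceil(3.0 * sampleSize / numInputPartitions).toInt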

Re: Apache Spark - Parallel Processing of messages from Kafka - Java

2015-08-17 Thread unk1102
val numStreams = 4 val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) } In Java, in a for loop, you would create four streams using KafkaUtils.createStream() so that each receiver runs in a different thread. For more information please visit http://spark.apache.org/doc

Re: S3n, parallelism, partitions

2015-08-17 Thread Akshat Aranya
This will also depend on the file format you are using. A word of advice: you would be much better off with the s3a file system. As I found out recently the hard way, s3n has some issues with reading through entire files even when looking for headers. On Mon, Aug 17, 2015 at 2:10 AM, Akhil Das w

[survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Nicholas Chammas
Howdy folks! I’m interested in hearing about what people think of spark-ec2 outside of the formal JIRA process. Your answers will all be anonymous and public. If the embedded form below doesn’t work for you, you can use this link to get the s

Spark Job Hangs on our production cluster

2015-08-17 Thread java8964
I am comparing the Spark log line by line between the hanging case (big dataset) and the non-hanging case (small dataset). In the hanging case, the log looks identical to the non-hanging case for reading the first block of data from HDFS. But after that, starting from line 438 in the spark

Re: Paper on Spark SQL

2015-08-17 Thread Ted Yu
Thanks Nan. That is why I always put an extra space between URL and punctuation in my comments / emails. On Mon, Aug 17, 2015 at 6:31 AM, Nan Zhu wrote: > an extra “,” is at the end > > -- > Nan Zhu > http://codingcat.me > > On Monday, August 17, 2015 at 9:28 AM, Ted Yu wrote: > > I got 404 whe

Re: graph x issue spark 1.3

2015-08-17 Thread Sonal Goyal
I have been using graphx in production on 1.3 and 1.4 with no issues. What's the exception you see and what are you trying to do? On Aug 17, 2015 10:49 AM, "dizzy5112" wrote: > Hi using spark 1.3 and trying some sample code: > > > when i run: > > all works well but with > > it falls over and i g

Re: rdd count is throwing null pointer exception

2015-08-17 Thread Preetam
The error could be because of the missing brackets after the word cache - .ticketRdd.cache() > On Aug 17, 2015, at 7:26 AM, Priya Ch wrote: > > Hi All, > > Thank you very much for the detailed explanation. > > I have scenario like this- > I have rdd of ticket records and another rdd of book

rdd count is throwing null pointer exception

2015-08-17 Thread Priya Ch
Hi All, Thank you very much for the detailed explanation. I have a scenario like this: I have an RDD of ticket records and another RDD of booking records. For each ticket record, I need to check whether any link exists in the booking table. val ticketCachedRdd = ticketRdd.cache ticketRdd.foreach{ ticke

Re: Spark Interview Questions

2015-08-17 Thread Sandeep Giri
This statement is from the Spark website itself. Regards, Sandeep Giri, +1 347 781 4573 (US) +91-953-899-8962 (IN) www.KnowBigData.com. Phone: +1-253-397-1945 (Office)

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-17 Thread Cody Koeninger
Look at the definitions of the java-specific KafkaUtils.createDirectStream methods (the ones that take a JavaStreamingContext) On Mon, Aug 17, 2015 at 5:13 AM, Shushant Arora wrote: > How to create classtag in java ?Also Constructor > of DirectKafkaInputDStream takes Function1 not Function but >

Serializing MLlib MatrixFactorizationModel

2015-08-17 Thread Madawa Soysa
Hi All, I have an issue when i try to serialize a MatrixFactorizationModel object as a java object in a Java application. When I deserialize the object, I get the following exception. Caused by: java.lang.ClassNotFoundException: org.apache.spark.OneToOneDependency cannot be found by org.scala-lan

Re: Paper on Spark SQL

2015-08-17 Thread Nan Zhu
an extra “,” is at the end -- Nan Zhu http://codingcat.me On Monday, August 17, 2015 at 9:28 AM, Ted Yu wrote: > I got 404 when trying to access the link. > > > > On Aug 17, 2015, at 5:31 AM, Todd mailto:bit1...@163.com)> > wrote: > > > Hi, > > I can't access > > http://people.csa

Re: Paper on Spark SQL

2015-08-17 Thread Ted Yu
I got 404 when trying to access the link. > On Aug 17, 2015, at 5:31 AM, Todd wrote: > > Hi, > I can't access > http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf. > Could someone help try to see if it is available and reply with it? Thanks!

Paper on Spark SQL

2015-08-17 Thread Todd
Hi, I can't access http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf. Could someone help try to see if it is available and reply with it? Thanks!

Re: SparkPi is geting java.lang.NoClassDefFoundError: scala/collection/Seq

2015-08-17 Thread xiaohe lan
Yeah, lots of libraries need their scope changed to compile in order to run the examples in IntelliJ. Thanks, Xiaohe On Mon, Aug 17, 2015 at 10:01 AM, Jeff Zhang wrote: > Check module example's dependency (right click examples and click Open > Modules Settings), by default scala-library is provided,

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-17 Thread Shushant Arora
How do I create a ClassTag in Java? Also, the constructor of DirectKafkaInputDStream takes a Function1, not a Function, but KafkaUtils.createDirectStream accepts a Function. I have the below as an overridden DirectKafkaInputDStream: public class CustomDirectKafkaInputDstream extends DirectKafkaInputDStream{ public Custo

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-17 Thread Petr Novak
Or can I, in general, create a new RDD from a transformation and enrich its partitions with some metadata, so that OffsetRanges are copied into my new RDD in the DStream? On Mon, Aug 17, 2015 at 1:08 PM, Petr Novak wrote: > Hi all, > I need to transform KafkaRDD into a new stream of deserialized case > clas

Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-17 Thread Petr Novak
Hi all, I need to transform KafkaRDD into a new stream of deserialized case classes. I want to use the new stream to save it to file and to perform additional transformations on it. To save it I want to use offsets in filenames, hence I need OffsetRanges in transformed RDD. But KafkaRDD is private
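
A sketch of the usual workaround from the Kafka integration guide (broker, topic and output path are placeholders): capture the OffsetRanges on the first transformation, while the RDD is still a KafkaRDD, then transform freely afterwards.

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

    def build(ssc: StreamingContext): Unit = {
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      val topics = Set("events")

      val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)

      var offsetRanges = Array.empty[OffsetRange]

      val values = directStream.transform { rdd =>
        offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges   // still a KafkaRDD here
        rdd
      }.map(_._2)   // downstream transformations lose HasOffsetRanges, but the var keeps them

      values.foreachRDD { rdd =>
        val suffix = offsetRanges.map(r => s"${r.topic}-${r.partition}-${r.fromOffset}").mkString("_")
        rdd.saveAsTextFile(s"/data/out/batch-$suffix")   // hypothetical output location
      }
    }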

Re: Meaning of local[2]

2015-08-17 Thread Daniel Darabos
Hi Praveen, On Mon, Aug 17, 2015 at 12:34 PM, praveen S wrote: > What does this mean in .setMaster("local[2]") > Local mode (executor in the same JVM) with 2 executor threads. > Is this applicable only for standalone Mode? > It is not applicable for standalone mode, only for local. > Can I do
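
A minimal sketch contrasting the two kinds of master URL discussed above (host and port are placeholders):

    import org.apache.spark.SparkConf

    val localConf   = new SparkConf().setAppName("demo").setMaster("local[2]")   // one JVM, 2 threads
    val clusterConf = new SparkConf().setAppName("demo").setMaster("spark://master-host:7077")
    // "[2]" on its own is not a valid master URL; on a cluster the resources are bounded
    // instead with settings such as spark.cores.max or spark.executor.cores.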

Meaning of local[2]

2015-08-17 Thread praveen S
What does this mean in .setMaster("local[2]")? Is this applicable only for standalone mode? Can I do this in a cluster setup, e.g. .setMaster("[2]")? Is it the number of threads per worker node?

Re: Too many files/dirs in hdfs

2015-08-17 Thread UMESH CHAUDHARY
In Spark Streaming you can simply check whether your RDD contains any records, and only save it when it does (e.g. using a FileOutputStream): dstream.foreachRDD { rdd => if (rdd.count() > 0) { /* SAVE YOUR STUFF */ } } This will not create unnecessary files of 0 bytes. On Mon,

Re: Spark executor lost because of time out even after setting quite long time out value 1000 seconds

2015-08-17 Thread Akhil Das
It could be stuck on a GC pause. Can you check a bit more in the executor logs and see what's going on? Also, from the driver UI you can see at which stage it is getting stuck, etc. Thanks Best Regards On Sun, Aug 16, 2015 at 11:45 PM, unk1102 wrote: > Hi I have written Spark job which see

Re: Spark hangs on collect (stuck on scheduler delay)

2015-08-17 Thread Akhil Das
You need to debug further and figure out the bottleneck. Why are you doing a collect? If the dataset is huge, that will most likely hang the driver machine. It would be good if you could paste the sample code; without that it's really hard to understand the flow of your program. Thanks Best Regards O

Re: How to run spark in standalone mode on cassandra with high availability?

2015-08-17 Thread Akhil Das
Have a look at Mesos. Thanks Best Regards On Sat, Aug 15, 2015 at 1:03 PM, Vikram Kone wrote: > Hi, > We are planning to install Spark in stand alone mode on cassandra cluster. > The problem, is since Cassandra has a no-SPOF architecture ie any node can > become the master for the cluster, it c

Re: Too many files/dirs in hdfs

2015-08-17 Thread Akhil Das
Currently, Spark Streaming creates a new directory for every batch and stores the data in it (whether it has anything or not). There is no direct append call as of now, but you can achieve this either with FileUtil.copyMerge

Re: Help with persist: Data is requested again

2015-08-17 Thread Akhil Das
Are you triggering an action within the while loop? How are you loading the data from JDBC? You need to make sure the job has enough partitions to run in parallel to increase performance. Thanks Best Regards On Sat, Aug 15, 2015 at 2:41 AM, wrote: > Hello all, > > I am writing a program which

Re: Cannot cast to Tuple when running in cluster mode

2015-08-17 Thread Akhil Das
That looks like scala version mismatch. Thanks Best Regards On Fri, Aug 14, 2015 at 9:04 PM, wrote: > Hi All, > > I have a working program, in which I create two big tuples2 out of the > data. This seems to work in local but when I switch over cluster standalone > mode, I get this error at the

Re: S3n, parallelism, partitions

2015-08-17 Thread Akhil Das
s3n underneath uses the Hadoop API, so I guess it would partition according to your Hadoop configuration (128 MB per partition by default). Thanks Best Regards On Mon, Aug 17, 2015 at 2:29 PM, matd wrote: > Hello, > > I would like to understand how the work is parallelized across a Spark > clust
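
Related sketch: regardless of how the input is initially split, sc.textFile accepts a minPartitions hint asking the underlying Hadoop input format for at least that many splits (the bucket path is the one from the question; sc is an existing SparkContext).

    val rdd = sc.textFile("s3n://bucket_xyz/some_folder_having_many_files_in_it/", minPartitions = 64)
    println(rdd.partitions.length)   // one partition per split; repartition() can raise it further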

S3n, parallelism, partitions

2015-08-17 Thread matd
Hello, I would like to understand how the work is parallelized across a Spark cluster (and what is left to the driver) when I read several files from a single folder in S3, "s3n://bucket_xyz/some_folder_having_many_files_in_it/". How are files (or file parts) mapped to partitions? Thanks Mathie

Programmatically create SparkContext on YARN

2015-08-17 Thread Andreas Fritzler
Hi all, when running the Spark cluster in standalone mode I am able to create the Spark context from Java via the following code snippet: SparkConf conf = new SparkConf().setAppName("MySparkApp").setMaster("spark://SPARK_MASTER:7077").setJars(jars); JavaSparkContext sc = new Ja
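
For the YARN case, a hedged Scala sketch of the analogous snippet, assuming Spark 1.4, yarn-client mode, and HADOOP_CONF_DIR/YARN_CONF_DIR pointing at the cluster configuration so the context can find the ResourceManager (the jar path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MySparkApp")
      .setMaster("yarn-client")              // yarn-cluster mode cannot be started in-process this way
      .setJars(Seq("/path/to/my-app.jar"))
    val sc = new SparkContext(conf)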