Re: Troubling Logging w/Simple Example (spark-1.2.2-bin-hadoop2.4)...

2015-05-06 Thread Akhil Das
You have an issue with your cluster setup. Can you paste your conf/spark-env.sh and the conf/slaves files here? The reason why your job is running fine is because you set the master inside the job as local[*] which runs in local mode (not in standalone cluster mode). Thanks Best Regards On

RE: Creating topology in spark streaming

2015-05-06 Thread Evo Eftimov
What is called Bolt in Storm is essentially a combination of [Transformation/Action and DStream RDD] in Spark – so to achieve higher parallelism for a specific Transformation/Action on a specific DStream RDD, simply repartition it to the required number of partitions, which directly relates to the

large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-06 Thread Michal Haris
Just wanted to check if somebody has seen similar behaviour or knows what we might be doing wrong. We have a relatively complex spark application which processes half a terabyte of data at various stages. We have profiled it in several ways and everything seems to point to one place where 90% of

Re: Creating topology in spark streaming

2015-05-06 Thread Juan Rodríguez Hortalá
Hi, I agree with Evo, Spark works at a different abstraction level than Storm, and there is not a direct translation from Storm topologies to Spark Streaming jobs. I think something remotely close is the notion of lineage of DStreams or RDDs, which is similar to a logical plan of an engine like

Re: Error in SparkSQL/Scala IDE

2015-05-06 Thread Iulian Dragoș
Hi, I just saw this question. I posted my solution to this stack overflow question. https://stackoverflow.com/questions/29796928/whats-the-most-efficient-way-to-filter-a-dataframe Scala reflection can take a classloader when creating a mirror ( universe.runtimeMirror(loader)). I can have a look,

Re: Creating topology in spark streaming

2015-05-06 Thread Juan Rodríguez Hortalá
Hi, You can use the method repartition from DStream (for the Scala API) or JavaDStream (for the Java API): def repartition(numPartitions: Int): DStream[T] https://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/dstream/DStream.html Return a new DStream with an increased or
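
A minimal Scala sketch of the repartition call Juan points to; the source, path, batch interval, and partition count are illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("RepartitionSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical source; repartition(16) lets up to 16 tasks run per batch downstream.
    val lines = ssc.textFileStream("hdfs:///tmp/stream-input")
    val widened = lines.repartition(16)
    widened.foreachRDD(rdd => println(s"partitions: ${rdd.partitions.length}"))

    ssc.start()
    ssc.awaitTermination()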

Re: MLlib libsvm issues with data

2015-05-06 Thread doyere
Hi all, After doing some tests I finally solved it. I'm writing here for other people who hit this question. Here's an example of the data format error I faced: 0 0:0 1:0 2:1 1 1:1 3:2 The entries 0:0 and 1:0/1:1 are the reason for the ArrayIndexOutOfBoundsException. If someone faced the same question, just
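
A hedged sketch of the point being made: MLlib's LIBSVM loader expects one-based, ascending feature indices, so an index of 0 (as in 0:0 above) breaks parsing. The path below is illustrative.

    import org.apache.spark.mllib.util.MLUtils

    // Valid LIBSVM lines use ONE-based indices, e.g.
    //   0 1:0 2:0 3:1
    //   1 2:1 4:2
    // A zero index such as "0:0" leads to the ArrayIndexOutOfBoundsException described above.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///tmp/sample_libsvm.txt")
    println(data.first().features.size)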

Re: SparkR: filter() function?

2015-05-06 Thread himaeda
Has this issue re-appeared? I posted this on SO before I knew about this list... http://stackoverflow.com/questions/30057702/sparkr-filterrdd-and-flatmap-not-working Also I don't have access to the issues on

The explanation of input text format using LDA in Spark

2015-05-06 Thread Cui xp
Hi all, After I read the example code using LDA in Spark, I found the input text in the code is a matrix. the format of the text is as follows: 1 2 6 0 2 3 1 1 0 0 3 1 3 0 1 3 0 0 2 0 0 1 1 4 1 0 0 4 9 0 1 2 0 2 1 0 3 0 0 5 0 2 3 9 3 1 1 9 3 0 2 0 0 1 3 4 2 0 3 4 5 1 1 1 4 0 2 1 0 3 0 0 5 0 2 2
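
A minimal Scala sketch of how that matrix is consumed, following the MLlib LDA example: each line is one document, given as term counts over a fixed vocabulary (the path and k are illustrative):

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("data/mllib/sample_lda_data.txt")
    // One dense term-count vector per document.
    val parsed = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
    // LDA expects (documentId, termCountVector) pairs.
    val corpus = parsed.zipWithIndex.map(_.swap).cache()
    val ldaModel = new LDA().setK(3).run(corpus)
    println(s"topics: ${ldaModel.topicsMatrix.numCols}, vocab size: ${ldaModel.topicsMatrix.numRows}")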

Re: Creating topology in spark streaming

2015-05-06 Thread anshu shukla
Thanks a lot Juan, that was a great post. One more thing if you can: are there any demos/blogs telling how to configure or create a topology of different types, i.e. how we can decide the pipelining model in Spark as done in Storm for

RE: Map one RDD into two RDD

2015-05-06 Thread Evo Eftimov
RDD1 = RDD.filter() RDD2 = RDD.filter() From: Bill Q [mailto:bill.q@gmail.com] Sent: Tuesday, May 5, 2015 10:42 PM To: user@spark.apache.org Subject: Map one RDD into two RDD Hi all, I have a large RDD that I map a function to it. Based on the nature of each record in the input
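
A minimal Scala sketch of Evo's suggestion; the predicate and the decision to cache the parent RDD are illustrative:

    // Cache the parent so both filters reuse one computation of it.
    val source = sc.textFile("hdfs:///tmp/records").cache()
    val matching    = source.filter(line => line.contains("A"))
    val nonMatching = source.filter(line => !line.contains("A"))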

Partition Case Class RDD without ParRDDFunctions

2015-05-06 Thread Night Wolf
Hi, If I have an RDD[MyClass] and I want to partition it by the hash code of MyClass for performance reasons, is there any way to do this without converting it into a PairRDD RDD[(K,V)] and calling partitionBy??? Mapping it to a tuple2 seems like a waste of space/computation. It looks like the
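
For reference, a hedged sketch of the usual workaround (partitionBy is only defined for pair RDDs, so some keying step is hard to avoid); MyClass and the partition count are illustrative:

    import org.apache.spark.HashPartitioner

    case class MyClass(id: Int, payload: String)

    val rdd = sc.parallelize(Seq(MyClass(1, "a"), MyClass(2, "b")))
    // Key by hashCode, partition, then drop the key again.
    val partitioned = rdd.keyBy(_.hashCode).partitionBy(new HashPartitioner(8)).values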

Re: Creating topology in spark streaming

2015-05-06 Thread anshu shukla
But the main problem is how to increase the level of parallelism for any particular bolt's logic. Suppose I want this type of topology: https://storm.apache.org/documentation/images/topology.png How can we manage it? On Wed, May 6, 2015 at 1:36 PM, ayan guha guha.a...@gmail.com wrote: Every

Re: Partition Case Class RDD without ParRDDFunctions

2015-05-06 Thread ayan guha
What does your MyClass look like? I was experimenting with the Row class in Python and apparently partitionBy automatically takes the first column as the key. However, I am not sure how you can access part of an object without deserializing it (either explicitly or Spark doing it for you) On Wed, May

spark jobs input/output http request

2015-05-06 Thread Saurabh Gupta
I have set up an AWS EMR based cluster, where I am able to run my Spark queries quite ok. The next part of my work is to run the queries coming in from a webclient and show the results there. The Spark queries, as far as I know, I can only run from my EMR, and they don't return instantly with any

Job executed with no data in Spark Streaming.

2015-05-06 Thread secfree
My code: Fun01, Fun02, Fun03 all have transformations and output operations (foreachRDD). 1. Fun01 and Fun03 both executed as expected, which proves the messages are not null or empty. 2. On the Spark application UI, I found Fun02's output stage in Spark stages, which proves it executed. 3. The first line of

Re: Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-06 Thread luohui20001
Update on status after I did some tests. I modified some other parameters and found 2 parameters that may be relevant: spark_worker_instance and spark.sql.shuffle.partitions. Before today I used the default settings of spark_worker_instance and spark.sql.shuffle.partitions, whose values are 1 and 200. At that time,

Re: com.datastax.spark % spark-streaming_2.10 % 1.1.0 in my build.sbt ??

2015-05-06 Thread Akhil Das
I don't see a spark-streaming dependency at com.datastax.spark http://mvnrepository.com/artifact/com.datastax.spark, but it does have a kafka-streaming dependency. Thanks Best Regards On Tue, May 5, 2015 at 12:42 AM, Eric Ho eric...@intel.com wrote: Can I specify this in my build file ?

Re: Creating topology in spark streaming

2015-05-06 Thread ayan guha
Every transformation on a DStream will create another DStream. You may want to take a look at foreachRDD. Also, kindly share your code so people can help better. On 6 May 2015 17:54, anshu shukla anshushuk...@gmail.com wrote: Please help guys, Even After going through all the examples given i

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-06 Thread Saisai Shao
Also Kafka has a Hadoop consumer API for doing such things, please refer to http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi 2015-05-06 12:22 GMT+08:00 MrAsanjar . afsan...@gmail.com: why not try https://github.com/linkedin/camus - camus is kafka to HDFS pipeline On

Re: OOM error with GMMs on 4GB dataset

2015-05-06 Thread Xiangrui Meng
Did you set `--driver-memory` with spark-submit? -Xiangrui On Mon, May 4, 2015 at 5:16 PM, Vinay Muttineni vmuttin...@ebay.com wrote: Hi, I am training a GMM with 10 gaussians on a 4 GB dataset(720,000 * 760). The spark (1.3.1) job is allocated 120 executors with 6GB each and the driver also

Re: Spark Mongodb connection

2015-05-06 Thread Akhil Das
Here's a complete example https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html Thanks Best Regards On Mon, May 4, 2015 at 12:57 PM, Yasemin Kaya godo...@gmail.com wrote: Hi! I am new at Spark and I want to begin Spark with simple wordCount example in Java. But I want to give

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-06 Thread Tristan Blakers
Hi Imran, I had tried setting a really huge kryo buffer size (GB), but it didn’t make any difference. In my data sets, objects are no more than 1KB each, and don’t form a graph, so I don’t think the buffer size should need to be larger than a few MB, except perhaps for reasons of efficiency?

Creating topology in spark streaming

2015-05-06 Thread anshu shukla
Please help guys. Even after going through all the examples given I have not understood how to pass the D-streams from one bolt/logic to another (without writing it to HDFS etc.), just like the emit function in Storm. Suppose I have a topology with 3 bolts (say) *BOLT1 (parse the tweets and emit tweet
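
A minimal Scala sketch of the answers that follow in this thread: a Storm-style bolt chain is just a chain of DStream transformations, with each stage's output DStream feeding the next without going through HDFS (the source and per-stage logic are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("TopologySketch"), Seconds(5))
    val tweets = ssc.socketTextStream("localhost", 9999)       // source ("spout")

    val parsed = tweets.map(_.toLowerCase)                     // "BOLT1": parse
    val words  = parsed.flatMap(_.split("\\s+"))               // "BOLT2": tokenize
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)     // "BOLT3": aggregate
    counts.print()

    ssc.start()
    ssc.awaitTermination()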

Re: Using spark streaming to load data from Kafka to HDFS

2015-05-06 Thread Rendy Bambang Junior
Because using Spark Streaming looks a lot simpler. What's the difference between Camus and Kafka streaming for this case? Why does Camus excel? Rendy On Wed, May 6, 2015 at 2:15 PM, Saisai Shao sai.sai.s...@gmail.com wrote: Also Kafka has a Hadoop consumer API for doing such things, please

RE: Creating topology in spark streaming

2015-05-06 Thread Evo Eftimov
The “abstraction level” of Storm or shall we call it Architecture, is effectively Pipelines of Nodes/Agents – Pipelines is one of the standard Parallel Programming Patterns which you can use on multicore CPUs as well as Distributed Systems – the chaps from Storm simply implemented it as a

Re: FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread twinkle sachdeva
Hi, Can you please share the compression etc. settings you are using? Thanks, Twinkle On Wed, May 6, 2015 at 4:15 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: I'm facing this error in Spark 1.3.1 https://issues.apache.org/jira/browse/SPARK-4105 Anyone knows what's the

Re: Receiver Fault Tolerance

2015-05-06 Thread James King
Many thanks all, your responses have been very helpful. Cheers On Wed, May 6, 2015 at 2:14 PM, ayan guha guha.a...@gmail.com wrote: https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics On Wed, May 6, 2015 at 10:09 PM, James King

No space left on device??

2015-05-06 Thread Yifan LI
Hi, I am running my graphx application on Spark, but it failed since there is an error on one executor node (on which available hdfs space is small): “no space left on device”. I can understand why it happened, because my vertex(-attribute) RDD was becoming bigger and bigger during

Kryo read method never called before reducing

2015-05-06 Thread Florian Hussonnois
Hi, We use Spark to build graphs of events after querying cassandra. We use mapPartitions for both aggregating events and building two graphs per partition. Graphs are returned as Tuple2 as follows: val nodes = events.mapPartitions(part => { var nodeLeft : Node = null var

Receiver Fault Tolerance

2015-05-06 Thread James King
In the O'reilly book Learning Spark Chapter 10 section 24/7 Operation It talks about 'Receiver Fault Tolerance' I'm unsure of what a Receiver is here, from reading it sounds like when you submit an application to the cluster in cluster mode i.e. *--deploy-mode cluster *the driver program will

Re: Receiver Fault Tolerance

2015-05-06 Thread Tathagata Das
Incorrect. The receiver runs in an executor just like any other task. In cluster mode, the driver runs in a worker; however, it launches executors in OTHER workers in the cluster. It's those executors running in other workers that run tasks, and also the receivers. On Wed, May 6, 2015 at

RE: Receiver Fault Tolerance

2015-05-06 Thread Evo Eftimov
This is about the Kafka Receiver IF you are using Spark Streaming. Ps: that book is now behind the curve in quite a few areas since the release of 1.3.1 – read the documentation and forums From: James King [mailto:jakwebin...@gmail.com] Sent: Wednesday, May 6, 2015 1:09 PM To: user

Re: No space left on device??

2015-05-06 Thread Saisai Shao
I think for executor distribution, normally in YARN mode the RM tries its best to evenly distribute containers if you don't explicitly specify the preferred host. For standalone mode, one node normally only has one executor, so executor distribution is not a big problem. The problem of data skew

union eatch streaming window into a static rdd and use the static rdd periodicity

2015-05-06 Thread lisendong
the pseudo code: object myApp { var myStaticRDD: RDD[Int] def main() { ... //init streaming context, and get two DStream (streamA and streamB) from two hdfs path //complex transformation using the two DStream val new_stream = streamA.transformWith(streamB, (a, b, t) => {
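
One hedged way to complete the idea in the pseudo code above: fold each batch into an accumulated RDD inside foreachRDD, and checkpoint it so the union lineage stays bounded (names follow the pseudo code, the rest is illustrative, and sc.setCheckpointDir must be configured):

    import org.apache.spark.rdd.RDD

    var myStaticRDD: RDD[Int] = sc.emptyRDD[Int]

    new_stream.foreachRDD { rdd =>
      myStaticRDD = myStaticRDD.union(rdd).cache()
      myStaticRDD.checkpoint()                               // truncate the growing lineage
      println(s"accumulated count: ${myStaticRDD.count()}")  // use the static RDD periodically
    }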

Re: No space left on device??

2015-05-06 Thread Yifan LI
Thanks, Shao. :-) I am wondering if Spark will rebalance the storage overhead at runtime… since there is still some available space on other nodes. Best, Yifan LI On 06 May 2015, at 14:57, Saisai Shao sai.sai.s...@gmail.com wrote: I think you could configure multiple disks through

Re: How to deal with code that runs before foreach block in Apache Spark?

2015-05-06 Thread Emre Sevinc
Imran, Gerard, Indeed your suggestions were correct and it helped me. Thank you for your replies. -- Emre On Tue, May 5, 2015 at 4:24 PM, Imran Rashid iras...@cloudera.com wrote: Gerard is totally correct -- to expand a little more, I think what you want to do is a

Spark Kryo read method never called before reducing

2015-05-06 Thread amine_901
We use Spark to build graphs of events after querying cassandra. We use mapPartitions for both aggregating events and building two graphs per partition. Graphs are returned as Tuple2 as follows: val nodes = events.mapPartitions(part => { var nodeLeft : Node = null var

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-06 Thread Imran Rashid
oh yeah, I think I remember we discussed this a while back ... sorry I forgot the details. If you know you don't have a graph, did you try setting spark.kryo.referenceTracking to false? I'm also confused on how you could hit this with a few million objects. Are you serializing them one at a
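
For reference, a minimal sketch of the setting Imran mentions (the rest of the conf is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.referenceTracking", "false")  // safe only if objects share no references/cycles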

how to use rdd.countApprox

2015-05-06 Thread Du Li
I have to count RDDs in a Spark Streaming app. When data gets large, count() becomes expensive. Has anybody had experience using countApprox()? How accurate/reliable is it? The documentation is pretty modest. Suppose the timeout parameter is in milliseconds. Can I retrieve the count value by
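
A hedged sketch of how countApprox is typically used (the timeout is in milliseconds; the values are illustrative):

    // Returns a PartialResult: initialValue is whatever estimate exists when the
    // timeout fires; getFinalValue() blocks until the exact count is available.
    val result = rdd.countApprox(timeout = 2000L, confidence = 0.95)
    val estimate = result.initialValue    // BoundedDouble: mean with [low, high] bounds
    println(s"~${estimate.mean} in [${estimate.low}, ${estimate.high}]")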

Re: No space left on device??

2015-05-06 Thread Saisai Shao
I think you could configure multiple disks through spark.local.dir (default is /tmp). Anyway, if your intermediate data is larger than the available disk space, you will still hit this issue. spark.local.dir (default: /tmp): Directory to use for scratch space in Spark, including map output files and RDDs that get
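
A minimal sketch of spreading scratch space over several disks, per the suggestion above (mount points are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("LocalDirSketch")
      // Comma-separated list: shuffle files and spilled blocks rotate across these directories.
      .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")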

Re: No space left on device??

2015-05-06 Thread Yifan LI
Yes, you are right. For now I have to say the workload/executor is distributed evenly…so, like you said, it is difficult to improve the situation. However, have you any idea of how to make a *skew* data/executor distribution? Best, Yifan LI On 06 May 2015, at 15:13, Saisai Shao

Re: Receiver Fault Tolerance

2015-05-06 Thread ayan guha
https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics On Wed, May 6, 2015 at 10:09 PM, James King jakwebin...@gmail.com wrote: In the O'reilly book Learning Spark Chapter 10 section 24/7 Operation It talks about 'Receiver Fault Tolerance' I'm

Re: No space left on device??

2015-05-06 Thread Saisai Shao
I think it depends on your workload and executor distribution, if your workload is evenly distributed without any big data skew, and executors are evenly distributed on each nodes, the storage usage of each node is nearly the same. Spark itself cannot rebalance the storage overhead as you

Re: How to add jars to standalone pyspark program

2015-05-06 Thread mj
Thank you for your response, however, I'm afraid I still can't get it to work. This is my code: jar_path = '/home/mj/apps/spark_jars/spark-csv_2.11-1.0.3.jar' spark_config = SparkConf().setMaster('local').setAppName('data_frame_test').set('spark.jars', jar_path) sc =

[no subject]

2015-05-06 Thread anshu shukla
Exception with sample testing in Intellij IDE: Exception in thread main java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class at akka.util.Collections$EmptyImmutableSeq$.init(Collections.scala:15) at akka.util.Collections$EmptyImmutableSeq$.clinit(Collections.scala) at

RE: DataFrame DSL documentation

2015-05-06 Thread Ulanov, Alexander
+1 I had to browse spark-catalyst sources to find what is supported: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala Alexander From: Gerard Maas [mailto:gerard.m...@gmail.com] Sent: Wednesday, May 06, 2015 11:42 AM To: spark

RE: Reading large files

2015-05-06 Thread Ulanov, Alexander
With binaryRecords it loads the file into an RDD as fixed-length binary records. With binaryFiles it provides an input stream, so it is up to you whether you load everything into memory. Though the documentation does not recommend using this function for large files. From: Vijayasarathy Kannan [mailto:kvi...@vt.edu]

Re: Possible to use hive-config.xml instead of hive-site.xml for HiveContext?

2015-05-06 Thread Michael Armbrust
I don't think that works: https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration On Tue, May 5, 2015 at 6:25 PM, nitinkak001 nitinkak...@gmail.com wrote: I am running hive queries from HiveContext, for which we need a hive-site.xml. Is it possible to replace it with

Re:

2015-05-06 Thread Shixiong Zhu
You are using Scala 2.11 with 2.10 libraries. You can change "org.apache.spark" % "spark-streaming_2.10" % "1.3.1" to "org.apache.spark" %% "spark-streaming" % "1.3.1". And sbt will use the corresponding libraries according to your Scala version. Best Regards, Shixiong Zhu 2015-05-06 16:21 GMT-07:00
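
A hedged build.sbt sketch of the fix: %% appends the project's Scala binary version to the artifact name, so the dependency tracks scalaVersion automatically (the Scala patch version is illustrative):

    scalaVersion := "2.11.6"

    // Resolves to spark-streaming_2.11 because of %%.
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.3.1"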

Re: Error in SparkSQL/Scala IDE

2015-05-06 Thread Michael Armbrust
Hi Iulian, The relevant code is in ScalaReflection https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala, and it would be awesome if you could suggest how to fix this more generally. Specifically, this code is also broken when

How to specify Worker and Master LOG folders?

2015-05-06 Thread Ulanov, Alexander
Hi, How can I specify the Worker and Master LOG folders? If I set SPARK_WORKER_DIR in spark-env, it only affects Executor logs and the shuffling folder. But Worker and Master logs still go to the default: starting org.apache.spark.deploy.master.Master, logging to

Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Jonathan Coveney
Can you check your local and remote logs? 2015-05-06 16:24 GMT-04:00 Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com: This problem happen in Spark 1.3.1. It happen when two jobs are running simultaneously each in its own Spark Context. I don’t remember seeing this bug in Spark

Re: Reading large files

2015-05-06 Thread Vijayasarathy Kannan
Thanks. In both cases, does the driver need to have enough memory to contain the entire file? How do both these functions work when, for example, the binary file is 4G and available driver memory is less? On Wed, May 6, 2015 at 1:54 PM, Ulanov, Alexander alexander.ula...@hp.com wrote:

question about the TFIDF.

2015-05-06 Thread Dan Dong
Hi, All, When I try to follow the document about tfidf from: http://spark.apache.org/docs/latest/mllib-feature-extraction.html val conf = new SparkConf().setAppName("TFIDF") val sc = new SparkContext(conf) val
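
For context, a minimal sketch following the mllib-feature-extraction guide linked above (the input path is illustrative): documents are an RDD[Seq[String]], HashingTF turns each into a term-frequency vector, and IDF rescales them.

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val documents: RDD[Seq[String]] = sc.textFile("hdfs:///tmp/docs.txt").map(_.split(" ").toSeq)

    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    tf.cache()                        // IDF needs two passes: fit, then transform
    val idf = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)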

RE: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Wang, Ningjun (LNG-NPV)
This problem happen in Spark 1.3.1. It happen when two jobs are running simultaneously each in its own Spark Context. I don’t remember seeing this bug in Spark 1.2.0. Is it a new bug introduced in Spark 1.3.1? Ningjun From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, May 06, 2015

Re: Multilabel Classification in spark

2015-05-06 Thread Peter Garbers
Thanks for all your feedback! I'm a little new to scala/spark so hopefully you'll bear with me while I try to explain how I plan to go about this; please give me advice as to why this may or may not work. My terminology may be a little incorrect as well. Any feedback would be greatly appreciated.

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-05-06 Thread Ravi Mody
Whoops I just saw this thread, it got caught in my spam filter. Thanks for looking into this Xiangrui and Sean. The implicit situation does seem fairly complicated to me. The cost function (not including the regularization term) is affected both by the number of ratings and by the number of

spark-shell breaks for scala 2.11 (with yarn)?

2015-05-06 Thread Koert Kuipers
hello all, i built spark 1.3.1 (for cdh 5.3 with yarn) twice: for scala 2.10 and scala 2.11. i am running on a secure cluster. the deployment configs are identical. i can launch jobs just fine on both the scala 2.10 and scala 2.11 versions. spark-shell works on the scala 2.10 version, but not on

Re: How to add jars to standalone pyspark program

2015-05-06 Thread mj
I've worked around this by dropping the jars into a directory (spark_jars) and then creating a spark-defaults.conf file in conf containing this: spark.driver.extraClassPath /home/mj/apps/spark_jars/*

Reading large files

2015-05-06 Thread Vijayasarathy Kannan
Hi, Is there a way to read a large file in a parallel/distributed way? I have a single large binary file which I currently read on the driver program and then distribute to executors (using groupBy(), etc.). I want to know if there's a way to make the executors each read a specific/unique

RE: Reading large files

2015-05-06 Thread Ulanov, Alexander
SparkContext has two methods for reading binary files: binaryFiles (reads multiple binary files into an RDD) and binaryRecords (reads fixed-length records of a single binary file into an RDD). For example, I have a big binary file split into logical parts, so I can use “binaryFiles”. The possible problem
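
A hedged sketch of the two APIs being compared (paths and record length are illustrative):

    // binaryRecords: a flat file of fixed-length records, one Array[Byte] per record.
    val records = sc.binaryRecords("hdfs:///tmp/data.bin", recordLength = 512)

    // binaryFiles: each whole file becomes (path, PortableDataStream); the stream is
    // only opened on the executors, so the driver never holds the file contents.
    val files = sc.binaryFiles("hdfs:///tmp/binary-parts/*")
    val sizes = files.map { case (path, stream) => (path, stream.toArray().length) }.collect()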

RE: Re: Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-06 Thread java8964
It looks like you have data in these 24 partitions, or more. How many unique names are in your data set? Enlarging the shuffle partitions only makes sense if you have large partition groups in your data. What you described looks like either your dataset has data in these 24 partitions, or you have

How update counter in cassandra

2015-05-06 Thread Sergio Jiménez Barrio
I have a counter column family in Cassandra. I want to update these counters from an application in Spark Streaming. How can I update Cassandra counters with Spark? Thanks.

YARN mode startup takes too long (10+ secs)

2015-05-06 Thread Taeyun Kim
Hi, I'm running a spark application with YARN-client or YARN-cluster mode. But it seems to take too long to startup. It takes 10+ seconds to initialize the spark context. Is this normal? Or can it be optimized? The environment is as follows: - Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6)

Re: FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread Jianshi Huang
I'm using the default settings. Jianshi On Wed, May 6, 2015 at 7:05 PM, twinkle sachdeva twinkle.sachd...@gmail.com wrote: Hi, Can you please share your compression etc settings, which you are using. Thanks, Twinkle On Wed, May 6, 2015 at 4:15 PM, Jianshi Huang jianshi.hu...@gmail.com

Re: How update counter in cassandra

2015-05-06 Thread Tathagata Das
This may help. http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala On Wed, May 6, 2015 at 5:35 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: I have a Counter family colums in Cassandra. I want update this counters
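
A heavily hedged sketch, assuming the spark-cassandra-connector and a counter table such as CREATE TABLE ks.page_counts (page text PRIMARY KEY, hits counter); writing to a counter column with saveToCassandra is applied as an increment per row (all names are illustrative):

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.streaming._

    // pages is a DStream[String] of page ids; each batch increments the counters.
    pages.map(p => (p, 1L))
      .reduceByKey(_ + _)
      .saveToCassandra("ks", "page_counts", SomeColumns("page", "hits"))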

Spark 1.3.1 and Parquet Partitions

2015-05-06 Thread vasuki
Spark 1.3.1 - I have a parquet file on hdfs partitioned by some string, looking like this: /dataset/city=London/data.parquet /dataset/city=NewYork/data.parquet /dataset/city=Paris/data.parquet …. I am trying to load it using sqlContext, using sqlContext.parquetFile( hdfs://some
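
A hedged sketch of what usually works in Spark 1.3: point parquetFile at the top-level directory and let partition discovery expose city as a column (the path is illustrative):

    val df = sqlContext.parquetFile("hdfs:///dataset")
    df.printSchema()                               // includes the discovered "city" column
    df.filter(df("city") === "London").show()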

Spark updateStateByKey fails with class leak when using case classes - resend

2015-05-06 Thread rsearle
Apologies for the repeat. The first was rejected by the submission process I created a simple Spark streaming program using updateStateByKey. The domain is represented by case classes for clarity, type safety, etc. Spark job continuously loads new classes, which are removed by GC to maintain a

Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Ted Yu
Which release of Spark are you using ? Thanks On May 6, 2015, at 8:03 AM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I run a job on spark standalone cluster and got the exception below Here is the line of code that cause problem val myRdd: RDD[(String, String,

java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-06 Thread Wang, Ningjun (LNG-NPV)
I run a job on a spark standalone cluster and got the exception below. Here is the line of code that causes the problem: val myRdd: RDD[(String, String, String)] = ... // RDD of (docid, category, path) myRdd.persist(StorageLevel.MEMORY_AND_DISK_SER) val cats: Array[String] = myRdd.map(t =>

How can I force operations to complete and spool to disk

2015-05-06 Thread Steve Lewis
I am performing a job where I perform a number of steps in succession. One step is a map on a JavaRDD which generates objects taking up significant memory. This is followed by a join and an aggregateByKey. The problem is that the system is getting OutOfMemoryErrors - Most tasks work
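
A hedged sketch of one common remedy: persist the expensive mapped RDD with a storage level that spills to disk, so the later join/aggregateByKey re-reads spilled partitions instead of recomputing them or overflowing memory (inputRdd and buildBigObject are hypothetical names):

    import org.apache.spark.storage.StorageLevel

    val mapped = inputRdd.map(buildBigObject)      // buildBigObject is illustrative
      .persist(StorageLevel.MEMORY_AND_DISK_SER)
    mapped.count()                                 // force materialization before the join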

Re: AvroFiles

2015-05-06 Thread ๏̯͡๏
Hello, This is how i read Avro data. import org.apache.avro.generic.GenericData import org.apache.avro.generic.GenericRecord import org.apache.avro.mapred.AvroKey import org.apache.avro.Schema import org.apache.hadoop.io.NullWritable import org.apache.avro.mapreduce.AvroKeyInputFormat -- Read
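
A hedged sketch of the read pattern those imports set up (the path and field name are illustrative):

    val avroRdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]]("hdfs:///tmp/events.avro")

    // Copy fields out of each GenericRecord; toString detaches the value from the reused record.
    val ids = avroRdd.map { case (key, _) => key.datum().get("id").toString }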