Re: Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
Thanks RK. I can turn on speculative execution, but I am trying to find out the actual reason for the delay as it happens on any node. Any idea about the stack trace in my previous mail? Regards, Ajay On Thursday, January 15, 2015 8:02 PM, RK prk...@yahoo.com.INVALID wrote: If you don't want

Re: small error in the docs?

2015-01-15 Thread Sean Owen
Yes that's a typo. The API docs and source code are correct though. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions That and your IDE should show the correct signature. You can open a PR to fix the typo in

Re: Running beyond memory limits in ConnectedComponents

2015-01-15 Thread Nitin kak
Replying to all: Is this overhead memory allocation used for any specific purpose? For example, will it be any different if I do *--executor-memory 22G* with overhead set to 0% (hypothetically) vs *--executor-memory 20G* and overhead memory at the default (9%), which eventually brings the total

Is spark suitable for large scale pagerank, such as 200 million nodes, 2 billion edges?

2015-01-15 Thread txw
Hi, I am running PageRank on a large dataset, which includes 200 million nodes and 2 billion edges. Is Spark suitable for large scale PageRank? How many cores and how much memory do I need, and how long will it take? Thanks, Xuewei Tang

Re: Testing if an RDD is empty?

2015-01-15 Thread Al M
You can also check rdd.partitions.size. This will be 0 for empty RDDs and >0 for RDDs with data. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-tp1678p21170.html Sent from the Apache Spark User List mailing list archive at

Re: spark crashes on second or third call first() on file

2015-01-15 Thread Davies Liu
What's the version of Spark you are using? On Wed, Jan 14, 2015 at 12:00 AM, Linda Terlouw linda.terl...@icris.nl wrote: I'm new to Spark. When I use the Movie Lens dataset 100k (http://grouplens.org/datasets/movielens/), Spark crashes when I run the following code. The first call to

Re: Running beyond memory limits in ConnectedComponents

2015-01-15 Thread Sean Owen
If you give the executor 22GB, it will run with ... -Xmx22g. If the JVM heap gets nearly full, it will almost certainly consume more than 22GB of physical memory, because the JVM needs memory for more than just heap. But in this scenario YARN was only asked for 22GB and it gets killed. This is

Re: Some tasks are taking long time

2015-01-15 Thread Nicos
Ajay, Unless we are dealing with some synchronization/conditional variable bug in Spark, try this per tuning guide: Cache Size Tuning One important configuration parameter for GC is the amount of memory that should be used for caching RDDs. By default, Spark uses 60% of the configured
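A minimal sketch of turning that knob down, assuming the Spark 1.x property name spark.storage.memoryFraction (the 0.4 value is only illustrative):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("cache-tuning-example")            // app name is hypothetical
      .set("spark.storage.memoryFraction", "0.4")    // default is 0.6; lowering it leaves more heap for task execution
    val sc = new org.apache.spark.SparkContext(conf)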

using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-15 Thread BB
Hi all, Any help on the following is very much appreciated. Problem: On a SchemaRDD read from a parquet file (the data within the file uses an AVRO model) using the HiveContext, I can't figure out how to 'select' or use a 'where' clause to filter rows on a field that

Re: dockerized spark executor on mesos?

2015-01-15 Thread Nicholas Chammas
The AMPLab maintains a bunch of Docker files for Spark here: https://github.com/amplab/docker-scripts Hasn't been updated since 1.0.0, but might be a good starting point. On Wed Jan 14 2015 at 12:14:13 PM Josh J joshjd...@gmail.com wrote: We have dockerized Spark Master and worker(s)

Re: Running beyond memory limits in ConnectedComponents

2015-01-15 Thread Sean Owen
This is a YARN setting. It just controls how much any container can reserve, including Spark executors. That is not the problem. You need Spark to ask for more memory from YARN, on top of the memory that is requested by --executor-memory. Your output indicates the default of 7% is too little. For
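For reference, a hedged sketch of asking YARN for more headroom on top of the executor heap (spark.yarn.executor.memoryOverhead is given in MB in this Spark version; the 2048 figure is only illustrative):

    spark-submit --master yarn-cluster \
      --executor-memory 20G \
      --conf spark.yarn.executor.memoryOverhead=2048 \
      --class com.example.ConnectedComponentsJob app.jar    # class and jar names are hypothetical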

Re: SQL JSON array operations

2015-01-15 Thread Ayoub Benali
You could try to use the hive context, which brings HiveQL; it would allow you to query nested structures using LATERAL VIEW explode... On Jan 15, 2015 4:03 PM, jvuillermet jeremy.vuiller...@gmail.com wrote: let's say my json file lines looks like this {user: baz, tags : [foo, bar] }
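A rough sketch of what that could look like against the sample data, assuming a HiveContext (table and file names are taken from the question; the query itself is untested):

    import org.apache.spark.sql.hive.HiveContext
    val hiveCtx = new HiveContext(sc)
    hiveCtx.jsonFile("data.json").registerTempTable("users")
    // explode the tags array into one row per tag, then filter on it
    hiveCtx.sql("SELECT DISTINCT user FROM users LATERAL VIEW explode(tags) t AS tag WHERE tag = 'bar'").collect()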

Re: dockerized spark executor on mesos?

2015-01-15 Thread Tim Chen
Just throwing this out here: there is an existing PR to add docker support for the spark framework to launch executors with a docker image. https://github.com/apache/spark/pull/3074 Hopefully this will be merged sometime. Tim On Thu, Jan 15, 2015 at 9:18 AM, Nicholas Chammas nicholas.cham...@gmail.com

Re: Is spark suitable for large scale pagerank, such as 200 million nodes, 2 billion edges?

2015-01-15 Thread Ted Yu
Have you seen http://search-hadoop.com/m/JW1q5pE3P12 ? Please also take a look at the end-to-end performance graph on http://spark.apache.org/graphx/ Cheers On Thu, Jan 15, 2015 at 9:29 AM, txw t...@outlook.com wrote: Hi, I am run PageRank on a large dataset, which include 200 million

SQL JSON array operations

2015-01-15 Thread jvuillermet
let's say my json file lines look like this: {"user": "baz", "tags": ["foo", "bar"] } sqlContext.jsonFile("data.json") ... How could I query for users with the bar tag using SQL? sqlContext.sql("select user from users where tags ?contains? 'bar'") I could simplify the request and use the returned RDD to

Re: Inserting an element in RDD[String]

2015-01-15 Thread Hafiz Mujadid
thanks On Thu, Jan 15, 2015 at 7:35 PM, Prannoy [via Apache Spark User List] ml-node+s1001560n21163...@n3.nabble.com wrote: Hi, You can take the schema line in another rdd and then do a union of the two rdds. List<String> schemaList = new ArrayList<String>(); schemaList.add("xyz"); // where

Re: Running beyond memory limits in ConnectedComponents

2015-01-15 Thread Sean Owen
Those settings aren't relevant, I think. You're concerned with what your app requests, and what Spark requests of YARN on your behalf. (Of course, you can't request more than what your cluster allows for a YARN container for example, but that doesn't seem to be what is happening here.) You do not

Re: Running beyond memory limits in ConnectedComponents

2015-01-15 Thread Nitin kak
I am sorry for the formatting error, the value for *yarn.scheduler.maximum-allocation-mb = 28G* On Thu, Jan 15, 2015 at 11:31 AM, Nitin kak nitinkak...@gmail.com wrote: Thanks for sticking to this thread. I am guessing what memory my app requests and what Yarn requests on my part should be

Re: Running beyond memory limits in ConnectedComponents

2015-01-15 Thread Nitin kak
Thanks for sticking to this thread. I am guessing what memory my app requests and what Yarn requests on my part should be the same and is determined by the value of *--executor-memory*, which I had set to *20G*. Or can the two values be different? I checked the Yarn configurations (below), so I think

Re: save spark streaming output to single file on hdfs

2015-01-15 Thread Prannoy
Hi, You can use the FileUtil.copyMerge API and specify the path to the folder where saveAsTextFile saved the part text files. Suppose your directory is /a/b/c/, use FileUtil.copyMerge(FileSystem of source, a/b/c, FileSystem of destination, Path to the merged file say (a/b/c.txt), true (to delete the
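A sketch of that call with the paths from the example, assuming the Hadoop 2.x FileUtil.copyMerge signature:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val hadoopConf = new Configuration()
    val fs = FileSystem.get(hadoopConf)
    // merge the part files under /a/b/c into a single /a/b/c.txt, deleting the source directory
    FileUtil.copyMerge(fs, new Path("/a/b/c"), fs, new Path("/a/b/c.txt"), true, hadoopConf, null)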

small error in the docs?

2015-01-15 Thread kirillfish
The cogroup() function seems to return (K, (Iterable<V>, Iterable<W>)), rather than (K, Iterable<V>, Iterable<W>) as stated in the docs (at least for version 1.1.0): https://spark.apache.org/docs/1.1.0/programming-guide.html This

Re: Error when running SparkPi on Secure HA Hadoop cluster

2015-01-15 Thread Marcelo Vanzin
You're specifying the queue in the spark-submit command line: --queue thequeue Are you sure that queue exists? On Thu, Jan 15, 2015 at 11:23 AM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, Setup is as follows Hadoop Cluster 2.3.0 (CDH5.0) - Namenode HA - Resource manager HA -

Re: SQL JSON array operations

2015-01-15 Thread jvuillermet
yeah that's where I ended up. Thanks ! I'll give it a try. On Thu, Jan 15, 2015 at 8:46 PM, Ayoub [via Apache Spark User List] ml-node+s1001560n21172...@n3.nabble.com wrote: You could try to use hive context which bring HiveQL, it would allow you to query nested structures using LATERAL VIEW

Executor parameter doesn't work for Spark-shell on EMR Yarn

2015-01-15 Thread Shuai Zheng
Hi All, I am testing Spark on an EMR cluster. The env is a one-node cluster, r3.8xlarge, with 32 vCores and 244G memory. But the command line I use to start up spark-shell doesn't work. For example: ~/spark/bin/spark-shell --jars /home/hadoop/vrisc-lib/aws-java-sdk-1.9.14/lib/*.jar

Re: SQL JSON array operations

2015-01-15 Thread Ayoub
You could try to use the hive context, which brings HiveQL; it would allow you to query nested structures using LATERAL VIEW explode... see the doc here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView -- View this message in context:

Executor vs Mapper in Hadoop

2015-01-15 Thread Shuai Zheng
Hi All, I am trying to clarify some executor behavior in Spark. Because I come from a Hadoop background, I try to compare it to the Mapper (or Reducer) in Hadoop. 1. Each node can have multiple executors, each running in its own process? This is the same as a mapper process. 2. I thought the

How to force parallel processing of RDD using multiple thread

2015-01-15 Thread Wang, Ningjun (LNG-NPV)
I have a standalone spark cluster with only one node with 4 CPU cores. How can I force spark to do parallel processing of my RDD using multiple threads? For example, I can do the following: spark-submit --master local[4] However I really want to use the cluster as follows: spark-submit --master

Re: Testing if an RDD is empty?

2015-01-15 Thread freedafeng
I think Sampo's thought is to get a function that only tests if an RDD is empty. He does not want to know the size of the RDD, and getting the size of an RDD is expensive for large data sets. I myself saw many times that my app threw exceptions because an empty RDD cannot be saved. This is not

Re: different akka versions and spark

2015-01-15 Thread Koert Kuipers
and to make things even more interesting: The CDH *5.3* version of Spark 1.2 differs from the Apache Spark 1.2 release in using Akka version 2.2.3, the version used by Spark 1.1 and CDH 5.2. Apache Spark 1.2 uses Akka version 2.3.4. so i just compiled a program that uses akka against apache

Re: save spark streaming output to single file on hdfs

2015-01-15 Thread jamborta
thanks for the replies. very useful. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-tp21124p21176.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

spark streaming python files not packaged in assembly jar

2015-01-15 Thread jamborta
Hi all, just discovered that the streaming folder in pyspark is not included in the assembly jar (spark-assembly-1.2.0-hadoop2.3.0.jar), but included in the python folder. Any reason why? Thanks, -- View this message in context:

RE: Executor parameter doesn't work for Spark-shell on EMR Yarn

2015-01-15 Thread Shuai Zheng
I figured out the second question: if I don't pass in the number of partitions for the test data, it will by default assume the max number of executors (although I don't know what this default max is). val lines = sc.parallelize(List(-240990|161327,9051480,0,2,30.48,75,
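For what it's worth, a small sketch of passing the slice count explicitly (the records and the count of 8 are placeholders, not the actual test data):

    // the second argument is numSlices; without it parallelize falls back to the default parallelism
    val lines = sc.parallelize(List("record1", "record2", "record3"), 8)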

RE: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-15 Thread Cheng, Hao
Hi, BB Ideally you can do the query like: select key, value.percent from mytable_data lateral view explode(audiences) f as key, value limit 3; But there is a bug in HiveContext: https://issues.apache.org/jira/browse/SPARK-5237 I am working on it now, hopefully make a patch soon. Cheng

Re: saveAsTextFile

2015-01-15 Thread ankits
I have seen this happen when the RDD contains null values. Essentially, saveAsTextFile calls toString() on the elements of the RDD, so a call to null.toString will result in an NPE. -- View this message in context:
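A minimal sketch of guarding against that, assuming the null elements can simply be dropped (rdd stands for the RDD being saved; the output path is illustrative):

    // drop nulls so toString() is never called on a null element
    val safe = rdd.filter(_ != null)
    safe.saveAsTextFile("hdfs:///tmp/output")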

Anyway to make RDD preserve input directories structures?

2015-01-15 Thread 逸君曹
say there's some logs: s3://log-collections/sys1/20141212/nginx.gz s3://log-collections/sys1/20141213/nginx-part-1.gz s3://log-collections/sys1/20141213/nginx-part-2.gz I have a function that parses the logs for later analysis. I want to parse all the files. So I do this: logs =

Re: Testing if an RDD is empty?

2015-01-15 Thread Tobias Pfeiffer
Hi, On Fri, Jan 16, 2015 at 7:31 AM, freedafeng freedaf...@yahoo.com wrote: I myself saw many times that my app threw out exceptions because an empty RDD cannot be saved. This is not big issue, but annoying. Having a cheap solution testing if an RDD is empty would be nice if there is no such

Re: Executor vs Mapper in Hadoop

2015-01-15 Thread Sean Owen
An executor is specific to a Spark application, just as a mapper is specific to a MapReduce job. So a machine will usually be running many executors, and each is a JVM. A Mapper is single-threaded; an executor can run many tasks (possibly from different jobs within the application) at once. Yes,
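To make the many-tasks-per-executor point concrete, a hedged example of the knobs involved on YARN (all values and names are illustrative):

    # each of the 4 executor JVMs can run up to 8 tasks concurrently
    spark-submit --master yarn-cluster \
      --num-executors 4 \
      --executor-cores 8 \
      --executor-memory 8G \
      --class com.example.MyApp app.jar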

Re: Testing if an RDD is empty?

2015-01-15 Thread Sean Owen
How about checking whether take(1).length == 0? If I read the code correctly, this will only examine the first partition, at least. On Fri, Jan 16, 2015 at 4:12 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri, Jan 16, 2015 at 7:31 AM, freedafeng freedaf...@yahoo.com wrote: I myself
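A tiny sketch of wrapping that check as a helper (a built-in RDD.isEmpty() only appears in a later release, if I remember right):

    import org.apache.spark.rdd.RDD

    // examines at most one element instead of counting the whole RDD
    def isEmptyRDD[T](rdd: RDD[T]): Boolean = rdd.take(1).length == 0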

Re: Accumulators

2015-01-15 Thread Imran Rashid
Your understanding is basically correct. Each task creates its own local accumulator, and just those results get merged together. However, there are some performance limitations to be aware of. First, you need enough memory on the executors to build up whatever those intermediate results are.
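A small sketch of the per-task-then-merged behaviour described above, using the Spark 1.x accumulator API:

    // each task increments its own local copy; the driver sees the merged total
    val acc = sc.accumulator(0, "records seen")
    sc.parallelize(1 to 1000, 4).foreach(_ => acc += 1)
    println(acc.value)   // 1000, assuming no task retries inflate the count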

Re: Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
Thanks Nicos. GC does not contribute much to the execution time of the task. I will debug it further today. Regards, Ajay On Thursday, January 15, 2015 11:55 PM, Nicos n...@hotmail.com wrote: Ajay, Unless we are dealing with some synchronization/conditional variable bug in Spark, try

OutOfMemory error in Spark Core

2015-01-15 Thread Anand Mohan
We have our Analytics App built on Spark 1.1 Core, Parquet, Avro and Spray. We are using Kryo serializer for the Avro objects read from Parquet and we are using our custom Kryo registrator (along the lines of ADAM

Spark SQL Custom Predicate Pushdown

2015-01-15 Thread Corey Nolet
I have document storage services in Accumulo that I'd like to expose to Spark SQL. I am able to push down predicate logic to Accumulo to have it perform only the seeks necessary on each tablet server to grab the results being asked for. I'm interested in using Spark SQL to push those predicates

Re: Low Level Kafka Consumer for Spark

2015-01-15 Thread mykidong
Hi Dibyendu, I am using kafka 0.8.1.1 and spark 1.2.0. After modifying these versions in your pom, I have rebuilt your code. But I have not got any messages from ssc.receiverStream(new KafkaReceiver(_props, i)). I have found that in your code all the messages are retrieved correctly, but

Re: Low Level Kafka Consumer for Spark

2015-01-15 Thread Dibyendu Bhattacharya
Hi Kidong, No, I have not tried with Spark 1.2 yet. I will try this out and let you know how it goes. By the way, was there any change to the Receiver Store method in Spark 1.2? Regards, Dibyendu On Fri, Jan 16, 2015 at 11:25 AM, mykidong mykid...@gmail.com wrote: Hi

Re: OutOfMemory error in Spark Core

2015-01-15 Thread Akhil Das
Did you try increasing the parallelism? Thanks Best Regards On Fri, Jan 16, 2015 at 10:41 AM, Anand Mohan chinn...@gmail.com wrote: We have our Analytics App built on Spark 1.1 Core, Parquet, Avro and Spray. We are using Kryo serializer for the Avro objects read from Parquet and we are using

Re: How to query Spark master for cluster status ?

2015-01-15 Thread Akhil Das
There's a JSON endpoint in the Web UI (running on port 8080): http://masterip:8080/json/ Thanks Best Regards On Thu, Jan 15, 2015 at 6:30 PM, Shing Hing Man mat...@yahoo.com.invalid wrote: Hi, I am using Spark 1.2. The Spark master UI has a status. Is there a web service on the

Re: PySpark saveAsTextFile gzip

2015-01-15 Thread Akhil Das
You can use saveAsNewAPIHadoopFile (http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#saveAsNewAPIHadoopFile). You can use it for compressing your output; here's sample code using the API: https://github.com/ScrapCodes/spark-1/blob/master/python/pyspark/tests.py#L1225

Re: Low Level Kafka Consumer for Spark

2015-01-15 Thread Dibyendu Bhattacharya
Hi Kidong, Just now I tested the Low Level Consumer with Spark 1.2 and I did not see any issue with the Receiver.Store method. It is able to fetch messages from Kafka. Can you cross check the other configurations in your setup like Kafka broker IP, topic name, zk host details, consumer id etc. Dib

RE: Spark SQL Custom Predicate Pushdown

2015-01-15 Thread Cheng, Hao
The Data Source API probably works for this purpose. It supports column pruning and predicate push-down: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala Examples can also be found in the unit tests:
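Roughly, such a relation could look like the skeleton below (trait and method names as I recall them from the Spark 1.2 sources package; the Accumulo-specific parts are stubs, and the class name is hypothetical):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{Filter, PrunedFilteredScan}

    class AccumuloDocsRelation(table: String, @transient val sqlContext: SQLContext)
      extends PrunedFilteredScan {

      // describe the document fields as a StructType (left unimplemented in this sketch)
      override def schema = ???

      override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
        // translate `filters` into Accumulo ranges/iterator settings, scan only the
        // required columns on the tablet servers, and map each result into a Row
        ???
      }
    }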

UserGroupInformation: PriviledgedActionException as

2015-01-15 Thread spraveen
Hi, When I am trying to run a program in a remote spark machine I am getting this below exception : 15/01/16 11:14:39 ERROR UserGroupInformation: PriviledgedActionException as:user1 (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] Exception in

Re: Low Level Kafka Consumer for Spark

2015-01-15 Thread Akhil Das
There was a simple example https://github.com/dibbhatt/kafka-spark-consumer/blob/master/examples/scala/LowLevelKafkaConsumer.scala#L45 which you can run after changing a few lines of configuration. Thanks Best Regards On Fri, Jan 16, 2015 at 12:23 PM, Dibyendu Bhattacharya

MatchError in JsonRDD.toLong

2015-01-15 Thread Tobias Pfeiffer
Hi, I am experiencing a weird error that suddenly popped up in my unit tests. I have a couple of HDFS files in JSON format and my test is basically creating a JsonRDD and then issuing a very simple SQL query over it. This used to work fine, but now suddenly I get: 15:58:49.039 [Executor task

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-15 Thread Nathan McCarthy
Thanks Cheng! Is there any API I can get access to (e.g. ParquetTableScan) which would allow me to load up the low-level/base RDD of just RDD[Row] so I could avoid the defensive copy (maybe losing out on columnar storage etc.). We have parts of our pipeline using SparkSQL/SchemaRDDs and others

Re: Serializability: for vs. while loops

2015-01-15 Thread Aaron Davidson
Scala for-loops are implemented as closures using anonymous inner classes which are instantiated once and invoked many times. This means, though, that the code inside the loop is actually sitting inside a class, which confuses Spark's Closure Cleaner, whose job is to remove unused references from

Visualize Spark Job

2015-01-15 Thread Kuromatsu, Nobuyuki
Hi, I want to visualize tasks and stages in order to analyze spark jobs. I know the necessary metrics are written to spark.eventLog.dir. Does anyone know of a tool like swimlanes in Tez? Regards, Nobuyuki Kuromatsu - To unsubscribe,

Error connecting to localhost:8060: java.net.ConnectException: Connection refused

2015-01-15 Thread Gautam Bajaj
Hi, I'm new to Apache Storm. I'm receiving data at my UDP port 8060, I want to capture it and perform some operations in the real time, for which I'm using Spark Streaming. While the code seems to be correct, I get the following output: https://gist.github.com/d34th4ck3r/0e88896eac864d6d7193

Re: Fast HashSets HashMaps - Spark Collection Utils

2015-01-15 Thread Sean Owen
A recent discussion says these won't be public. However there are many optimized collection libs in Java. My favorite is Koloboke: https://github.com/OpenHFT/Koloboke/wiki/Koloboke:-roll-the-collection-implementation-with-features-you-need Carrot HPPC is good too. The only catch is that the

Re: ScalaReflectionException when using saveAsParquetFile in sbt

2015-01-15 Thread Pierre B
Same problem here... Did you find a solution for this? P. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ScalaReflectionException-when-using-saveAsParquetFile-in-sbt-tp21020p21150.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

spark fault tolerance mechanism

2015-01-15 Thread YANG Fan
Hi, I'm quite interested in how Spark's fault tolerance works and I'd like to ask a question here. According to the paper, there are two kinds of dependencies--the wide dependency and the narrow dependency. My understanding is, if the operations I use are all narrow, then when one machine

kinesis creating stream scala code exception

2015-01-15 Thread Hafiz Mujadid
Hi experts, I want to consume data from a Kinesis stream using Spark Streaming. I am trying to create the Kinesis stream using Scala code. Here is my code: def main(args: Array[String]) { println("Stream creation started") if (create(2)) println("Stream is created successfully")

PySpark saveAsTextFile gzip

2015-01-15 Thread Tom Seddon
Hi, I've searched but can't seem to find a PySpark example. How do I write compressed text file output to S3 using PySpark saveAsTextFile? Thanks, Tom

Re: MissingRequirementError with spark 1.2

2015-01-15 Thread sarsol
I am also getting the same error after the 1.2 upgrade. The application is crashing on this line: rdd.registerTempTable("temp") -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-tp21149p21152.html Sent from the Apache Spark User List

Re: MissingRequirementError with spark

2015-01-15 Thread Pierre B
I found this, which might be useful: https://github.com/deanwampler/spark-workshop/blob/master/project/Build.scala It seems that forking is needed. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MissingRequirementError-with-spark-tp21149p21153.html Sent

Re: Serializability: for vs. while loops

2015-01-15 Thread Tobias Pfeiffer
Aaron, thanks for your mail! On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson ilike...@gmail.com wrote: Scala for-loops are implemented as closures using anonymous inner classes [...] While loops, on the other hand, involve none of this trickery, and everyone is happy. Ah, I was suspecting

Re: RowMatrix multiplication

2015-01-15 Thread Toni Verbeiren
You can always define an RDD transpose function yourself. This is what I use in PySpark to transpose an RDD of numpy vectors. It’s not optimal and the vectors need to fit in memory on the worker nodes. def rddTranspose(rdd): # add an index to the rows and the columns, result in triplet
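For reference, a rough Scala analogue of that triplet trick (assumes equal-length numeric rows and, as noted above, that a single column fits in worker memory):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    def rddTranspose(rdd: RDD[Array[Double]]): RDD[Array[Double]] =
      rdd.zipWithIndex()
        // emit (columnIndex, (rowIndex, value)) triplets
        .flatMap { case (row, rowIdx) =>
          row.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
        }
        // gather each column, then restore the original row order within it
        .groupByKey()
        .sortByKey()
        .map { case (_, col) => col.toSeq.sortBy(_._1).map(_._2).toArray }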

Re: *ByKey aggregations: performance + order

2015-01-15 Thread Sean Owen
I'm interested too and don't know for sure but I do not think this case is optimized this way. However if you know your keys aren't split across partitions and you have small enough partitions you can implement the same grouping with mapPartitions and Scala. On Jan 15, 2015 1:27 AM, Tobias

Spark-Sql not using cluster slaves

2015-01-15 Thread ahmedabdelrahman
I have been using Spark SQL in cluster mode and I am noticing no distribution or parallelization of the query execution. The performance seems to be very slow compared to native spark applications and does not offer any speedup when compared to Hive. I am using Spark 1.1.0 with a cluster of 5

How to query Spark master for cluster status ?

2015-01-15 Thread Shing Hing Man
Hi, I am using Spark 1.2. The Spark master UI has a status. Is there a web service on the Spark master that returns the status of the cluster in JSON? Alternatively, what is the best way to determine if a cluster is up? Thanks in advance for your assistance! Shing

Re: kinesis creating stream scala code exception

2015-01-15 Thread Aniket Bhatnagar
Are you using spark in standalone mode or yarn or mesos? If it's yarn, please mention the hadoop distribution and version. What spark distribution are you using (it seems 1.2.0, but compiled with which hadoop version)? Thanks, Aniket On Thu, Jan 15, 2015, 4:59 PM Hafiz Mujadid

Using native blas with mllib

2015-01-15 Thread lev
Hi, I'm trying to use native BLAS, and I followed all the threads I saw here and I still can't get rid of these warnings: WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS WARN netlib.BLAS: Failed to load implementation from:

Re: Using Spark SQL with multiple (avro) files

2015-01-15 Thread David Jones
I've tried this now. Spark can load multiple avro files from the same directory by passing a path to a directory. However, passing multiple paths separated with commas didn't work. Is there any way to load all avro files in multiple directories using sqlContext.avroFile? On Wed, Jan 14, 2015 at

Re: MissingRequirementError with spark

2015-01-15 Thread sarsol
Added fork := true in the Scala build. Command-line sbt is working fine but the Eclipse Scala IDE is still giving the same error. This was all working fine until Spark 1.1. -- View this message in context:

Inserting an element in RDD[String]

2015-01-15 Thread Hafiz Mujadid
Hi experts! I have an RDD[String] and I want to add a schema line at the beginning of this RDD. I know an RDD is immutable. So is there any way to get a new RDD with one schema line plus the contents of the previous RDD? Thanks -- View this message in context:

Re: How to define SparkContext with Cassandra connection for spark-jobserver?

2015-01-15 Thread abhishek
In the spark-jobserver *bin* folder you will find the *application.conf* file; put: context-settings { spark.cassandra.connection.host = your address } Hope this works -- View this message in context:

Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
Hi, My spark job is taking a long time. I see that some tasks are taking longer for the same amount of data and shuffle read/write. What could be the possible reasons for this? The thread dump sometimes shows that all the tasks in an executor are waiting with the following stack trace - Executor task

Re: saveAsTextFile

2015-01-15 Thread Prannoy
Hi, Before saving the rdd, do a collect on the rdd and print its contents. Probably it's a null value. Thanks. On Sat, Jan 3, 2015 at 5:37 PM, Pankaj Narang [via Apache Spark User List] ml-node+s1001560n20953...@n3.nabble.com wrote: If you can paste the code here I can certainly

Re: Failing jobs runs twice

2015-01-15 Thread Anders Arpteg
Found a setting that seems to fix this problem, but it does not seem to be available until Spark 1.3. See https://issues.apache.org/jira/browse/SPARK-2165 However, glad to see work is being done on the issue. On Tue, Jan 13, 2015 at 8:00 PM, Anders Arpteg arp...@spotify.com wrote: Yes

Error when running SparkPi on Secure HA Hadoop cluster

2015-01-15 Thread Manoj Samel
Hi, Setup is as follows Hadoop Cluster 2.3.0 (CDH5.0) - Namenode HA - Resource manager HA - Secured with Kerberos Spark 1.2 Run SparkPi as follows - conf/spark-defaults.conf has following entries spark.yarn.queue myqueue spark.yarn.access.namenodes hdfs://namespace (remember this is namenode

Re: Inserting an element in RDD[String]

2015-01-15 Thread Aniket Bhatnagar
Sure there is. Create a new RDD just containing the schema line (hint: use sc.parallelize) and then union both the RDDs (the header RDD and data RDD) to get a final desired RDD. On Thu Jan 15 2015 at 19:48:52 Hafiz Mujadid hafizmujadi...@gmail.com wrote: hi experts! I hav an RDD[String] and i
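A minimal Scala sketch of that, where dataRDD stands for the existing RDD[String] and the schema text is purely illustrative:

    // single-element RDD holding the schema line, unioned in front of the data;
    // its partition comes first, so the header leads when the result is saved
    val headerRDD = sc.parallelize(Seq("id,name,age"))
    val withHeader = headerRDD.union(dataRDD)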

Re: Some tasks are taking long time

2015-01-15 Thread RK
If you don't want a few slow tasks to slow down the entire job, you can turn on speculation. Here are the speculation settings from the Spark 1.2.0 configuration documentation (Spark Configuration - Spark Properties).
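For convenience, the relevant properties and what I believe are their 1.2.0 defaults (double-check against the linked documentation):

    spark.speculation             false   # set to true to enable speculative execution
    spark.speculation.interval    100     # how often (ms) to check for speculatable tasks
    spark.speculation.multiplier  1.5     # how many times slower than the median before a task is speculated
    spark.speculation.quantile    0.75    # fraction of tasks that must finish before speculation kicks in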

Re: Inserting an element in RDD[String]

2015-01-15 Thread Prannoy
Hi, You can take the schema line in another rdd and then do a union of the two rdds. List<String> schemaList = new ArrayList<String>(); schemaList.add("xyz"); // where xyz is your schema line JavaRDD<String> schemaRDDString = sc.parallelize(schemaList); // where sc is your sparkcontext JavaRDD<String> newRDDString =

Re: How to force parallel processing of RDD using multiple thread

2015-01-15 Thread Sean Owen
Check the number of partitions in your input. It may be much less than the available parallelism of your small cluster. For example, input that lives in just 1 partition will spawn just 1 task. Beyond that parallelism just happens. You can see the parallelism of each operation in the Spark UI.
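A small sketch of checking and raising that parallelism (the input path and the count of 4 are illustrative):

    val rdd = sc.textFile("hdfs:///data/input.txt")   // hypothetical input
    println(rdd.partitions.size)                      // how many tasks the next stage will get
    val rebalanced = rdd.repartition(4)               // spread the work across the 4 cores
    rebalanced.count()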

Re: Anyway to make RDD preserve input directories structures?

2015-01-15 Thread Sean Owen
Maybe you are saying you already do this, but it's perfectly possible to process as many RDDs as you like in parallel on the driver. That may allow your current approach to eat up as much parallelism as you like. I'm not sure if that's what you are describing with "submit multiple applications" but you

RE: Issue with Parquet on Spark 1.2 and Amazon EMR

2015-01-15 Thread Bozeman, Christopher
Thanks to Aniket’s work there are two new options in the EMR install script for Spark. See https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md The “-a” option can be used to bump the spark-assembly to the front of the classpath. -Christopher From: Aniket Bhatnagar