Re: method newAPIHadoopFile

2015-02-25 Thread patcharee
I tried val pairVarOriRDD = sc.newAPIHadoopFile(path, classOf[NetCDFFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[WRFIndex,WRFVariable]]), classOf[WRFIndex], classOf[WRFVariable], jobConf). The compiler does not

Re: Running multiple threads with same Spark Context

2015-02-25 Thread Yana Kadiyska
I am not sure if your issue is setting the FAIR mode correctly or something else, so let's start with the FAIR mode. Do you see the scheduler mode actually being set to FAIR? I have this line in spark-defaults.conf: spark.scheduler.allocation.file=/spark/conf/fairscheduler.xml Then, when I start my
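A minimal sketch of the FAIR-mode setup under discussion, assuming Spark 1.x APIs (the allocation-file path follows the line quoted above; the pool name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fair-mode-example")
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/spark/conf/fairscheduler.xml")
    val sc = new SparkContext(conf)
    // Each submitting thread can then pick a pool defined in the allocation file:
    sc.setLocalProperty("spark.scheduler.pool", "pool1")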

Spark NullPointerException

2015-02-25 Thread Máté Gulyás
Hi all, I am trying to run a Spark Java application on EMR, but I keep getting NullPointerException from the Application master (spark version on EMR: 1.2). The stacktrace is below. I also tried to run the application on Hortonworks Sandbox (2.2) with spark 1.2, following the blogpost

Re: Broadcast Variable updated from one transformation and used from another

2015-02-25 Thread Yiannis Gkoufas
What I think is happening is that the map operations are executed concurrently, and the map operation in rdd2 gets the initial copy of myObjectBroadcated. Is there a way to apply the transformations sequentially? First materialize rdd1 and then rdd2. Thanks a lot! On 24 February 2015 at 18:49,

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-25 Thread Mukesh Jha
My application runs fine for ~3-4 hours and then hits this issue. On Wed, Feb 25, 2015 at 11:34 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hi Experts, My Spark Job is failing with the below error. From the logs I can see that input-3-1424842351600 was added at 5:32:32 and was never purged

Number of Executors per worker process

2015-02-25 Thread Spico Florin
Hello! I've read the documentation about the Spark architecture, and I have the following questions: 1: How many executors can be on a single worker process (JVM)? 2: Should I think of an executor like a Java thread-pool executor where the pool size is equal to the number of the given cores (set up by the

Re: throughput in the web console?

2015-02-25 Thread Josh J
Let me ask it like this: what would be the easiest way to display the throughput in the web console? Would I need to create a new tab and add the metrics? Any good or simple examples showing how this can be done? On Wed, Feb 25, 2015 at 12:07 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Did you

Re: throughput in the web console?

2015-02-25 Thread Akhil Das
For SparkStreaming applications, there is already a tab called Streaming which displays the basic statistics. Thanks Best Regards On Wed, Feb 25, 2015 at 8:55 PM, Josh J joshjd...@gmail.com wrote: Let me ask like this, what would be the easiest way to display the throughput in the web

Re: throughput in the web console?

2015-02-25 Thread Josh J
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com wrote: For SparkStreaming applications, there is already a tab called Streaming which displays the basic statistics. Would I just need to extend this tab to add the throughput?

Number of parallel tasks

2015-02-25 Thread Akshat Aranya
I have Spark running in standalone mode with 4 executors, each with 5 cores (spark.executor.cores=5). However, when I'm processing an RDD with ~90,000 partitions, I only get 4 parallel tasks. Shouldn't I be getting 4x5=20 parallel task executions?

Re: Broadcast Variable updated from one transformation and used from another

2015-02-25 Thread Imran Rashid
Hi Yiannis, Broadcast variables are meant for *immutable* data. They are not meant for data structures that you intend to update. (It might *happen* to work when running in local mode, though I doubt it, and it would probably be a bug if it did. It will certainly not work when running on a
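A minimal sketch of the intended, read-only broadcast pattern described above (the lookup map and keys are illustrative):

    import org.apache.spark.SparkContext

    def withBroadcast(sc: SparkContext): Array[(String, Option[Int])] = {
      val lookup = Map("a" -> 1, "b" -> 2)      // built once on the driver, never mutated
      val lookupB = sc.broadcast(lookup)        // shipped read-only to the executors
      sc.parallelize(Seq("a", "b", "c"))
        .map(k => k -> lookupB.value.get(k))    // tasks only ever read the value
        .collect()
    }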

Spark Standard Application to Test

2015-02-25 Thread danilopds
Hello, I am preparing some tests to execute in Spark in order to manipulate properties and check the variations in results. For this, I need to use a standard application in my environment, like the well-known apps for Hadoop: TeraSort

Re: throughput in the web console?

2015-02-25 Thread Akhil Das
By throughput you mean number of events processed etc.? [image: Inline image 1] The Streaming tab already has these statistics. Thanks Best Regards On Wed, Feb 25, 2015 at 9:59 PM, Josh J joshjd...@gmail.com wrote: On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

NegativeArraySizeException when doing joins on skewed data

2015-02-25 Thread soila
I have been running into NegativeArraySizeExceptions when doing joins on data with very skewed key distributions in Spark 1.2.0. I found a previous post mentioning that this exception arises when the size of the blocks spilled during the shuffle exceeds 2GB. The post recommended increasing
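A hedged sketch of the mitigation usually suggested in such threads: raise the shuffle partition count so each spilled block stays well under 2GB (the count 2000 and the data are illustrative):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x)

    def skewedJoin(sc: SparkContext) = {
      val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
      val right = sc.parallelize(Seq(1 -> "x", 2 -> "y"))
      // An explicit, larger partition count shrinks each shuffle block:
      left.join(right, 2000)
    }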

Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
Is the threshold valid only for tall skinny matrices? Mine is 6M x 1.5M and I made the sparsity pattern 100:1.5M... we would like to increase the sparsity pattern to 1000:1.5M. I am running 1.1 stable and I get random shuffle failures... maybe the 1.2 sort shuffle will help. I read in Reza's paper that

NullPointerException in ApplicationMaster

2015-02-25 Thread gulyasm
Hi all, I am trying to run a Spark Java application on EMR, but I keep getting NullPointerException from the Application master (spark version on EMR: 1.2). The stacktrace is below. I also tried to run the application on Hortonworks Sandbox (2.2) with spark 1.2, following the blogpost

Re: Help vote for Spark talks at the Hadoop Summit

2015-02-25 Thread Slim Baltagi
Hi all, Here is another Spark talk (a vendor-independent one!) that you might have missed: 'The Future of Apache Hadoop' track: How Spark and Flink are shaping the future of Hadoop? https://hadoopsummit.uservoice.com/forums/283266-the-future-of-apache-hadoop/suggestions/7074410 Regards, Slim

Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-25 Thread Kannan Rajah
Cheng, We tried this setting and it still did not help. This was on Spark 1.2.0. -- Kannan On Mon, Feb 23, 2015 at 6:38 PM, Cheng Lian lian.cs@gmail.com wrote: (Move to user list.) Hi Kannan, You need to set mapred.map.tasks to 1 in hive-site.xml. The reason is this line of code

Re: Help vote for Spark talks at the Hadoop Summit

2015-02-25 Thread Xiangrui Meng
I gave 3 votes to each of the talks. Looking forward to seeing them at the Hadoop Summit :) -Xiangrui On Tue, Feb 24, 2015 at 9:54 PM, Reynold Xin r...@databricks.com wrote: Hi all, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could

Re: Hamburg Apache Spark Meetup

2015-02-25 Thread Petar Zecevic
Please add the Zagreb Meetup group, too. http://www.meetup.com/Apache-Spark-Zagreb-Meetup/ Thanks! On 18.2.2015. 19:46, Johan Beisser wrote: If you could also add the Hamburg Apache Spark Meetup, I'd appreciate it. http://www.meetup.com/Hamburg-Apache-Spark-Meetup/ On Tue, Feb 17, 2015 at

Re: Unable to run hive queries inside spark

2015-02-25 Thread Michael Armbrust
It looks like that is getting interpreted as a local path. Are you missing a core-site.xml file to configure hdfs? On Tue, Feb 24, 2015 at 10:40 PM, kundan kumar iitr.kun...@gmail.com wrote: Hi Denny, yes the user has all the rights to HDFS. I am running all the spark operations with this

Re: throughput in the web console?

2015-02-25 Thread Otis Gospodnetic
Hi Josh, SPM will show you this info. I see you use Kafka, too, whose numerous metrics you can also see in SPM side by side with your Spark metrics. It sounds like trends are what you are after, so I hope this helps. See http://sematext.com/spm Otis On Feb 24, 2015, at 11:59, Josh J

Re: Spark excludes fastutil dependencies we need

2015-02-25 Thread Ted Yu
Interesting. Looking at SparkConf.scala : val configs = Seq( DeprecatedConfig("spark.files.userClassPathFirst", "spark.executor.userClassPathFirst", "1.3"), DeprecatedConfig("spark.yarn.user.classpath.first", null, "1.3", "Use spark.{driver,executor}.userClassPathFirst

Re: Filter data from one RDD based on data from another RDD

2015-02-25 Thread Himanish Kushary
Hello Imran, Thanks for your response. I noticed the intersection and subtract methods for an RDD; do they work based on a hash of all the fields in an RDD record? - Himanish On Thu, Feb 19, 2015 at 6:11 PM, Imran Rashid iras...@cloudera.com wrote: the more scalable alternative is to do a

Re: NullPointerException in ApplicationMaster

2015-02-25 Thread Zhan Zhang
Look at the trace again. It is a very weird error. The SparkSubmit is running on the client side, but YarnClusterSchedulerBackend is supposed to run in the YARN AM. I suspect you are running the cluster in yarn-client mode, but in JavaSparkContext you set "yarn-cluster". As a result, spark

How to pass an org.apache.spark.rdd.RDD to a recursive function

2015-02-25 Thread dritanbleco
Hello, I am trying to pass an org.apache.spark.rdd.RDD table as a parameter to a recursive function. This table should be changed at each step of the recursion and cannot just be a global var. Need help :) Thank you

Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Thanigai Vellore
I have this function in the driver program which collects the results from RDDs (in a stream) into an array and returns it. However, even though the RDDs (in the dstream) have data, the function is returning an empty array... What am I doing wrong? I can print the RDD values inside the foreachRDD call

RE: spark sql: join sql fails after sqlCtx.cacheTable()

2015-02-25 Thread tridib
Using HiveContext solved it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-join-sql-fails-after-sqlCtx-cacheTable-tp16893p21807.html
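For reference, a minimal sketch of the switch described above, assuming a Hive-enabled Spark 1.2 build (table and column names are hypothetical):

    import org.apache.spark.sql.hive.HiveContext

    val sqlCtx = new HiveContext(sc)   // instead of a plain SQLContext
    sqlCtx.cacheTable("orders")        // hypothetical table name
    sqlCtx.sql("SELECT o.id, c.name FROM orders o JOIN customers c ON o.cid = c.id")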

Error when running the terasort branch in a cluster

2015-02-25 Thread Tom
Not sure if this is the place to ask, but I am using the terasort branch of Spark for benchmarking, as found at https://github.com/ehiggs/spark/tree/terasort, and I get the error below when running on two machines (one machine works just fine). When looking at the code, listed below the error

spark standalone with multiple executors in one worker node

2015-02-25 Thread Judy Nash
Hello, Does Spark standalone support running multiple executors on one worker node? It seems YARN has the parameter --num-executors to set the number of executors to deploy, but I cannot find the equivalent parameter in Spark standalone. Thanks, Judy
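In Spark 1.x standalone mode there is no direct --num-executors equivalent; an application gets at most one executor per worker, so the workaround usually discussed on the list is launching several worker processes per machine. A hedged conf/spark-env.sh sketch (values illustrative):

    export SPARK_WORKER_INSTANCES=4   # worker JVMs per machine
    export SPARK_WORKER_CORES=5       # cores offered by each worker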

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Tobias Pfeiffer
Hi, On Thu, Feb 26, 2015 at 11:24 AM, Thanigai Vellore thanigai.vell...@gmail.com wrote: It appears that the function immediately returns even before the foreachRDD stage is executed. Is that possible? Sure, that's exactly what happens. foreachRDD() schedules a computation; it does not
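A minimal sketch of the point above: act on the data inside foreachRDD, which runs once per batch after the context starts, rather than returning a driver-side array that is read before any batch has run. Here `stream`, `ssc`, and `handleBatch` are assumed names:

    stream.foreachRDD { rdd =>
      val batch = rdd.collect()   // executes per batch, after ssc.start()
      handleBatch(batch)          // hypothetical callback: deliver results here
    }
    ssc.start()
    ssc.awaitTermination()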

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Thanigai Vellore
I didn't include the complete driver code, but I do run the streaming context from the main program which calls this function. Again, I can print the RDD elements within the foreachRDD block, but the array that is returned is always empty. It appears that the function immediately returns even before

Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
Hi Reza, With 40 nodes and shuffle space managed by YARN over the HDFS usercache, we could run the similarity job without doing any thresholding... We used hash-based shuffle, and sort-based shuffle will hopefully improve it further... Note that this job was almost 6M x 1.5M. We will go towards 50M x ~3M columns and

Help me understand partitions and parallelism in Spark

2015-02-25 Thread java8964
Hi, Sparkers: I come from the Hadoop MapReduce world and am trying to understand some internals of Spark. From the web and this list, I keep seeing people talk about increasing the parallelism if you get an OOM error. I tried to read the documentation as much as possible to understand the RDD
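A hedged sketch of the parallelism knobs such threads usually point at (Spark 1.x APIs; the count 256 and the path are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.x)

    val conf = new SparkConf().set("spark.default.parallelism", "256") // default shuffle parallelism
    val sc = new SparkContext(conf)
    val lines  = sc.textFile("hdfs:///data", 256)                // minimum input partitions
    val counts = lines.map(l => l -> 1L).reduceByKey(_ + _, 256) // explicit shuffle partitions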

Re: Spark cluster set up on EC2 customization

2015-02-25 Thread Akhil Das
You can easily add a function (say setup_pig) inside the function setup_cluster in this script https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L649 Thanks Best Regards On Thu, Feb 26, 2015 at 7:08 AM, Sameer Tilak ssti...@live.com wrote: Hi, I was looking at the documentation

Re: Standalone spark

2015-02-25 Thread Sean Owen
Spark and Hadoop should be listed as 'provided' dependencies in your Maven or SBT build. But that still makes them available at compile time. On Wed, Feb 25, 2015 at 10:42 PM, boci boci.b...@gmail.com wrote: Hi, I have a little question. I want to develop a spark based application, but spark
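A minimal build.sbt sketch of the 'provided' scoping described above (version numbers illustrative for the 1.2 era):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"    % "1.2.1" % "provided",
      "org.apache.hadoop" % "hadoop-client" % "2.4.0" % "provided"
    )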

upgrade to Spark 1.2.1

2015-02-25 Thread Pat Ferrel
Getting an error that confuses me. Running a largish app on a standalone cluster on my laptop. The app uses a Guava HashBiMap as a broadcast value. With Spark 1.1.0 I simply registered the class and its serializer with Kryo like this:

Re: Standalone spark

2015-02-25 Thread boci
Thanks dude... I think I will pull up a Docker container for integration tests. On Thu, Feb 26, 2015 at 12:22 AM, Sean Owen

Re: upgrade to Spark 1.2.1

2015-02-25 Thread Ted Yu
Could this be caused by Spark using the shaded Guava jar? Cheers On Wed, Feb 25, 2015 at 3:26 PM, Pat Ferrel p...@occamsmachete.com wrote: Getting an error that confuses me. Running a largish app on a standalone cluster on my laptop. The app uses a Guava HashBiMap as a broadcast value. With

RE: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-25 Thread Cheng, Hao
How many reducers did you set for Hive? With a small data set, Hive will run in local mode, which always sets the reducer count to 1. From: Kannan Rajah [mailto:kra...@maprtech.com] Sent: Thursday, February 26, 2015 3:02 AM To: Cheng Lian Cc: user@spark.apache.org Subject: Re: Spark-SQL 1.2.0 sort

Re: Standalone spark

2015-02-25 Thread Sean Owen
Yes, it's been on the books for a while... https://issues.apache.org/jira/browse/SPARK-2356 That one may just always be a known 'gotcha' on Windows; it's kind of a Hadoop gotcha. I don't know that Spark works 100% on Windows, and it isn't tested on Windows. On Wed, Feb 25, 2015 at 11:05 PM, boci

Considering Spark for large data elements

2015-02-25 Thread Rob Sargent
I have an application which might benefit from Spark's distribution/analysis, but I'm worried about the size and structure of my data set. I need to perform several thousand simulations on a rather large data set and I need access to all the generated simulations. The data element is largely

job keeps failing with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

2015-02-25 Thread Darin McBeath
I'm using Spark 1.2 on a standalone cluster on EC2. I have a cluster of 8 r3.8xlarge machines but limit the job to only 128 cores. I have also tried other things, such as setting 4 workers per r3.8xlarge with 67GB each, but this made no difference. The job frequently fails at the end in this step

Spark cluster set up on EC2 customization

2015-02-25 Thread Sameer Tilak
Hi, I was looking at the documentation for deploying a Spark cluster on EC2. http://spark.apache.org/docs/latest/ec2-scripts.html We are using Pig to build the data pipeline and then use MLLib for analytics. I was wondering if someone has any experience including additional tools/services

Re: throughput in the web console?

2015-02-25 Thread Tathagata Das
Yes. # tuples processed in a batch = sum of all the tuples received by all the receivers. In the screenshot, there was a batch with 69.9K records, and there was a batch which took 1 s 473 ms. These two batches may or may not be the same batch. TD On Wed, Feb 25, 2015 at 10:11 AM, Josh J

Scheduler hang?

2015-02-25 Thread Victor Tso-Guillen
I'm getting this really reliably on Spark 1.2.1. Basically I'm in local mode with parallelism at 8. I have 222 tasks and I never seem to get far past 40. Usually in the 20s to 30s it will just hang. The last logging is below, and a screenshot of the UI. 2015-02-25 20:39:55.779 GMT-0800 INFO

Fwd: Spark excludes fastutil dependencies we need

2015-02-25 Thread Jim Kleckner
Forwarding conversation below that didn't make it to the list. -- Forwarded message -- From: Jim Kleckner j...@cloudphysics.com Date: Wed, Feb 25, 2015 at 8:42 PM Subject: Re: Spark excludes fastutil dependencies we need To: Ted Yu yuzhih...@gmail.com Cc: Sean Owen

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-25 Thread anamika gupta
I am now getting the following error. I cross-checked my types and corrected three of them, i.e. r26--String, r27--Timestamp, r28--Timestamp. This error still persists. scala> sc.textFile("/home/cdhuser/Desktop/Sdp_d.csv").map(_.split(",")).map { r => | val upto_time = sdf.parse(r(23).trim); |

Re: NegativeArraySizeException when doing joins on skewed data

2015-02-25 Thread Tristan Blakers
I get the same exception simply by doing a large broadcast of about 6GB. Note that I'm broadcasting a small number (~3m) of fat objects. There's plenty of free RAM. This and related kryo exceptions seem to crop up whenever an object graph of more than a couple of GB gets passed around. at

Re: group by order by fails

2015-02-25 Thread Akhil Das
Which version of Spark are you on? It seems there was a similar Jira: https://issues.apache.org/jira/browse/SPARK-2474 Thanks Best Regards On Thu, Feb 26, 2015 at 12:03 PM, tridib tridib.sama...@live.com wrote: Hi, I need to find the top 10 most selling samples. So the query looks like: select

RE: group by order by fails

2015-02-25 Thread Tridib Samanta
Actually I just realized, I am using 1.2.0. Thanks Tridib Date: Thu, 26 Feb 2015 12:37:06 +0530 Subject: Re: group by order by fails From: ak...@sigmoidanalytics.com To: tridib.sama...@live.com CC: user@spark.apache.org Which version of spark are you having? It seems there was a similar Jira

Re: Fwd: Spark excludes fastutil dependencies we need

2015-02-25 Thread Jim Kleckner
I created an issue and pull request. Discussion can continue there: https://issues.apache.org/jira/browse/SPARK-6029

group by order by fails

2015-02-25 Thread tridib
Hi, I need to find the top 10 most selling samples. So the query looks like: select s.name, count(s.name) from sample s group by s.name order by count(s.name) This query fails with the following error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree: Sort [COUNT(name#0) ASC], true

Re: Scheduler hang?

2015-02-25 Thread Akhil Das
What operation are you trying to do and how big is the data you are operating on? Here are a few things you can try: - Repartition the RDD to a number higher than 222 - Specify the master as local[*] or local[10] - Use the Kryo serializer (.set("spark.serializer",
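The suggestions above as a SparkConf sketch (assuming Spark 1.2; the repartition count is illustrative):

    val conf = new org.apache.spark.SparkConf()
      .setMaster("local[*]")   // use all local cores instead of a fixed 8
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // and on the data side, something like rdd.repartition(500),
    // i.e. a count well above the current 222 tasks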

Re: Spark excludes fastutil dependencies we need

2015-02-25 Thread Jim Kleckner
Inline On Wed, Feb 25, 2015 at 1:53 PM, Ted Yu yuzhih...@gmail.com wrote: Interesting. Looking at SparkConf.scala : val configs = Seq( DeprecatedConfig("spark.files.userClassPathFirst", "spark.executor.userClassPathFirst", "1.3"),

Re: Executor lost with too many temp files

2015-02-25 Thread Raghavendra Pandey
Can you try increasing ulimit -n on your machine? On Mon, Feb 23, 2015 at 10:55 PM, Marius Soutier mps@gmail.com wrote: Hi Sameer, I'm still using Spark 1.1.1, I think the default is hash shuffle. No external shuffle service. We are processing gzipped JSON files, the partitions are

Re: Spark excludes fastutil dependencies we need

2015-02-25 Thread Ted Yu
Maybe drop the exclusion for the parquet-provided profile? Cheers On Wed, Feb 25, 2015 at 8:42 PM, Jim Kleckner j...@cloudphysics.com wrote: Inline On Wed, Feb 25, 2015 at 1:53 PM, Ted Yu yuzhih...@gmail.com wrote: Interesting. Looking at SparkConf.scala : val configs = Seq(

Re: Re: Many Receiver vs. Many threads per Receiver

2015-02-25 Thread Tathagata Das
Spark Streaming has a new Kafka direct stream, to be released as an experimental feature with 1.3. It uses a low-level consumer. Not sure if it satisfies your purpose. If you want more control, it's best to create your own Receiver with the low-level Kafka API. TD On Tue, Feb 24, 2015 at 12:09 AM,

Re: Number of parallel tasks

2015-02-25 Thread Akhil Das
Did you try setting .set("spark.cores.max", "20")? Thanks Best Regards On Wed, Feb 25, 2015 at 10:21 PM, Akshat Aranya aara...@gmail.com wrote: I have Spark running in standalone mode with 4 executors, and each executor with 5 cores (spark.executor.cores=5). However, when I'm processing an
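The setting referred to above, as a sketch (20 = 4 executors x 5 cores in the scenario described):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.cores", "5")
      .set("spark.cores.max", "20")   // total cores the app may claim cluster-wide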

What is the best way to run a Spark job in yarn-cluster mode from a Java program (servlet container) and NOT using the spark-submit command?

2015-02-25 Thread kshekhram
Hello Spark experts, I have tried reading the Spark documentation and searched many posts in this forum, but I couldn't find a satisfactory answer to my question. I have recently started using Spark, so I may be missing something, and that's why I am looking for your guidance here. I have a

Setting up Spark with YARN on EC2 cluster

2015-02-25 Thread Harika
Hi, I want to set up a Spark cluster with YARN on Amazon EC2. I was reading this https://spark.apache.org/docs/1.2.0/running-on-yarn.html document and I understand that Hadoop has to be set up for running Spark with YARN. My questions: 1. Do we have to set up a Hadoop cluster on EC2

Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
Hi, just a quick question about calling persist with the _2 option. Is the 2x replication only useful for fault tolerance, or will it also increase job speed by avoiding network transfers? Assuming I’m doing joins or other shuffle operations. Thanks

Re: Effects of persist(XYZ_2)

2015-02-25 Thread Sean Owen
If you mean, can both copies of the blocks be used for computations? Yes, they can. On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier mps@gmail.com wrote: Hi, just a quick question about calling persist with the _2 option. Is the 2x replication only useful for fault tolerance, or will it

Re: Effects of persist(XYZ_2)

2015-02-25 Thread Marius Soutier
Yes. Effectively, could it avoid network transfers? Or put differently, would an option like persist(MEMORY_ALL) improve job speed by caching an RDD on every worker? On 25.02.2015, at 11:42, Sean Owen so...@cloudera.com wrote: If you mean, can both copies of the blocks be used for
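For reference, the replicated persistence being discussed (MEMORY_ALL above is hypothetical; the real Spark 1.x levels use a _2 suffix), a minimal sketch:

    import org.apache.spark.storage.StorageLevel

    rdd.persist(StorageLevel.MEMORY_ONLY_2)  // two in-memory replicas across the cluster
    // Replication is primarily for fault tolerance; per the reply above,
    // either replica can serve computations.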

Re: No executors allocated on yarn with latest master branch

2015-02-25 Thread Anders Arpteg
We're using the capacity scheduler, to the best of my knowledge. Unsure if multi-resource scheduling is used, but if you know of an easy way to figure that out, then let me know. Thanks, Anders On Sat, Feb 21, 2015 at 12:05 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Are you using the

Re: method newAPIHadoopFile

2015-02-25 Thread patcharee
This is the declaration of my custom InputFormat: public class NetCDFFileInputFormat extends ArrayBasedFileInputFormat public abstract class ArrayBasedFileInputFormat extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat Best, Patcharee On 25. feb. 2015 10:15, patcharee wrote: Hi,

Re: method newAPIHadoopFile

2015-02-25 Thread Sean Owen
OK, from the declaration you sent me separately: public class NetCDFFileInputFormat extends ArrayBasedFileInputFormat public abstract class ArrayBasedFileInputFormat extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat It looks like you do not declare any generic types that
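A sketch of the kind of declaration being pointed at, written here in Scala: binding the Hadoop key/value generics in the superclass lets the types flow through to newAPIHadoopFile without casts. WRFIndex and WRFVariable are the poster's own classes, stubbed here so the sketch is self-contained:

    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    class WRFIndex; class WRFVariable   // stand-ins for the poster's types

    // The generics are declared, not raw, so Scala can infer the RDD's types:
    abstract class ArrayBasedFileInputFormat
      extends FileInputFormat[WRFIndex, WRFVariable]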

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-25 Thread anamika gupta
The link has proved helpful. I have been able to load data, register it as a table and perform simple queries. Thanks Akhil!! Though I still look forward to knowing where I was going wrong with my previous technique of extending the Product interface to overcome the case class limit of 22 fields.

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-25 Thread Akhil Das
It says sdp_d not found; since it is a class, you need to instantiate it once, like: sc.textFile("derby.log").map(_.split(",")).map( r => { val upto_time = sdf.parse(r(23).trim); calendar.setTime(upto_time); val r23 = new java.sql.Timestamp(upto_time.getTime);

Re: Running multiple threads with same Spark Context

2015-02-25 Thread Harika Matha
Hi Yana, I tried running the program after setting the property spark.scheduler.mode to FAIR. But the result is the same as before. Are there any other properties that have to be set? On Tue, Feb 24, 2015 at 10:26 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: It's hard to tell. I have not

method newAPIHadoopFile

2015-02-25 Thread patcharee
Hi, I am new to Spark and Scala. I have a custom InputFormat (used before with MapReduce) and I am trying to use it in Spark. In the Java API (the syntax is correct): JavaPairRDD<WRFIndex, WRFVariable> pairVarOriRDD = sc.newAPIHadoopFile( path, NetCDFFileInputFormat.class,

Re: throughput in the web console?

2015-02-25 Thread Akhil Das
Did you have a look at https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener ? And for Streaming: https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener Thanks Best Regards On Tue, Feb 24,
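A hedged sketch of a StreamingListener that logs per-batch throughput, which is one way to get the numbers asked about in this thread. It assumes BatchInfo exposes numRecords and processingDelay in your Spark version; `ssc` is the StreamingContext:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class ThroughputListener extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        val millis = info.processingDelay.getOrElse(0L)   // batch processing time
        if (millis > 0)
          println(s"throughput: ${info.numRecords * 1000.0 / millis} records/s")
      }
    }
    // ssc.addStreamingListener(new ThroughputListener)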

RE: used cores are less than total no. of cores

2015-02-25 Thread Somnath Pandeya
Thanks Akhil, it was a simple fix which you suggested... I missed it. ☺ From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Wednesday, February 25, 2015 12:48 PM To: Somnath Pandeya Cc: user@spark.apache.org Subject: Re: used cores are less than total no. of cores You can set the following in

Re: spark streaming: stderr does not roll

2015-02-25 Thread Sean Owen
These settings don't control what happens to stderr, right? stderr is up to the process that invoked the driver to control. You may wish to configure log4j to log to files instead. On Wed, Nov 12, 2014 at 8:15 PM, Nguyen, Duc duc.ngu...@pearson.com wrote: I've also tried setting the

Re: Spark excludes fastutil dependencies we need

2015-02-25 Thread Sean Owen
No, we should not add fastutil back. It's up to the app to bring the dependencies it needs, and that's how I understand this issue. The question is really: how to get the classloader visibility right. It depends on where you need these classes. Have you looked into spark.files.userClassPathFirst and

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-25 Thread Petar Zecevic
I believe your class needs to be defined as a case class (as I answered on SO). On 25.2.2015. 5:15, anamika gupta wrote: Hi Akhil, I guess it skipped my attention. I would definitely give it a try. While I would still like to know what the issue is with the way I have created the schema?

How to efficiently control concurrent Spark jobs

2015-02-25 Thread Staffan
Hi, Is there a good (recommended) way to control and run multiple Spark jobs within the same application? My application is as follows: 1) Run one Spark job on a 'full' dataset, which then creates a few thousand RDDs containing sub-datasets of the complete dataset. Each of the
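A hedged sketch of one common answer to this question: SparkContext accepts job submissions from multiple threads, so independent actions can be launched concurrently (optionally combined with the FAIR scheduler discussed earlier in this digest). `subsets` stands in for the sub-dataset RDDs described above:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import org.apache.spark.rdd.RDD

    // Each action becomes its own Spark job; the futures run them concurrently.
    def runAll(subsets: Seq[RDD[Int]]): Seq[Long] = {
      val jobs = subsets.map(rdd => Future { rdd.count() })
      Await.result(Future.sequence(jobs), 1.hour)
    }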