Spark issue with running CrossValidator with RandomForestClassifier on dataset

2015-07-13 Thread shivamverma
Hi I am running Spark 1.4 in Standalone mode on top of Hadoop 2.3 on a CentOS node. I am trying to run grid search on an RF classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes. I get the following error when I try

Duplicated UnusedStubClass in assembly

2015-07-13 Thread Luis Ángel Vicente Sánchez
I have just upgraded to spark 1.4.0 and it seems that spark-streaming-kafka has a dependency on org.spark-project.spark unused 1.0.0 but it also embeds that jar in its artifact, causing a problem while creating a fatjar. This is the error: [Step 1/1] (*:assembly) deduplicate: different file

[MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Hi, for a fairly large dataset (30 MB), KMeansModel.computeCost takes a lot of time (16+ mins). Most of the time is spent in this task: org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33) org.apache.spark.mllib.clustering.KMeansModel.computeCost(KMeansModel.scala:70) Can this be

Re: MovieALS Implicit Error

2015-07-13 Thread Benedict Liang
Hi Sean, Thank you for your quick response. By very little data, do you mean that the matrix is too sparse? Or are there too few data points? There are 3856988 ratings in my dataset currently. Regards, Benedict On Mon, Jul 13, 2015 at 7:07 PM, Sean Owen so...@cloudera.com wrote:

Re: Velox Model Server

2015-07-13 Thread Nick Pentreath
Honestly I don't believe this kind of functionality belongs within spark-jobserver. For serving of factor-type models, you are typically in the realm of recommendations or ad-serving scenarios - i.e. needing to score a user / context against many possible items and return a top-k list of those.

Spark Intro

2015-07-13 Thread vinod kumar
Hi Everyone, I am developing an application which handles bulk data of around millions of records (this may vary per user requirement). As of now I am using MS SQL Server as the back-end and it works fine, but when I perform some operations on large data I get overflow exceptions. I heard about Spark

Re: MovieALS Implicit Error

2015-07-13 Thread Sean Owen
Is the data set synthetic, or has very few items? or is indeed very sparse? those could be reasons. However usually this kind of thing happens with very small data sets. I could be wrong about what's going on, but it's a decent guess at the immediate cause given the error messages. On Mon, Jul

Share RDD from SparkR and another application

2015-07-13 Thread harirajaram
Hello, I would like to share RDD between an application and sparkR. I understand we have job-server and IBM kernel for sharing the context for different applications but not sure how we can use it with sparkR as it is some sort of front end (R shell) with spark. Any insights appreciated. Hari

Stopping StreamingContext before receiver has started

2015-07-13 Thread Juan Rodríguez Hortalá
Hi, I have noticed that when StreamingContext.stop is called when no receiver has started yet, then the context is not really stopped. Watching the logs it looks like a stop signal is sent to 0 receivers, because the receivers have not started yet, and then the receivers are started and the

Re: Data Processing speed SQL Vs SPARK

2015-07-13 Thread Sandeep Giri
Even for 2L (200,000) records, MySQL will be better. Regards, Sandeep Giri, +1-253-397-1945 (US) +91-953-899-8962 (IN) www.KnowBigData.com https://linkedin.com/company/knowbigdata http://knowbigdata.com

MovieALS Implicit Error

2015-07-13 Thread bliang
Hi, I am trying to run the MovieALS example with an implicit dataset and am receiving this error: Got 3856988 ratings from 144250 users on 378937 movies. Training: 3085522, test: 771466. 15/07/13 10:43:07 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 15/07/13

Re: MovieALS Implicit Error

2015-07-13 Thread Sean Owen
I interpret this to mean that the input to the Cholesky decomposition wasn't positive definite. I think this can happen if the input matrix is singular or very near singular -- maybe, very little data? Ben that might at least address why this is happening; different input may work fine. Xiangrui

Re: Data Processing speed SQL Vs SPARK

2015-07-13 Thread Ashish Mukherjee
MySQL and PgSQL scale to millions. Spark or any distributed/clustered computing environment would be inefficient for the kind of data size you mention. That's because of coordination of processes, moving data around etc. On Mon, Jul 13, 2015 at 5:34 PM, Sandeep Giri sand...@knowbigdata.com wrote:

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-13 Thread Ewan Higgs
Konstantinos, Sure, if you have a resource leak then the collector can't free up memory and the process will use more memory. Time to break out the profiler and see where the memory is going. The usual suspects are handles to resources (open file streams, sockets, etc.) kept in containers

RE: Including additional scala libraries in sparkR

2015-07-13 Thread Sun, Rui
Hi, Michal, SparkR comes with a JVM backend that supports Java object instantiation, calling Java instance and static methods from R side. As defined in https://github.com/apache/spark/blob/master/R/pkg/R/backend.R, newJObject() is to create an instance of a Java class; callJMethod() is to call

Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-07-13 Thread Ashish Dutt
Hello Rui Sun, Thanks for your reply. On reading the file README.md, in the section Using SparkR from RStudio it mentions to set .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) Please tell me how I can set this in a Windows environment? What I mean is how to set up

Re: MovieALS Implicit Error

2015-07-13 Thread Benedict Liang
Hi Sean, This user dataset is organic. What do you think is a good ratings threshold then? I am only encountering this with the implicit type though. The explicit type works fine though (though it is not suitable for this dataset). Thank you, Benedict On Mon, Jul 13, 2015 at 7:15 PM, Sean Owen

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Thanks Burak. Now it takes minutes to repartition; the Spark UI shows one active stage: Stage 42, repartition at UnsupervisedSparkModelBuilder.java:120

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Can it be the limited memory causing this slowness? On Tue, Jul 14, 2015 at 9:00 AM, Nirmal Fernando nir...@wso2.com wrote: Thanks Burak. Now it takes minutes to repartition;

RDD checkpoint

2015-07-13 Thread 牛兆捷
The checkpointed RDD is computed twice; why not checkpoint the RDD once it is computed? Is there any special reason for this? -- Regards, Zhaojie

Re: How to speed up Spark process

2015-07-13 Thread Aniruddh Sharma
Hi Deepak Not 100% sure , but please try increasing (--executor-cores ) to twice the number of your physical cores on your machine. Thanks and Regards Aniruddh On Tue, Jul 14, 2015 at 9:49 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Its been 30 minutes and still the partitioner has not

Research ideas using spark

2015-07-13 Thread Shashidhar Rao
Hi, I am doing my PhD thesis on large-scale machine learning, e.g. online learning, batch and mini-batch learning. Could somebody help me with ideas, especially in the context of Spark, applied to the above learning methods? Some ideas like improvements to existing algorithms, implementing new features

Re: How to speed up Spark process

2015-07-13 Thread ๏̯͡๏
I reduced the number of partitions to 1/4 (to 76) in order to reduce the time to 1/4 (from 33 to 8). But the repartition is still running beyond 15 mins. @Nirmal: clicking on details shows the code lines but does not show why it is slow. I know that repartition is slow and want to speed it up

Re: How to speed up Spark process

2015-07-13 Thread Nirmal Fernando
If you press on the +details you could see the code that takes time. Did you already check it? On Tue, Jul 14, 2015 at 9:56 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Job view. Others are fast, but the first one (repartition) is taking 95% of job run time. On Mon, Jul 13, 2015 at 9:23 PM,

Adaptive behavior of Spark at different network transfer rates?

2015-07-13 Thread Niklas Wilcke
Hello, I'm facing a strange behavior regarding a larger data processing pipeline consisting of multiple steps involving Spark core and GraphX. Increasing the network transfer rate in the 5 node cluster from 100 Mbit/s to 1 Gbit/s the runtime also increases from around 15 minutes to 19 Minutes.

Re: Is it possible to change the default port number 7077 for spark?

2015-07-13 Thread Ashish Dutt
Hello Arun, Thank you for the descriptive response. And thank you for providing the sample file too. It certainly is a great help. Sincerely, Ashish On Mon, Jul 13, 2015 at 10:30 PM, Arun Verma arun.verma...@gmail.com wrote: PFA sample file On Mon, Jul 13, 2015 at 7:37 PM, Arun Verma

Re: Is it possible to change the default port number 7077 for spark?

2015-07-13 Thread ashishdutt
Many thanks for your response. Regards, Ashish -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-change-the-default-port-number-7077-for-spark-tp23774p23797.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-07-13 Thread ashishdutt
Hi, Try this: Sys.setenv(SPARK_HOME="C:\\spark-1.4.0") # The path to your spark installation .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) library(SparkR, lib.loc="C:\\spark-1.4.0\\lib") # The path to the lib folder in the spark location library(SparkR)

Re: Spark Standalone Mode not working in a cluster

2015-07-13 Thread Eduardo
Akhil Das: Thanks for your reply. I am using exactly the same installation everywhere. Actually, the spark directory is shared among all nodes, including the place where I start pyspark. So, I believe this is not the problem. Regards, Eduardo On Mon, Jul 13, 2015 at 3:56 AM, Akhil Das

Do SparkSQL support subquery?

2015-07-13 Thread Louis Hust
Hi all, I am using Spark 1.4 and find that some SQL is not supported, especially subqueries, such as subqueries in select items, in the where clause, and in predicate conditions. So I want to know whether Spark supports subqueries, or whether I am using Spark SQL the wrong way. If subqueries are not supported, is there a plan

Re: Data Processing speed SQL Vs SPARK

2015-07-13 Thread ayan guha
I would probably also look at what kind of analytical use case is to be served; for example, unification of streaming, batch and machine learning workloads can be easily achieved in Spark. This is one of the USPs of Spark. But if SQL is the only use case, and the data volume is 1 million records or 100 GB, I think

[SPARK-SQL] Window Functions optimization

2015-07-13 Thread Hao Ren
Hi, I would like to know: Is there any optimization has been done for window functions in Spark SQL? For example. select key, max(value1) over(partition by key) as m1, max(value2) over(partition by key) as m2, max(value3) over(partition by key) as m3 from table The query above creates 3

Re: spark streaming doubt

2015-07-13 Thread Shushant Arora
For the second question I am comparing 2 situations of processing a KafkaRDD. Case I: when I used foreachPartition to process the Kafka stream, I am not able to see any stream job timing interval like "Time: 142905487 ms" displayed on the driver console at the start of each stream batch. But it processed

Re: Do SparkSQL support subquery?

2015-07-13 Thread ayan guha
In Jira, it says in progress https://issues.apache.org/jira/browse/SPARK-4226 On Mon, Jul 13, 2015 at 11:10 PM, Louis Hust louis.h...@gmail.com wrote: Hi, all I am using spark 1.4, and find some sql is not support, especially the subquery, such as subquery in select items, in where clause,

Re: createDirectStream and Stats

2015-07-13 Thread Cody Koeninger
Reading from kafka is always going to be bounded by the number of kafka partitions you have, regardless of what you're using to read it. If most of your time is coming from calculation, not reading, then yes a spark repartition will help. If most of your time is coming just from reading, you
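A minimal Scala sketch of the trade-off described above (the broker address, topic name, batch interval and partition count are illustrative assumptions, not from the thread): the read parallelism is fixed by the Kafka partitions, so repartition only helps when the downstream computation dominates.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream"), Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// read parallelism is capped by the number of Kafka partitions of the topic
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.foreachRDD { rdd =>
  // repartition pays off only when the computation, not the read, is the bottleneck
  val spread = rdd.repartition(32).map { case (_, value) => value.length.toLong }
  println(s"bytes in batch: ${spread.sum()}")
}

ssc.start()
ssc.awaitTermination()
```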

Re: Duplicated UnusedStubClass in assembly

2015-07-13 Thread Cody Koeninger
Yeah, I had brought that up a while back, but didn't get agreement on removing the stub. Seems to be an intermittent problem. You can just add an exclude: mergeStrategy in assembly := { case PathList(org, apache, spark, unused, UnusedStubClass.class) = MergeStrategy.first case x =
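The exclude above, reconstructed as a minimal build.sbt sketch. The syntax assumes an older sbt-assembly release that uses the `mergeStrategy` key; newer plugin versions name it `assemblyMergeStrategy`, so adjust to your plugin version.

```scala
// build.sbt -- a sketch of the merge-strategy exclude for the duplicated stub class
mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case PathList("org", "apache", "spark", "unused", "UnusedStubClass.class") =>
      MergeStrategy.first // keep a single copy of the duplicated stub
    case x => old(x)      // fall back to the default strategy for everything else
  }
}
```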

Re: spark streaming doubt

2015-07-13 Thread Cody Koeninger
Regarding your first question, having more partitions than you do executors usually means you'll have better utilization, because the workload will be distributed more evenly. There's some degree of per-task overhead, but as long as you don't have a huge imbalance between number of tasks and

Re: sparkR

2015-07-13 Thread ashishdutt
Please can you explain how you set this second step in a Windows environment? .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) I mean to ask where do I type this command, at the R prompt or in the command prompt? Thanks for your time. Regards, Ashish -- View this message in

Re: sparkR

2015-07-13 Thread ashishdutt
I had been facing this problem for a long time and this practically forced me to move to pyspark. This is what I tried after reading the posts here: Sys.setenv(SPARK_HOME="C:\\spark-1.4.0") .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) library(SparkR,

Re: Is it possible to change the default port number 7077 for spark?

2015-07-13 Thread Arun Verma
PFA sample file On Mon, Jul 13, 2015 at 7:37 PM, Arun Verma arun.verma...@gmail.com wrote: Hi, Yes it is. To do it follow these steps: 1. cd spark/installation/path/.../conf 2. cp spark-env.sh.template spark-env.sh 3. vi spark-env.sh 4. SPARK_MASTER_PORT=9000 (or any other available port)

Re: How to restrict disk space for spark caches on yarn?

2015-07-13 Thread Peter Rudenko
Hi Andrew, here's what i found. Maybe would be relevant for people with the same issue: 1) There's 3 types of local resources in YARN (public, private, application). More about it here: http://hortonworks.com/blog/management-of-application-dependencies-in-yarn/ 2) Spark cache is of

Re: Is it possible to change the default port number 7077 for spark?

2015-07-13 Thread Arun Verma
Hi, Yes it is. To do it follow these steps: 1. cd spark/installation/path/.../conf 2. cp spark-env.sh.template spark-env.sh 3. vi spark-env.sh 4. SPARK_MASTER_PORT=9000 (or any other available port) PFA sample file. I hope this will help. On Mon, Jul 13, 2015 at 7:24 PM, ashishdutt

Re: SparkSQL 'describe table' tries to look at all records

2015-07-13 Thread Yana Kadiyska
Have you seen https://issues.apache.org/jira/browse/SPARK-6910? I opened https://issues.apache.org/jira/browse/SPARK-6984 which I think is related to this as well. There are a bunch of issues attached to it but basically yes, Spark interactions with a large metastore are bad...very bad if your

Re: How to speed up Spark process

2015-07-13 Thread ๏̯͡๏
It's been 30 minutes and the repartition has still not completed; it seems it could run forever. Without repartition, I see this error https://issues.apache.org/jira/browse/SPARK-5928 FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=

Re: fileStream with old files

2015-07-13 Thread Terry Hole
A new configuration named spark.streaming.minRememberDuration was added in 1.2.1 to control the file stream input; the default value is 60 seconds. You can change this to a larger value to include older files (older than 1 minute). You can get the details from this JIRA:
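A minimal sketch of raising that window so fileStream also picks up files older than 60 seconds (the input path, durations, app name, and the accepted value format for this setting are assumptions; check the format for your Spark version):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("old-files-stream")
  // assumption: a plain number of seconds is accepted here (~1 day)
  .set("spark.streaming.minRememberDuration", "86400")

val ssc = new StreamingContext(conf, Seconds(30))
val lines = ssc.textFileStream("hdfs:///data/incoming") // illustrative path
lines.count().print()
ssc.start()
ssc.awaitTermination()
```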

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Thank you Feynman for the lead. I was able to modify the code using clues from the RegressionMetrics example. Here is what I got now. val deviceAggregateLogs = sc.textFile(logFile).map(DailyDeviceAggregates.parseLogLine).cache() // Calculate statistics based on bytes-transferred val

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
Dimensions mismatch when adding new sample. Expecting 8 but got 14. Make sure all the vectors you are summarizing over have the same dimension. Why would you want to write a MultivariateOnlineSummary object (which can be represented with a couple Double's) into a distributed filesystem like

Upgrade Spark-1.3.0 to Spark-1.4.0 in CDH5.4

2015-07-13 Thread ashishdutt
Hello all, The configuration of my cluster is as follows; # 4-node cluster running CentOS 6.4 # spark-1.3.0 installed on all nodes. I would like to use SparkR shipped with spark-1.4.0. I checked Cloudera and found that the latest release, CDH 5.4, still does not have spark-1.4.0. Forums like

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Hello Feynman, Actually in my case, the vectors I am summarizing over will not have the same dimension since many devices will be inactive on some days. This is at best a sparse matrix where we take only the active days and attempt to fit a moving average over it. The reason I would like to

Re: java.io.InvalidClassException

2015-07-13 Thread Yana Kadiyska
It's a bit hard to tell from the snippets of code but it's likely related to the fact that when you serialize instances the enclosing class, if any, also gets serialized, as well as any other place where fields used in the closure come from...e.g.check this discussion:

Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
I have to do the following tasks on a dataset using Apache Spark with Scala as the programming language: - Read the dataset from HDFS. A few sample lines look like this: deviceid,bytes,eventdate 15590657,246620,20150630 14066921,1907,20150621 14066921,1906,20150626 6522013,2349,20150626

Does Spark driver talk to NameNode directly or Yarn Resource Manager talks to NameNode to know the nodes which has required input blocks and informs Spark Driver ? (for launching Executors on nodes wh

2015-07-13 Thread Elkhan Dadashov
Hi folks, I have a question regarding scheduling of Spark job on Yarn cluster. Let's say there are 5 nodes on Yarn cluster: A,B,C, D, E In Spark job I'll be reading some huge text file (sc.textFile(fileName)) from HDFS and create an RDD. Assume that only nodes A, E contain the blocks of that

Re: How to restrict disk space for spark caches on yarn?

2015-07-13 Thread Sandy Ryza
To clear one thing up: the space taken up by data that Spark caches on disk is not related to YARN's local resource / application cache concept. The latter is a way that YARN provides for distributing bits to worker nodes. The former is just usage of disk by Spark, which happens to be in a local

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
What are the other parameters? Are you just setting k=3? What about # of runs? How many partitions do you have? How many cores does your machine have? Thanks, Burak On Mon, Jul 13, 2015 at 10:57 AM, Nirmal Fernando nir...@wso2.com wrote: Hi Burak, k = 3 dimension = 785 features Spark 1.4

Language support for Spark libraries

2015-07-13 Thread Lincoln Atkinson
I'm still getting acquainted with the Spark ecosystem, and wanted to make sure my understanding of the different API layers is correct. Is this an accurate picture of the major API layers, and their associated client support? Thanks, -Lincoln Spark Core: - Scala - Java -

Re: Spark off heap memory leak on Yarn with Kafka direct stream

2015-07-13 Thread Cody Koeninger
Does the issue only happen when you have no traffic on the topic? Have you profiled to see what's using heap space? On Mon, Jul 13, 2015 at 1:05 PM, Apoorva Sareen apoorva.sar...@gmail.com wrote: Hi, I am running spark streaming 1.4.0 on Yarn (Apache distribution 2.6.0) with java 1.8.0_45

RE: Spark performance

2015-07-13 Thread Mohammed Guller
Good points, Michael. The underlying assumption in my statement is that cost is an issue. If cost is not an issue and the only requirement is to query structured data, then there are several databases such as Teradata, Exadata, and Vertica that can handle 4-6 TB of data and outperform Spark.

Re: Few basic spark questions

2015-07-13 Thread Feynman Liang
Hi Oded, I'm not sure I completely understand your question, but it sounds like you could have the READER receiver produce a DStream which is windowed/processed in Spark Streaming and forEachRDD to do the OUTPUT. However, streaming in SparkR is not currently supported (SPARK-6803

Re: Spark issue with running CrossValidator with RandomForestClassifier on dataset

2015-07-13 Thread Feynman Liang
Can you send the error messages again? I'm not seeing them. On Mon, Jul 13, 2015 at 2:45 AM, shivamverma shivam13ve...@gmail.com wrote: Hi I am running Spark 1.4 in Standalone mode on top of Hadoop 2.3 on a CentOS node. I am trying to run grid search on an RF classifier to classify a small

Re: [SPARK-SQL] Window Functions optimization

2015-07-13 Thread Harish Butani
Just once. You can see this by printing the optimized logical plan. You will see just one repartition operation. So do: val df = sql(your sql...) println(df.queryExecution.analyzed) On Mon, Jul 13, 2015 at 6:37 AM, Hao Ren inv...@gmail.com wrote: Hi, I would like to know: Is there any
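A minimal sketch of that check, using the example query from the thread (window functions in Spark 1.4 require a HiveContext, assumed here as `hiveContext`):

```scala
val df = hiveContext.sql(
  """select key,
    |       max(value1) over (partition by key) as m1,
    |       max(value2) over (partition by key) as m2,
    |       max(value3) over (partition by key) as m3
    |from table""".stripMargin)

println(df.queryExecution.analyzed) // optimized/analyzed logical plan
df.explain()                        // physical plan: look for a single repartition (Exchange)
```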

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
The call to Sorting.quicksort is not working. Perhaps I am calling it the wrong way. allaggregates.toArray allocates and creates a new array separate from allaggregates; it is that new array which Sorting.quickSort sorts in place, while allaggregates itself is left unchanged. Try: val sortedAggregates = allaggregates.toArray
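Spelled out as a minimal sketch (the collection here is an illustrative stand-in for the thread's allaggregates; quickSort needs an element type with an Ordering):

```scala
val allaggregates = Seq(3.0, 1.0, 2.0)          // illustrative stand-in
val sortedAggregates = allaggregates.toArray    // materialize the collection once
scala.util.Sorting.quickSort(sortedAggregates)  // sorts sortedAggregates in place
// use sortedAggregates from here on; allaggregates itself stays unsorted
```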

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
I'm using: org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20); CPU cores: 8 (using default Spark conf though). On partitions, I'm not sure how to find that. On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz brk...@gmail.com wrote: What are the other parameters? Are you just

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
Can you call repartition(8) or 16 on data.rdd(), before KMeans, and also .cache()? Something like (I'm assuming you are using Java): ``` JavaRDD<Vector> input = data.repartition(8).cache(); org.apache.spark.mllib.clustering.KMeans.train(input.rdd(), 3, 20); ``` On Mon, Jul 13, 2015 at 11:10 AM,
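The same suggestion as a minimal, self-contained Scala sketch (the input path, feature parsing and partition count are illustrative assumptions; sc is an existing SparkContext):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val raw: RDD[Array[Double]] = sc.textFile("hdfs:///data/features.csv")
  .map(_.split(',').map(_.toDouble))

// repartition to roughly match your cores and cache, so the K-Means
// iterations and computeCost reuse the same materialized input
val input: RDD[Vector] = raw.map(arr => Vectors.dense(arr)).repartition(8).cache()

val model = KMeans.train(input, 3, 20)  // k = 3, maxIterations = 20
val cost  = model.computeCost(input)
```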

RE: java.io.InvalidClassException

2015-07-13 Thread Saif.A.Ellafi
Thank you, extending Serializable solved the issue. I am left with more questions than answers though :-). Regards, Saif From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Monday, July 13, 2015 2:49 PM To: Ellafi, Saif A. Cc: user@spark.apache.org; Liu, Weicheng Subject: Re:

Re: spark streaming doubt

2015-07-13 Thread Aniruddh Sharma
Hi Sushant/Cody, For question 1 , following is my understanding ( I am not 100% sure and this is only my understanding, I have asked this question in another words to TD for confirmation which is not confirmed as of now). Following is my understanding. In accordance with tasks created in

Re: Unit tests of spark application

2015-07-13 Thread Naveen Madhire
Thanks. Spark-testing-base works pretty well. On Fri, Jul 10, 2015 at 3:23 PM, Burak Yavuz brk...@gmail.com wrote: I can +1 Holden's spark-testing-base package. Burak On Fri, Jul 10, 2015 at 12:23 PM, Holden Karau hol...@pigscanfly.ca wrote: Somewhat biased of course, but you can also

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Nirmal Fernando
Hi Burak, k = 3 dimension = 785 features Spark 1.4 On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz brk...@gmail.com wrote: Hi, How are you running K-Means? What is your k? What is the dimension of your dataset (columns)? Which Spark version are you using? Thanks, Burak On Mon, Jul 13,

Spark off heap memory leak on Yarn with Kafka direct stream

2015-07-13 Thread Apoorva Sareen
Hi, I am running spark streaming 1.4.0 on Yarn (Apache distribution 2.6.0) with java 1.8.0_45 and also Kafka direct stream. I am also using spark with scala 2.11 support. The issue I am seeing is that both driver and executor containers are gradually increasing the physical memory usage till

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
Hi, How are you running K-Means? What is your k? What is the dimension of your dataset (columns)? Which Spark version are you using? Thanks, Burak On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando nir...@wso2.com wrote: Hi, For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot

RE: java.io.InvalidClassException

2015-07-13 Thread Saif.A.Ellafi
Thank you very much for your time, here is how I designed the case classes, as far as I know they apply properly. Ps: By the way, what do you mean by “The programming guide?” abstract class Validator { // positions to access with Row.getInt(x) val shortsale_in_pos = 10 val

Re: java.io.InvalidClassException

2015-07-13 Thread Yana Kadiyska
I would certainly try to mark the Validator class as Serializable...If that doesn't do it you can also try and see if this flag sheds more light: -Dsun.io.serialization.extendedDebugInfo=true By programming guide I mean this: https://spark.apache.org/docs/latest/programming-guide.html I could
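A minimal sketch of that change (the column position and rule below are illustrative, not from the thread): make the whole hierarchy Serializable so the case objects captured in closures can be shipped to executors.

```scala
import org.apache.spark.sql.Row

abstract class Validator extends Serializable {
  def validate(row: Row): Boolean
}

case object ShortSaleValidator extends Validator {
  val shortsale_in_pos = 10 // illustrative position accessed with Row.getInt
  def validate(row: Row): Boolean = row.getInt(shortsale_in_pos) == 1
}

// to see exactly which object fails to serialize, launch the JVM with:
//   -Dsun.io.serialization.extendedDebugInfo=true
```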

Re: Problems after upgrading to spark 1.4.0

2015-07-13 Thread Luis Ángel Vicente Sánchez
' to '/var/log/mcsvc/mesostmpdir/slaves/20150713-133618-421011372-5050-8867-S5/frameworks/20150713-152326-421011372-5050-12921-0002/executors/9/runs/9e44b2ea-c738-4e76-8103-3a85ce752b58/spark-1.4.0-bin-hadoop2.4.tgz' I0713 15:59:50.700959 1327 fetcher.cpp:78] Extracted resource '/var/log/mcsvc

Re: Duplicated UnusedStubClass in assembly

2015-07-13 Thread Luis Ángel Vicente Sánchez
Hi! I was just raising this issue, I already solved it by excluding that transitive dependency. Thanks for your help anyway :) 2015-07-13 14:43 GMT+01:00 Cody Koeninger c...@koeninger.org: Yeah, I had brought that up a while back, but didn't get agreement on removing the stub. Seems to be an

Problems after upgrading to spark 1.4.0

2015-07-13 Thread Luis Ángel Vicente Sánchez
-1.4.0-bin-hadoop2.4.tgz' to '/var/log/mcsvc/mesostmpdir/slaves/20150713-133618-421011372-5050-8867-S5/frameworks/20150713-152326-421011372-5050-12921-0002/executors/9/runs/9e44b2ea-c738-4e76-8103-3a85ce752b58/spark-1.4.0-bin-hadoop2.4.tgz' I0713 15:59:50.700959 1327 fetcher.cpp:78] Extracted

Does Spark Streaming support streaming from a database table?

2015-07-13 Thread unk1102
Hi, I did Kafka streaming through Spark Streaming. I have a use case where I would like to stream data from a database table. I see JdbcRDD is there, but that is not what I am looking for; I need continuous streaming, like JavaSparkStreaming, which continuously runs and listens to changes in a database

Re: Spark on Tomcat has exception IncompatibleClassChangeError: Implementing class

2015-07-13 Thread Yana Kadiyska
Oh, this is very interesting -- can you explain about your dependencies -- I'm running Tomcat 7 and ended up using spark-assembly from WEB_INF/lib and removing the javax/servlet package out of it...but it's a pain in the neck. If I'm reading your first message correctly you use hadoop common and

Re: Problems after upgrading to spark 1.4.0

2015-07-13 Thread Tathagata Das
fetcher.cpp:135] Downloading 'http://s3-eu-west-1.amazonaws.com/int-mesos-data/frameworks/spark/spark-1.4.0-bin-hadoop2.4.tgz' to '/var/log/mcsvc/mesostmpdir/slaves/20150713-133618-421011372-5050-8867-S5/frameworks/20150713-152326-421011372-5050-12921-0002/executors/9/runs/9e44b2ea-c738-4e76

Re: How to make my spark implementation parallel?

2015-07-13 Thread maxdml
If you want to properly exploit the 8 nodes of your cluster, you should use ~2 times that number for partitioning. You can specify the number of partitions when calling parallelize, as follows: JavaRDD<Point> pnts = sc.parallelize(points, 16); -- View this message in context:

HDFS performances + unexpected death of executors.

2015-07-13 Thread maxdml
Hi, I have several issues related to HDFS, that may have different roots. I'm posting as much information as I can, with the hope that I can get your opinion on at least some of them. Basically the cases are: - HDFS classes not found - Connections with some datanode seems to be slow/

Re: How to make my spark implementation parallel?

2015-07-13 Thread maxdml
can you please share your application code? I suspect that you're not making a good use of the cluster by configuring a wrong number of partitions in your RDDs. -- View this message in context:

cache() VS cacheTable()

2015-07-13 Thread Srikanth
Hello, I was reading the Learning Spark book and saw a tip in chapter 9 that read: "In Spark 1.2, the regular cache() method on RDDs also results in a cacheTable()". Is that true? When I cache an RDD and cache the same data as a DataFrame, I see that memory usage for the DataFrame cache is way less than the RDD

java.io.InvalidClassException

2015-07-13 Thread Saif.A.Ellafi
Hi, For an experiment I am doing, I am trying the following. 1. Created an abstract class Validator, with case objects derived from Validator that have a validate(row: Row): Boolean method. 2. Added all case objects to a list. 3. Each validate takes a Row into account and returns itself if validate

Re: Does Spark driver talk to NameNode directly or Yarn Resource Manager talks to NameNode to know the nodes which has required input blocks and informs Spark Driver ? (for launching Executors on node

2015-07-13 Thread Elkhan Dadashov
Thanks Michael for your answer. But YARN today does not manage HDFS. How does the YARN RM get to know the HDFS blocks on each data node? Do you mean the YARN RM contacts the NameNode for HDFS block data on each node, and then decides to launch executors on the nodes which have the required input data blocks

How to set the heap size on consumers?

2015-07-13 Thread dgoldenberg
Hi, I'm seeing quite a bit of information on Spark memory management. I'm just trying to set the heap size, e.g. Xms as 512m and Xmx as 1g or some such. Per http://apache-spark-user-list.1001560.n3.nabble.com/Use-of-SPARK-DAEMON-JAVA-OPTS-tt10479.html#a10529: SPARK_DAEMON_JAVA_OPTS is not

spark task hangs at BinaryClassificationMetrics (InetAddress related)

2015-07-13 Thread Asher Krim
Hey everyone, We are running into an issue where spark jobs will sometimes hang indefinitely. We are on Spark 1.3.1 (working on upgrading soon), Java 8, and using mesos with spark.mesos.coarse=false. I'm fairly certain that the issue comes up when we do shuffle operations. My pipeline reads data

Re: [SPARK-SQL] Window Functions optimization

2015-07-13 Thread Yin Huai
Your query will be partitioned once. Then, a single Window operator will evaluate these three functions. As mentioned by Harish, you can take a look at the plan (sql(your sql...).explain()). On Mon, Jul 13, 2015 at 12:26 PM, Harish Butani rhbutani.sp...@gmail.com wrote: Just once. You can see

MLLIB RDD segmentation for logistic regression

2015-07-13 Thread Saif.A.Ellafi
Hello all, I have one big RDD, in which there is a column of groups A1, A2, B1, B2, B3, C1, D1, ..., XY. Out of it, I am using map() to transform into RDD[LabeledPoint] with dense vectors for later use into Logistic Regression, which takes RDD[LabeledPoint] I would like to run a logistic
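One possible way to do this, as a minimal sketch (the keying of the RDD by group and all names are assumptions): filter the keyed RDD once per group and train a separate model, which is reasonable when the number of groups is small.

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainPerGroup(data: RDD[(String, LabeledPoint)]): Map[String, LogisticRegressionModel] = {
  val groups = data.keys.distinct().collect()
  groups.map { g =>
    val subset = data.filter(_._1 == g).values.cache() // one filtered pass per group
    val model  = new LogisticRegressionWithLBFGS().run(subset)
    subset.unpersist()
    g -> model
  }.toMap
}
```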

Re: Few basic spark questions

2015-07-13 Thread Feynman Liang
Sorry; I think I may have used poor wording. SparkR will let you use R to analyze the data, but it has to be loaded into memory using SparkR (see SparkR DataSources http://people.apache.org/~pwendell/spark-releases/latest/sparkr.html). You will still have to write a Java receiver to store the data

Basic Spark SQL question

2015-07-13 Thread Ron Gonzalez
Hi, I have a question for Spark SQL. Is there a way to be able to use Spark SQL on YARN without having to submit a job? Bottom line here is I want to be able to reduce the latency of running queries as a job. I know that the spark sql default submission is like a job, but was wondering if

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Feynman Liang
A good example is RegressionMetrics https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala#L48's use of MultivariateOnlineSummarizer to aggregate statistics across labels and residuals; take a look at how aggregateByKey is used
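A minimal sketch of that pattern (the (deviceId, features) layout is an assumption): one MultivariateOnlineSummarizer per key, built with aggregateByKey and merged across partitions. Note all vectors folded into one summarizer must have the same dimension.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

def summarizePerDevice(rows: RDD[(Long, Array[Double])])
    : RDD[(Long, MultivariateOnlineSummarizer)] =
  rows
    .mapValues(arr => Vectors.dense(arr))
    .aggregateByKey(new MultivariateOnlineSummarizer())(
      (summary, v)  => summary.add(v),    // fold each vector into the running summary
      (left, right) => left.merge(right)  // combine partial summaries across partitions
    )
// per device, summary.mean, summary.variance, summary.count etc. are then available
```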

Re: Language support for Spark libraries

2015-07-13 Thread Davies Liu
On Mon, Jul 13, 2015 at 11:06 AM, Lincoln Atkinson lat...@microsoft.com wrote: I’m still getting acquainted with the Spark ecosystem, and wanted to make sure my understanding of the different API layers is correct. Is this an accurate picture of the major API layers, and their associated

Re: Spark off heap memory leak on Yarn with Kafka direct stream

2015-07-13 Thread Apoorva Sareen
It happens irrespective of whether there is traffic or no traffic on the kafka topic. Also, there is no clue i could see in the heap space. The heap looks healthy and stable. Its something off heap which is constantly growing. I also checked the JNI reference count from the dumps which appear

HiveThriftServer2.startWithContext error with registerTempTable

2015-07-13 Thread Srikanth
Hello, I want to expose result of Spark computation to external tools. I plan to do this with Thrift server JDBC interface by registering result Dataframe as temp table. I wrote a sample program in spark-shell to test this. val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) import
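A minimal end-to-end sketch of that setup as it might look in spark-shell (the input path and table name are illustrative; sc is the shell's SparkContext): register the result and start the Thrift server against the same HiveContext so JDBC clients can query the temp table.

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
val results = hiveContext.read.parquet("/tmp/spark-results.parquet") // assumed input
results.registerTempTable("results")
HiveThriftServer2.startWithContext(hiveContext)
// JDBC/ODBC clients connecting to this Thrift server can now run:
//   SELECT * FROM results
```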

Re: Basic Spark SQL question

2015-07-13 Thread Jerrick Hoang
Well for adhoc queries you can use the CLI On Mon, Jul 13, 2015 at 5:34 PM, Ron Gonzalez zlgonza...@yahoo.com.invalid wrote: Hi, I have a question for Spark SQL. Is there a way to be able to use Spark SQL on YARN without having to submit a job? Bottom line here is I want to be able to

hive-site.xml spark1.3

2015-07-13 Thread Jerrick Hoang
Hi all, I'm having conf/hive-site.xml pointing to my Hive metastore but sparksql CLI doesn't pick it up. (copying the same conf/ files to spark1.4 and 1.2 works fine). Just wondering if someone has seen this before, Thanks

Re: Finding moving average using Spark and Scala

2015-07-13 Thread Anupam Bagchi
Thank you Feynman for your response. Since I am very new to Scala I may need a bit more hand-holding at this stage. I have been able to incorporate your suggestion about sorting - and it now works perfectly. Thanks again for that. I tried to use your suggestion of using

Re: Basic Spark SQL question

2015-07-13 Thread Michael Armbrust
I'd look at the JDBC server (a long-running YARN job you can submit queries to) https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server On Mon, Jul 13, 2015 at 6:31 PM, Jerrick Hoang jerrickho...@gmail.com wrote: Well for adhoc queries you can use

Re: Caching in spark

2015-07-13 Thread Akhil Das
There was a discussion on that earlier; let me re-post it for you. For the code val df = sqlContext.parquetFile(path), df remains columnar (actually it just reads from the columnar Parquet file on disk). For the code val cdf = df.cache(), cdf is
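A minimal sketch for comparing the two paths on Spark 1.2+ (the Parquet path and table name are illustrative; sqlContext is the shell's SQLContext); both end up in the in-memory columnar cache.

```scala
val df = sqlContext.parquetFile("/data/events.parquet")

val cdf = df.cache()               // DataFrame cache: in-memory columnar format

df.registerTempTable("events")
sqlContext.cacheTable("events")    // same columnar cache, addressed by table name

// materialize each and compare sizes under the Storage tab of the UI
cdf.count()
sqlContext.sql("select count(*) from events").collect()
```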

Re: Few basic spark questions

2015-07-13 Thread Oded Maimon
any help / idea will be appreciated :) thanks Regards, Oded Maimon Scene53. On Sun, Jul 12, 2015 at 4:49 PM, Oded Maimon o...@scene53.com wrote: Hi All, we are evaluating spark for real-time analytic. what we are trying to do is the following: - READER APP- use custom receiver to get

Re: javaRDD.saveasTextfile saves each line enclosed by square brackets

2015-07-13 Thread dineh210
Hi, Please can anyone help me on this post? It seems to be a show stopper for our current project. Thanks in advance. Regards, Dinesh -- View this message in context:
