Re: high GC in the Kmeans algorithm

2015-02-17 Thread lihu
Thanks for your answer. Yes, I cached the data; I can observe from the WebUI that all the data is cached in memory. What I worry about is the dimension, not the total size. Sean Owen once answered me that Broadcast supports a maximum array size of 2GB, so 10^7 is a little huge? On

Re: Lost task - connection closed

2015-02-17 Thread Tianshuo Deng
Hi, Thanks for the response. I discovered that my problem was that some of the executors got OOM; tracing down the executor logs helped discover the problem. Usually the driver log does not reflect the OOM error and therefore causes confusion among users. These are just the discoveries on

Re: How do you get the partitioner for an RDD in Java?

2015-02-17 Thread Darin McBeath
Thanks Imran. That's exactly what I needed to know. Darin. From: Imran Rashid iras...@cloudera.com To: Darin McBeath ddmcbe...@yahoo.com Cc: User user@spark.apache.org Sent: Tuesday, February 17, 2015 8:35 PM Subject: Re: How do you get the partitioner for

RangePartitioner in Spark 1.2.1

2015-02-17 Thread java8964
Hi, Sparkers: I just happened to search in google for something related to the RangePartitioner of spark, and found an old thread in this email list as here: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html I followed the code example mentioned in that email thread

JsonRDD to parquet -- data loss

2015-02-17 Thread Vasu C
Hi, I am running a Spark batch processing job using the spark-submit command. And below is my code snippet. Basically converting JsonRDD to parquet and storing it in an HDFS location. The problem I am facing is that if multiple jobs are triggered in parallel, even though the job executes properly (as I can see

Berlin Apache Spark Meetup

2015-02-17 Thread Ralph Bergmann | the4thFloor.eu
Hi, there is a small Spark Meetup group in Berlin, Germany :-) http://www.meetup.com/Berlin-Apache-Spark-Meetup/ Please add this group to the Meetups list at https://spark.apache.org/community.html Ralph - To unsubscribe,

Re: Berlin Apache Spark Meetup

2015-02-17 Thread Matei Zaharia
Thanks! I've added you. Matei On Feb 17, 2015, at 4:06 PM, Ralph Bergmann | the4thFloor.eu ra...@the4thfloor.eu wrote: Hi, there is a small Spark Meetup group in Berlin, Germany :-) http://www.meetup.com/Berlin-Apache-Spark-Meetup/ Please add this group to the Meetups list at

Re: MapValues and Shuffle Reads

2015-02-17 Thread Imran Rashid
Hi Darin, You are asking for something near and dear to me: https://issues.apache.org/jira/browse/SPARK-1061 There is a PR attached there as well. Note that you could do everything in that PR in your own user code, you don't need to wait for it to get merged, *except* for the change to HadoopRDD

Re: How do you get the partitioner for an RDD in Java?

2015-02-17 Thread Imran Rashid
A JavaRDD is just a wrapper around a normal RDD defined in Scala, which is stored in the rdd field. You can access everything that way. The JavaRDD wrappers just provide some interfaces that are a bit easier to work with in Java. If this is at all convincing, here's me demonstrating it inside
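
A minimal sketch of what Imran describes, reaching the underlying Scala RDD from the Java wrapper inside the spark-shell (the small example RDD is made up for illustration):

    import org.apache.spark.SparkContext._
    import org.apache.spark.HashPartitioner
    import org.apache.spark.api.java.JavaPairRDD

    // Build a hash-partitioned pair RDD and wrap it the way the Java API does.
    val pairs = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))
    val javaRdd = JavaPairRDD.fromRDD(pairs)

    // The wrapped Scala RDD is stored in the `rdd` field and exposes
    // `partitioner: Option[Partitioner]`.
    println(javaRdd.rdd.partitioner)   // Some(org.apache.spark.HashPartitioner@...)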

Spark Streaming output cannot be used as input?

2015-02-17 Thread Jose Fernandez
Hello folks, Our intended use case is: - Spark Streaming app #1 reads from RabbitMQ and outputs to HDFS - Spark Streaming app #2 reads #1's output and stores the data into Elasticsearch The idea behind this architecture is that if Elasticsearch is down due to an upgrade or

spark-core in a servlet

2015-02-17 Thread Ralph Bergmann | the4thFloor.eu
Hi, I want to use spark-core inside of an HttpServlet. I use Maven for the build task but I have a dependency problem :-( I get this error message: ClassCastException: com.sun.jersey.server.impl.container.servlet.JerseyServletContainerInitializer cannot be cast to

Re: JsonRDD to parquet -- data loss

2015-02-17 Thread Arush Kharbanda
I am not sure if this is the easiest way to solve your problem, but you can connect to the Hive metastore (through Derby) and find the HDFS path from there. On Wed, Feb 18, 2015 at 9:31 AM, Vasu C vasuc.bigd...@gmail.com wrote: Hi, I am running spark batch processing job using spark-submit

Re: OutOfMemory and GC limits (TODO) Error in map after self-join

2015-02-17 Thread Tom Walwyn
Thanks for the reply, I'll try your suggestions. Apologies, in my previous post I was mistaken. rdd is actually a PairRDD of (Int, Int). I'm doing the self-join so I can count two things. First, I can count the number of times a value appears in the data set. Second, I can count the number of times

Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-17 Thread dgoldenberg
I'm getting the below error when running spark-submit on my class. This class has a transitive dependency on HttpClient v4.3.1 since I'm calling SolrJ 4.10.3 from within the class. This is in conflict with the older version, HttpClient 3.1, which is a dependency of Hadoop 2.4 (I'm running Spark

OutOfMemory and GC limits (TODO) Error in map after self-join

2015-02-17 Thread Tom Walwyn
Hi All, I'm a new Spark (and Hadoop) user and I want to find out if the cluster resources I am using are feasible for my use-case. The following is a snippet of code that is causing an OOM exception in the executor after about 125/1000 tasks during the map stage. val rdd2 = rdd.join(rdd,

RE: Percentile example

2015-02-17 Thread SiMaYunRui
Thanks Imran for very detailed explanations and options. I think for now T-Digest is what I want. From: iras...@cloudera.com Date: Tue, 17 Feb 2015 08:39:48 -0600 Subject: Re: Percentile example To: myl...@hotmail.com CC: user@spark.apache.org (trying to repost to the list w/out URLs --

Re: OutOfMemory and GC limits (TODO) Error in map after self-join

2015-02-17 Thread Akhil Das
Why are you joining the rdd with itself? You can try these things: - Change the StorageLevel of both rdds to MEMORY_AND_DISK_2 or MEMORY_AND_DISK_SER, so that it doesn't need to keep everything up in memory. - Set your default serializer to Kryo (.set("spark.serializer",
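
A rough sketch of the two suggestions above, assuming a hypothetical (Int, Int) pair RDD like the one in the original post:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Switch the default serializer to Kryo before creating the context.
    val conf = new SparkConf()
      .setAppName("self-join-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Stand-in for the poster's data.
    val rdd = sc.parallelize(Seq((1, 2), (1, 3), (2, 4)))

    // Serialized storage that can spill to disk instead of keeping raw objects in memory.
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)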

Re: spark-core in a servlet

2015-02-17 Thread Arush Kharbanda
I am not sure if this could be causing the issue but spark is compatible with scala 2.10. Instead of spark-core_2.11 you might want to try spark-core_2.10 On Wed, Feb 18, 2015 at 5:44 AM, Ralph Bergmann | the4thFloor.eu ra...@the4thfloor.eu wrote: Hi, I want to use spark-core inside of a

RE: Percentile example

2015-02-17 Thread SiMaYunRui
Thanks Kohler, that's a very interesting approach. I have never used Spark SQL and am not sure whether my cluster is configured well for it, but I will definitely give it a try. From: c.koh...@elsevier.com To: myl...@hotmail.com; user@spark.apache.org Subject: Re: Percentile example Date: Tue, 17 Feb 2015

Re: How to pass parameters to a spark-jobserver Scala class?

2015-02-17 Thread Vasu C
Hi Sasi, To pass parameters to spark-jobserver, use curl -d "input.string = a b c a b see" and in the job server class use config.getString("input.string"). You can pass multiple parameters like starttime, endtime, etc. and use config.getString() to get the values. The examples are shown here
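
A sketch of how such a job might read the posted parameters, assuming the SparkJob trait shipped with spark-jobserver releases from that period (the exact trait and validation types may differ by version):

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

    object WordCountJob extends SparkJob {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

      override def runJob(sc: SparkContext, config: Config): Any = {
        // Anything sent in the request body can be read the same way,
        // e.g. config.getString("starttime") or config.getString("endtime").
        val input = config.getString("input.string")
        sc.parallelize(input.split(" ").toSeq).countByValue()
      }
    }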

Re: RangePartitioner in Spark 1.2.1

2015-02-17 Thread Aaron Davidson
RangePartitioner does not actually provide a guarantee that all partitions will be equal sized (that is hard), and instead uses sampling to approximate equal buckets. Thus, it is possible that a bucket is left empty. If you want the specified behavior, you should define your own partitioner. It
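
A minimal sketch of a custom partitioner, assuming integer keys with a known upper bound: it maps each key into one of numParts fixed-width ranges, so no bucket is left empty by sampling.

    import org.apache.spark.Partitioner

    class FixedRangePartitioner(numParts: Int, maxKey: Int) extends Partitioner {
      override def numPartitions: Int = numParts
      override def getPartition(key: Any): Int = {
        val k = key.asInstanceOf[Int]
        val width = maxKey / numParts + 1
        math.min(k / width, numParts - 1)   // clamp keys above maxKey into the last bucket
      }
    }

    // Usage (hypothetical): rdd.partitionBy(new FixedRangePartitioner(8, 1000000))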

Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-17 Thread Arush Kharbanda
Hi, Did you try to make Maven pick the latest version? http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Management That way SolrJ won't cause any issues; you can try this and check whether the part of your code where you access HDFS works fine. On Wed,

Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Akhil Das
It would be good if you can share the piece of code that you are using, so people can suggest how to optimize it further and things like that. Also, since you have 20GB of memory and ~30GB of data, you can try doing a rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) or

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-17 Thread Charles Feduke
Emre, As you are keeping the properties file external to the JAR you need to make sure to submit the properties file as an additional --files (or whatever the necessary CLI switch is) so all the executors get a copy of the file along with the JAR. If you know you are going to just put the
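
For reference, a small sketch of reading such a properties file on the executors once it has been shipped with --files; the file name matches the one mentioned later in this thread, and the property key is a placeholder:

    import java.io.FileInputStream
    import java.util.Properties
    import org.apache.spark.SparkFiles

    // After `spark-submit --files /home/emre/data/myModule.properties ...`,
    // each executor can resolve its local copy of the file by name.
    val props = new Properties()
    props.load(new FileInputStream(SparkFiles.get("myModule.properties")))
    val someSetting = props.getProperty("some.key")   // "some.key" is a placeholder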

Re: Percentile example

2015-02-17 Thread Imran Rashid
(trying to repost to the list w/out URLs -- rejected as spam earlier) Hi, Using take() is not a good idea; as you have noted, it will pull a lot of data down to the driver, so it's not scalable. Here are some more scalable alternatives: 1. Approximate solutions 1a. Sample the data. Just sample
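
A sketch of option 1a (sampling), estimating the p-th percentile from a small sample that fits on the driver instead of collecting the whole RDD; the sampling fraction is an arbitrary example value:

    import org.apache.spark.rdd.RDD

    def approxPercentile(data: RDD[Double], p: Double, fraction: Double = 0.01): Double = {
      // Sort only the sampled values on the driver and index into them.
      val sample = data.sample(withReplacement = false, fraction).collect().sorted
      sample((p * (sample.length - 1)).toInt)
    }

    // e.g. approxPercentile(values, 0.95)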

Processing graphs

2015-02-17 Thread Vijayasarathy Kannan
Hi, I am working on a Spark application that processes graphs and I am trying to do the following. - group the vertices (key: vertex, value: set of its outgoing edges) - distribute each key to separate processes and process them (like a mapper) - reduce the results back at the main process Does

Re: Tuning number of partitions per CPU

2015-02-17 Thread Sean Owen
More tasks means a little *more* total CPU time is required, not less, because of the overhead of handling tasks. However, more tasks can actually mean less wall-clock time. This is because tasks vary in how long they take. If you have 1 task per core, the job takes as long as the slowest task

RE: SparkSQL + Tableau Connector

2015-02-17 Thread Andrew Lee
Hi Todd, When I see /data/json appears in your log, I have a feeling that that is the default hive.metastore.warehouse.dir from hive-site.xml where the value is /data/. Could you check that property and see if you can point that to the correct Hive table HDFS directory? Another thing to look

RE: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-17 Thread Andrew Lee
Hi All, Just want to give everyone an update of what worked for me. Thanks to Cheng for his comment and to other people for their help. So what I misunderstood was the --driver-class-path and how that was related to --files. I put /etc/hive/hive-site.xml in both --files and --driver-class-path when I

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-17 Thread Gerard Maas
+1 for TypeSafe config Our practice is to include all spark properties under a 'spark' entry in the config file alongside job-specific configuration: A config file would look like: spark { master = cleaner.ttl = 123456 ... } job { context { src = foo action =
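
A sketch of how such a file can be consumed, assuming the spark/job layout above: every key under the spark section is copied into the SparkConf, and the job section is kept for application settings.

    import com.typesafe.config.ConfigFactory
    import org.apache.spark.SparkConf
    import scala.collection.JavaConverters._

    val config = ConfigFactory.parseFile(new java.io.File("application.conf"))

    val sparkConf = new SparkConf()
    config.getConfig("spark").entrySet().asScala.foreach { e =>
      // Keys under `spark { ... }` become spark.master, spark.cleaner.ttl, ...
      sparkConf.set("spark." + e.getKey, e.getValue.unwrapped().toString)
    }

    val jobConfig = config.getConfig("job")   // job-specific settings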

Re: Implementing FIRST_VALUE, LEAD, LAG in Spark

2015-02-17 Thread Dmitry Tolpeko
I ended up with the following: def firstValue(items: Iterable[String]) = for { i <- items } yield (i, items.head) data.groupByKey().map { case (a, b) => (a, firstValue(b)) }.collect More details: http://dmtolpeko.com/2015/02/17/first_value-last_value-lead-and-lag-in-spark/ I would appreciate any

Large Similarity Job failing

2015-02-17 Thread Debasish Das
Hi, I am running brute force similarity from RowMatrix on a job with a 5M x 1.5M sparse matrix with 800M entries. With 200M entries the job runs fine, but with 800M I am getting exceptions like too many files open and no space left on device... Seems like I need more nodes or use DIMSUM sampling?

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-17 Thread Imran Rashid
Hi Emre, there shouldn't be any difference in which files get processed w/ print() vs. foreachRDD(). In fact, if you look at the definition of print(), it is just calling foreachRDD() underneath. So there is something else going on here. We need a little more information to figure out exactly

MapValues and Shuffle Reads

2015-02-17 Thread Darin McBeath
In the following code, I read in a large sequence file from S3 (1TB) spread across 1024 partitions. When I look at the job/stage summary, I see about 400GB of shuffle writes which seems to make sense as I'm doing a hash partition on this file. // Get the baseline input file

Re: OOM error

2015-02-17 Thread Harshvardhan Chauhan
Thanks for the pointer it led me to http://spark.apache.org/docs/1.2.0/tuning.html increasing parallelism resolved the issue. On Mon, Feb 16, 2015 at 11:57 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Increase your executor memory, Also you can play around with increasing the number of

Re: MapValues and Shuffle Reads

2015-02-17 Thread Imran Rashid
Hi Darin, When you say you see 400GB of shuffle writes from the first code snippet, what do you mean? There is no action in that first set, so it won't do anything. By itself, it won't do any shuffle writing, or anything else for that matter. Most likely, the .count() on your second code

Re: Percentile example

2015-02-17 Thread Kohler, Curt E (ELS-STL)
The best approach I've found to calculate percentiles in Spark is to leverage Spark SQL. If you use the Hive Query Language support, you can use the UDAFs for percentiles (as of Spark 1.2). Something like this (Note: syntax not guaranteed to run but should give you the gist of what you need
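
A sketch in that spirit, using HiveContext and the percentile_approx UDAF; the table and column names are placeholders and, as noted above, the syntax is not guaranteed:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql(
      "SELECT percentile_approx(response_time, array(0.5, 0.95, 0.99)) FROM request_log"
    ).collect().foreach(println)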

ERROR actor.OneForOneStrategy: org.apache.hadoop.conf.Configuration

2015-02-17 Thread Jadhav Shweta
Hi, I am running a streaming word count program in Spark Standalone mode cluster, having four machines in the cluster. public final class JavaKafkaStreamingWordCount { private static final Pattern SPACE = Pattern.compile(" "); static transient Configuration conf; private

Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Julaiti Alafate
The raw data is ~30 GB. It consists of 250 million sentences. The total length of the documents (i.e. the sum of the length of all sentences) is 11 billion. I also ran a simple algorithm to roughly count the maximum number of word pairs by summing up d * (d - 1) over all sentences, where d is

Hive, Spark, Cassandra, Tableau, BI, etc.

2015-02-17 Thread Ashic Mahtab
Hi, I've seen a few articles where they use CqlStorageHandler to create Hive tables referencing Cassandra data using the Thrift server. Is there a secret to getting this to work? I've basically got Spark built with Hive, and a Cassandra cluster. Is there a way to get the Hive server to talk to

Re: MapValues and Shuffle Reads

2015-02-17 Thread Darin McBeath
Thanks Imran. I think you are probably correct. I was a bit surprised that there was no shuffle read in the initial hash partition step. I will adjust the code as you suggest to prove that is the case. I have a slightly different question. If I save an RDD to S3 (or some equivalent) and this

Re: Stepsize with Linear Regression

2015-02-17 Thread Xiangrui Meng
The best step size depends on the condition number of the problem. You can try some conditioning heuristics first, e.g., normalizing the columns, and then try a common step size like 0.01. We should implement line search for linear regression in the future, as in LogisticRegressionWithLBFGS. Line
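
A sketch of that heuristic with MLlib, assuming an RDD[LabeledPoint] called data; the iteration count and step size are just the common starting values mentioned above:

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // Standardize the feature columns to improve the conditioning of the problem.
    val scaler = new StandardScaler(withMean = true, withStd = true).fit(data.map(_.features))
    val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features)))

    // Then try a common step size such as 0.01.
    val model = LinearRegressionWithSGD.train(scaled, 100, 0.01)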

Re: high GC in the Kmeans algorithm

2015-02-17 Thread Xiangrui Meng
Did you cache the data? Was it fully cached? The k-means implementation doesn't create many temporary objects. I guess you need more RAM to avoid GC triggered frequently. Please monitor the memory usage using YourKit or VisualVM. -Xiangrui On Wed, Feb 11, 2015 at 1:35 AM, lihu lihu...@gmail.com

Re: feeding DataFrames into predictive algorithms

2015-02-17 Thread Xiangrui Meng
Hey Sandy, The work should be done by a VectorAssembler, which combines multiple columns (double/int/vector) into a vector column, which becomes the features column for regression. We are going to create JIRAs for each of these standard feature transformers. It would be great if you can help

Re: MLib usage on Spark Streaming

2015-02-17 Thread Xiangrui Meng
JavaDStream.foreachRDD (https://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#foreachRDD(org.apache.spark.api.java.function.Function)) and Statistics.corr
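
The Scala equivalent of those two pointers, sketched for a stream whose records are assumed to carry two numeric fields:

    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.streaming.dstream.DStream

    def correlatePerBatch(stream: DStream[(Double, Double)]): Unit = {
      stream.foreachRDD { rdd =>
        if (rdd.count() > 0) {
          // Pearson correlation between the two series in this micro-batch.
          val corr = Statistics.corr(rdd.map(_._1), rdd.map(_._2))
          println(s"correlation for this batch: $corr")
        }
      }
    }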

Re: Large Similarity Job failing

2015-02-17 Thread Xiangrui Meng
The complexity of DIMSUM is independent of the number of rows but still has a quadratic dependency on the number of columns. 1.5M columns may be too large to use DIMSUM. Try to increase the threshold and see whether it helps. -Xiangrui On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das
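
For reference, raising the threshold is just the optional argument to columnSimilarities; the value below is only an example to experiment with, and rows stands for the RDD[Vector] backing the matrix:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val mat = new RowMatrix(rows)
    // Higher thresholds make DIMSUM sample more aggressively and generate fewer candidate pairs.
    val similarities = mat.columnSimilarities(0.5)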

Re: [POWERED BY] Radius Intelligence

2015-02-17 Thread Xiangrui Meng
Thanks! I added Radius to https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. -Xiangrui On Tue, Feb 10, 2015 at 12:02 AM, Alexis Roos alexis.r...@gmail.com wrote: Also long due given our usage of Spark .. Radius Intelligence: URL: radius.com Description: Spark, MLLib Using

Re: Unknown sample in Naive Baye's

2015-02-17 Thread Xiangrui Meng
If there exists a sample that doesn't belong to A/B/C, it means that there exists another class D or Unknown besides A/B/C. You should have some of these samples in the training set in order to let naive Bayes learn the priors. -Xiangrui On Tue, Feb 10, 2015 at 10:44 PM, jatinpreet

Re: Naive Bayes model fails after a few predictions

2015-02-17 Thread Xiangrui Meng
Could you share the error log? What do you mean by 500 instead of 200? If this is the number of files, try to use `repartition` before calling naive Bayes, which works the best when the number of partitions matches the number of cores, or even less. -Xiangrui On Tue, Feb 10, 2015 at 10:34 PM,
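
A sketch of the repartition suggestion, assuming a hypothetical RDD[LabeledPoint] called trainingData and a known core count:

    import org.apache.spark.mllib.classification.NaiveBayes

    val numCores = 16   // placeholder: match the number of cores in the cluster
    val repartitioned = trainingData.repartition(numCores)
    val model = NaiveBayes.train(repartitioned, lambda = 1.0)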

How do you get the partitioner for an RDD in Java?

2015-02-17 Thread Darin McBeath
In an 'early release' of the Learning Spark book, there is the following reference: In Scala and Java, you can determine how an RDD is partitioned using its partitioner property (or partitioner() method in Java) However, I don't see the mentioned 'partitioner()' method in Spark 1.2 or a way

Re: WARN from Similarity Calculation

2015-02-17 Thread Xiangrui Meng
It may be caused by GC pause. Did you check the GC time in the Spark UI? -Xiangrui On Sun, Feb 15, 2015 at 8:10 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am sometimes getting WARN from running Similarity calculation: 15/02/15 23:07:55 WARN BlockManagerMasterActor: Removing

Re: Processing graphs

2015-02-17 Thread Yifan LI
Hi Kannan, I am not sure I have understood exactly what your question is, but maybe the reduceByKey or reduceByKeyLocally functionality better suits your needs. Best, Yifan LI On 17 Feb 2015, at 17:37, Vijayasarathy Kannan kvi...@vt.edu wrote: Hi, I am working on a Spark
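
A sketch of the reduceByKey suggestion for the vertex/outgoing-edge grouping described in the quoted message, assuming edges is an RDD of (sourceVertexId, edge) pairs:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    val outgoing: RDD[(Long, Set[String])] =
      edges.map { case (src, edge) => (src, Set(edge)) }
           .reduceByKey(_ ++ _)   // merge the edge sets per vertex, combining map-side first

    // reduceByKeyLocally(_ ++ _) would instead return the merged result as a Map
    // on the driver, which only makes sense if it fits in driver memory.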

RE: Use of nscala-time within spark-shell

2015-02-17 Thread Hammam CHAMSI
I can use nscala-time with Scala, but my issue is that I can't use it within the spark-shell console! It gives me the error below. Thanks From: kevin...@apache.org Date: Tue, 17 Feb 2015 08:50:04 + Subject: Re: Use of nscala-time within spark-shell To: hscha...@hotmail.com; kevin...@apache.org;

RE: Use of nscala-time within spark-shell

2015-02-17 Thread Hammam CHAMSI
Thanks Kevin for your reply, I downloaded the pre_built version and as you said the default spark scala version is 2.10. I'm now building spark 1.2.1 with scala 2.11. I'll share the results here. Regards, From: kevin...@apache.org Date: Tue, 17 Feb 2015 01:10:09 + Subject: Re: Use of

Re: Use of nscala-time within spark-shell

2015-02-17 Thread Kevin (Sangwoo) Kim
Great, or you can just use nscala-time with scala 2.10! On Tue Feb 17 2015 at 5:41:53 PM Hammam CHAMSI hscha...@hotmail.com wrote: Thanks Kevin for your reply, I downloaded the pre_built version and as you said the default spark scala version is 2.10. I'm now building spark 1.2.1 with scala

Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Akhil Das
What application are you running? Here's a few things: - You will hit a bottleneck on CPU if you are doing some complex computation (like parsing JSON, etc.) - You will hit a bottleneck on memory if the data/objects used in the program are large (like playing with HashMaps etc. inside your

Re: Use of nscala-time within spark-shell

2015-02-17 Thread Kevin (Sangwoo) Kim
Then, why don't you use nscala-time_2.10-1.8.0.jar instead of nscala-time_2.11-1.8.0.jar? On Tue Feb 17 2015 at 5:55:50 PM Hammam CHAMSI hscha...@hotmail.com wrote: I can use nscala-time with Scala, but my issue is that I can't use it within the spark-shell console! It gives me the error below.

RE: How do you get the partitioner for an RDD in Java?

2015-02-17 Thread Mohammed Guller
Where did you look? BTW, it is defined in the RDD class as a val: val partitioner: Option[Partitioner] Mohammed -Original Message- From: Darin McBeath [mailto:ddmcbe...@yahoo.com.INVALID] Sent: Tuesday, February 17, 2015 1:45 PM To: User Subject: How do you get the partitioner for

RE: Use of nscala-time within spark-shell

2015-02-17 Thread Hammam CHAMSI
My fault, I didn't notice the 11 in the jar name. It is working now with nscala-time_2.10-1.8.0.jar Thanks Kevin From: kevin...@apache.org Date: Tue, 17 Feb 2015 08:58:13 + Subject: Re: Use of nscala-time within spark-shell To: hscha...@hotmail.com; kevin...@apache.org;

Re: Configration Problem? (need help to get Spark job executed)

2015-02-17 Thread Arush Kharbanda
Hi, It could be due to a connectivity issue between the master and the slaves. I have seen this issue occur for the following reasons. Are the slaves visible in the Spark UI? And how much memory is allocated to the executors? 1. Syncing of configuration between Spark master and slaves. 2.

Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Julaiti Alafate
Thank you very much for your reply! My task is to count the number of word pairs in a document. If w1 and w2 occur together in one sentence, the count of the word pair (w1, w2) increases by 1. So the computational part of this algorithm is simply a two-level for-loop. Since the cluster is
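
A sketch of that computation in Spark, assuming sentences is an RDD[Seq[String]] of tokenized sentences; the two-level loop over each sentence becomes a flatMap followed by reduceByKey:

    import org.apache.spark.SparkContext._

    val pairCounts = sentences.flatMap { words =>
      for {
        w1 <- words
        w2 <- words
        if w1 != w2
      } yield ((w1, w2), 1L)
    }.reduceByKey(_ + _)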

Re: ERROR actor.OneForOneStrategy: org.apache.hadoop.conf.Configuration

2015-02-17 Thread Sean Owen
Tip: to debug where the unserializable reference comes from, run with -Dsun.io.serialization.extendeddebuginfo=true On Tue, Feb 17, 2015 at 10:20 AM, Jadhav Shweta jadhav.shw...@tcs.com wrote: 15/02/17 12:57:10 ERROR actor.OneForOneStrategy: org.apache.hadoop.conf.Configuration

Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Arush Kharbanda
Hi How big is your dataset? Thanks Arush On Tue, Feb 17, 2015 at 4:06 PM, Julaiti Alafate jalaf...@eng.ucsd.edu wrote: Thank you very much for your reply! My task is to count the number of word pairs in a document. If w1 and w2 occur together in one sentence, the number of occurrence of

Cleanup Questions

2015-02-17 Thread Ashic Mahtab
Two questions regarding worker cleanup: 1) Is the best place to enable worker cleanup setting export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=30" in conf/spark-env.sh for each worker? Or is there a better place? 2) I see this has a default TTL of 7

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-17 Thread Emre Sevinc
I've decided to try spark-submit ... --conf "spark.driver.extraJavaOptions=-DpropertiesFile=/home/emre/data/myModule.properties" But when I try to retrieve the value of propertiesFile via System.err.println("propertiesFile : " + System.getProperty("propertiesFile")); I get NULL: