Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-10 Thread Nick Pentreath
Can you key your RDD by some key and use reduceByKey? In fact, if you are merging a bunch of maps, you can create a set of (k, v) pairs in your mapPartitions and then reduceByKey using some merge function. The reduce will happen in parallel on multiple nodes in this case. You'll end up with just a single
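
A minimal sketch of that pattern, assuming the input is an RDD[String] and each partition builds a map of counts emitted as (k, v) pairs (app name and paths are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD functions (needed in Spark 1.0)

    object MergeMaps {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("merge-maps"))
        val words = sc.textFile("hdfs:///input/words")  // hypothetical input

        val merged = words
          .mapPartitions { iter =>
            // Build one map per partition, then emit its entries as (k, v) pairs.
            val counts = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
            iter.foreach(w => counts(w) += 1)
            counts.iterator
          }
          .reduceByKey(_ + _)  // the merge runs in parallel across executors

        merged.saveAsTextFile("hdfs:///output/merged")
      }
    }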

Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-10 Thread DB Tsai
Hi Nick, How does reduce work? I thought that after reducing within each executor, it would reduce in parallel across multiple executors instead of pulling everything to the driver and reducing there. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Writing data to HBase using Spark

2014-06-10 Thread Vibhor Banga
Hi, I am reading data from an HBase table into an RDD and then, using foreach on that RDD, I am doing some processing on every Result of the HBase table. After this processing I want to store the processed data back to another HBase table. How can I do that? If I use standard Hadoop and HBase classes to
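
One common pattern (a sketch, not from this thread): open one HTable per partition on the workers and write Puts from there. The table and column names, and the assumption that processedRdd is an RDD[(String, String)] of (rowKey, value), are hypothetical; this uses the HBase client API current in 2014:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    processedRdd.foreachPartition { rows =>
      val conf = HBaseConfiguration.create()       // picks up hbase-site.xml
      val table = new HTable(conf, "output_table") // hypothetical table name
      rows.foreach { case (rowKey, value) =>
        val put = new Put(Bytes.toBytes(rowKey))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
        table.put(put)
      }
      table.close()
    }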

Re: Writing data to HBase using Spark

2014-06-10 Thread Kanwaldeep
Please see the sample code attached at https://issues.apache.org/jira/browse/SPARK-944.

Shark over Spark-Streaming

2014-06-10 Thread praveshjain1991
Is it possible to use Shark over streaming data? I did not find any mention of that on the website. When you run Shark, it gives you a shell to run your queries for stored data. Is there any way to do the same over streaming data? -- Thanks

RE: Is Spark-1.0.0 not backward compatible with Shark-0.9.1 ?

2014-06-10 Thread Cheng, Hao
And if you want to use the SQL CLI (based on Catalyst) as it works in Shark, you can also check out https://github.com/amplab/shark/pull/337 :) This preview version doesn't require Hive to be set up in the cluster. (Don't forget to also put hive-site.xml under SHARK_HOME/conf.) Cheng Hao

Re: Spark 1.0.0 Maven dependencies problems.

2014-06-10 Thread toivoa
Thanks for the hint. I removed the signature info from the same jar and the JVM is happy now. But a problem remains: several copies of the same jar with different versions, which is not good. Spark itself is very, very promising; I am very excited. Thank you all toivo

Problem in Spark Streaming

2014-06-10 Thread nilmish
I am running a Spark Streaming job to count the top 10 hashtags over the last 5-minute window, querying every 1 sec. It takes approx 1.4 sec (end-to-end delay) to answer most queries, but there are a few instances in between when it takes considerably more time (around 15 sec) due to
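
For reference, a sketch of such a job (not the poster's actual code), assuming an existing StreamingContext `ssc` and a DStream[String] of hashtags named `stream`; the inverse-reduce variant of the windowed count requires checkpointing:

    import org.apache.spark.SparkContext._
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations

    ssc.checkpoint("hdfs:///checkpoints")  // required by the inverse-reduce below

    val counts = stream
      .map(tag => (tag, 1))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(300), Seconds(1))

    counts.foreachRDD { rdd =>
      // The 10 most frequent hashtags in the current 5-minute window.
      val top10 = rdd.map(_.swap).sortByKey(ascending = false).take(10)
      top10.foreach(println)
    }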

Re: Problem in Spark Streaming

2014-06-10 Thread Yingjun Wu
Hi Nilmish, I am facing the same problem. I am wondering how you measure the latency? Regards, Yingjun

Re: Problem in Spark Streaming

2014-06-10 Thread nilmish
You can measure the latency from the logs. Search for phrases like Total delay in the logs; this denotes the total end-to-end delay for a particular query.

pmml with augustus

2014-06-10 Thread filipus
Hello guys, has anybody had experience with the Augustus library as a serializer for scoring models? It looks very promising, and I even found a hint about a connection between Augustus and Spark. All the best

Re: pmml with augustus

2014-06-10 Thread Sean Owen
It's worth mentioning that Augustus is a Python-based library. On a related note, in Java-land, I have had good experiences with jpmml's projects.

Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Nilesh Chakraborty
Hello! Spark Streaming supports HDFS as an input source, and also Akka actor receivers, or TCP socket receivers. For my use case I think it's probably more convenient to read the data directly from Actors, because I already need to set up a multi-node Akka cluster (on the same nodes that Spark runs
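
A minimal receiver sketch, assuming Spark 1.0's actorStream API and an existing StreamingContext `ssc`; upstream Akka actors would send String messages to the named receiver:

    import akka.actor.{Actor, Props}
    import org.apache.spark.streaming.receiver.ActorHelper

    class LineReceiver extends Actor with ActorHelper {
      def receive = {
        case line: String => store(line)  // hand each message to Spark Streaming
      }
    }

    val lines = ssc.actorStream[String](Props[LineReceiver], "line-receiver")
    lines.print()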

Calling JavaPairRDD.first after calling JavaPairRDD.groupByKey results in NullPointerException

2014-06-10 Thread Gaurav Jain
I am getting a strange null pointer exception when trying to list the first entry of a JavaPairRDD after calling groupByKey on it. Following is my code: JavaPairRDD<Tuple3<String, String, String>, List<String>> KeyToAppList = KeyToApp.distinct().groupByKey();

Re: Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Michael Cutler
Hey Nilesh, Great to hear you're using Spark Streaming. In my opinion the crux of your question comes down to what you want to do with the data in the future and/or whether there is utility in using it from more than one Spark/Streaming job. 1) *One-time-use, fire and forget* - as you rightly point

Re: Problem in Spark Streaming

2014-06-10 Thread Boduo Li
Hi Nilmish, What's the data rate/node when you see the high latency? (It seems the latency keeps increasing.) Do you still see it if you lower the data rate or the frequency of the windowed query?

Re: Problem in Spark Streaming

2014-06-10 Thread nilmish
How can I measure the data rate/node? I am feeding the data through the Kafka API. I only know the total inflow data rate, which remains almost constant. How can I figure out what amount of data is distributed to the nodes in my cluster? Latency does not keep on increasing infinitely. It goes up for

Re: abnormal latency when running Spark Streaming

2014-06-10 Thread Boduo Li
Hi Yingjun, Do you see a stable latency or does the latency keep increasing? And could you provide some details about the input data rate/node, batch interval, windowDuration and slideDuration when you see the high latency?

Spark Streaming socketTextStream

2014-06-10 Thread fredwolfinger
Good morning, I have taken the socketTextStream example and instead of running on a local Spark instance, I have pushed it to my Spark cluster in AWS (1 master with 5 slave nodes). I am getting the following error that appears to indicate that all the slaves are trying to read from localhost:

Re: Problem in Spark Streaming

2014-06-10 Thread Boduo Li
Oh, I mean the average data rate/node. But when I want to know the input activity on each node (I use a custom receiver instead of Kafka), I usually search for these records in the logs to get a sense: BlockManagerInfo: Added input ... on [hostname:port] (size: xxx KB) I also see some spikes in

Re: Spark Streaming socketTextStream

2014-06-10 Thread Akhil Das
You can use the master's IP address (or whichever machine you chose to run the nc command on) instead of localhost.
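
For example (a sketch, assuming an existing SparkContext `sc`; the hostname is a placeholder for whichever machine runs `nc -lk 9999`):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    // Connect to the host actually running `nc`, not localhost on each slave.
    val lines = ssc.socketTextStream("ec2-xx-xx-xx-xx.compute-1.amazonaws.com", 9999)
    lines.print()
    ssc.start()
    ssc.awaitTermination()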

Re: Spark Streaming socketTextStream

2014-06-10 Thread fredwolfinger
Worked! Thanks so much! Fred

Re: pmml with augustus

2014-06-10 Thread Evan R. Sparks
I should point out that if you don't want to take a polyglot approach to languages and would rather reside solely in the JVM, you can just use plain old Java serialization on the Model objects that come out of MLlib's APIs from Java or Scala, load them up in another process, and call the relevant
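
A sketch of that approach for a simple model such as a LogisticRegressionModel (weights plus an intercept, which serialize cleanly); models wrapping RDDs, like ALS's MatrixFactorizationModel, are not portable across processes this way. Paths and feature values below are hypothetical:

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
    import org.apache.spark.mllib.classification.LogisticRegressionModel
    import org.apache.spark.mllib.linalg.Vectors

    // In the training process:
    val out = new ObjectOutputStream(new FileOutputStream("/tmp/model.bin"))
    out.writeObject(model)  // `model` is assumed to be a trained LogisticRegressionModel
    out.close()

    // Later, in another JVM with Spark/MLlib on the classpath:
    val in = new ObjectInputStream(new FileInputStream("/tmp/model.bin"))
    val restored = in.readObject().asInstanceOf[LogisticRegressionModel]
    in.close()
    val score = restored.predict(Vectors.dense(0.5, 1.2, -0.3))  // hypothetical features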

HDFS Server/Client IPC version mismatch while trying to access HDFS files using Spark-0.9.1

2014-06-10 Thread bijoy deb
Hi all, I have built Shark-0.9.1 with sbt using the command below: *SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.6.0 sbt/sbt assembly* My Hadoop cluster is also on version 2.0.0-mr1-cdh4.6.0. But when I try to execute the below command from the Spark shell, which reads a file from HDFS, I get the IPC

Re: Can't find pyspark when using PySpark on YARN

2014-06-10 Thread Andrew Or
Hi Qi Ping, You don't have to distribute these files; they are automatically packaged in the assembly jar, which is already shipped to the worker nodes. Other people have run into the same issue. See if the instructions here are of any help:

Re: Spark Logging

2014-06-10 Thread Surendranauth Hiraman
Event logs are different from writing using a logger, like log4j. The event logs are the type of data showing up in the history server. For my team, we use com.typesafe.scalalogging.slf4j.Logging. Our logs show up in /etc/spark/work/app-id/executor-id/stderr and stdout. All of our logging seems
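
A minimal sketch of that setup (scalalogging 1.x; the class and messages are hypothetical): the Logging trait provides a `logger` backed by slf4j (log4j underneath), so output lands in the executor stderr/stdout files mentioned above.

    import com.typesafe.scalalogging.slf4j.Logging

    class Processor extends Logging {
      def run(records: Seq[String]) {
        logger.info(s"processing ${records.size} records")
        records.foreach(r => logger.debug(s"record: $r"))
      }
    }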

getting started with mllib.recommendation.ALS

2014-06-10 Thread Sandeep Parikh
Question on the input and output for ALS.train() and MatrixFactorizationModel.predict(). My input is a list of Ratings(user_id, product_id, rating) and my ratings are on a scale of 1-5 (inclusive). When I compute predictions over the superset of all (user_id, product_id) pairs, the ratings
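
For context, a sketch of the train/predict flow in question, assuming an existing SparkContext `sc` (the file path, rank, iterations, and lambda are placeholders):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    val model = ALS.train(ratings, 10, 20, 0.01)  // rank, iterations, lambda

    val userProducts = ratings.map(r => (r.user, r.product))
    val predictions = model.predict(userProducts)
      // Predicted ratings are not bounded to the input scale; clamp to [1, 5] if needed.
      .map(r => Rating(r.user, r.product, math.max(1.0, math.min(5.0, r.rating))))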

Re: NoSuchMethodError in KafkaReciever

2014-06-10 Thread mpieck
Hi, I have the same problem when running a Kafka to Spark Streaming pipeline from Java with explicitly specified message decoders. I had thought it was related to the Eclipse environment, as suggested here, but that's not the case. I have coded an example based on class:

Re: NoSuchMethodError in KafkaReciever

2014-06-10 Thread Michael Chang
I had this same problem as well. I ended up just adding the necessary code in KafkaUtils and compiling my own Spark jar. Something like this for the raw stream: def createRawStream( jssc: JavaStreamingContext, kafkaParams: JMap[String, String], topics: JMap[String, JInt]

Re: getting started with mllib.recommendation.ALS

2014-06-10 Thread Sean Owen
For trainImplicit(), the output is an approximation of a matrix of 0s and 1s, so the values are generally (not always) in [0,1] But for train(), you should be predicting the original input matrix as-is, as I understand. You should get output in about the same range as the input but again not

Re: NoSuchMethodError in KafkaReciever

2014-06-10 Thread Sean Owen
I added https://issues.apache.org/jira/browse/SPARK-2103 to track this. I also ran into it. I don't have a fix, but, somehow I think someone with more understanding of Scala and Manifest objects might see the easy fix. On Tue, Jun 10, 2014 at 5:15 PM, mpieck mpi...@gazeta.pl wrote: Hi, I have

Re: Information on Spark UI

2014-06-10 Thread coderxiang
The executors showing CANNOT FIND ADDRESS are not listed in the Executors tab at the top of the Spark UI.

spark streaming, kafka, SPARK_CLASSPATH

2014-06-10 Thread lannyripple
I am using Spark 1.0.0 compiled with Hadoop 1.2.1. I have a toy spark-streaming-kafka program. It reads from a Kafka queue and does stream.map { case (k, v) => (v, 1) }.reduceByKey(_ + _).print() using a 1 second interval on the stream. The docs say to make Spark and
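
The full toy pipeline would look roughly like this (a sketch, assuming an existing SparkContext `sc`; the ZooKeeper quorum, consumer group, and topic are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream operations
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(1))
    val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))

    stream
      .map { case (k, v) => (v, 1) }
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()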

groupBy question

2014-06-10 Thread SK
After doing a groupBy operation, I have the following result: val res = (ID1,ArrayBuffer((145804601,ID1,japan))) (ID3,ArrayBuffer((145865080,ID3,canada), (145899640,ID3,china))) (ID2,ArrayBuffer((145752760,ID2,usa), (145934200,ID2,usa))) Now I need to output for each group,
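
What exactly must be emitted per group is cut off above, so the following is only a hedged sketch of post-processing such a result, assuming `res` is an RDD[(String, Iterable[(Long, String, String)])]:

    val perGroup = res.map { case (id, records) =>
      val countries = records.map(_._3).toSet  // e.g. the distinct countries per ID
      (id, records.size, countries.mkString("|"))
    }
    perGroup.foreach(println)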

Monitoring spark dis-associated workers

2014-06-10 Thread Allen Chang
We're running into an issue where periodically the master loses connectivity with workers in the spark cluster. We believe this issue tends to manifest when the cluster is under heavy load, but we're not entirely sure when it happens. I've seen one or two other messages to this list about this

problem starting the history server on EC2

2014-06-10 Thread zhen
I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I do not seem to be able to start the history server on the master node. I used the following command: ./start-history-server.sh /root/spark_log The error message says that the logging directory /root/spark_log does not

Re: problem starting the history server on EC2

2014-06-10 Thread bc Wong
What's the permission on /root itself?

output tuples in CSV format

2014-06-10 Thread SK
My output is a set of tuples and when I output it using saveAsTextFile, my file looks as follows: (field1_tup1, field2_tup1, field3_tup1,...) (field1_tup2, field2_tup2, field3_tup2,...) In Spark, is there some way I can simply have it output in CSV format as follows (i.e. without the

Re: Using Spark on Data size larger than Memory size

2014-06-10 Thread Allen Chang
Thanks for the clarification. What is the proper way to configure RDDs when your aggregate data size exceeds your available working memory size? In particular, in addition to typical operations, I'm performing cogroups, joins, and coalesces/shuffles. I see that the default storage level for
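
A sketch of the usual knob for this situation (an assumption, not a quote from the thread): persist with a storage level that spills to disk, e.g.

    import org.apache.spark.storage.StorageLevel

    // Serialized in memory where it fits, spilled to disk where it doesn't.
    val persisted = bigRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)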

Re: Information on Spark UI

2014-06-10 Thread Neville Li
We are seeing this issue as well. We run on YARN and see logs about lost executors. It looks like some stages had to be re-run to compute RDD partitions lost in the executor. We were able to complete 20 iterations with a 20% full matrix but not beyond that (total 100GB).

Re: output tuples in CSV format

2014-06-10 Thread Mikhail Strebkov
you can just use something like this: myRdd.map(_.productIterator.mkString(",")).saveAsTextFile(...)

RE: output tuples in CSV format

2014-06-10 Thread Shao, Saisai
It would be better to add one more transformation step before saveAsTextFile, like: rdd.map(tuple => "%s,%s,%s".format(tuple._1, tuple._2, tuple._3)).saveAsTextFile(...) This manually converts the tuples to the format you want before writing to HDFS. Thanks Jerry

Re: How to process multiple classification with SVM in MLlib

2014-06-10 Thread littlebird
Thanks. Now I know how to broadcast the dataset, but I still wonder, after broadcasting the dataset, how can I apply my algorithm to train the model on the workers. To describe my question in detail, the following code is used to train an LDA (Latent Dirichlet Allocation) model with JGibbLDA in single
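
A hedged sketch of the broadcast-then-train-locally pattern, assuming an existing SparkContext `sc`; `corpus` is the dataset to share, and `trainLocalLda` is a hypothetical wrapper around the single-machine JGibbLDA trainer:

    val numWorkers = 4  // hypothetical degree of parallelism
    val docsBc = sc.broadcast(corpus)  // ship the dataset to every executor once

    val models = sc.parallelize(1 to numWorkers).map { seed =>
      // Each task reads the same broadcast corpus without reshipping it per task.
      trainLocalLda(docsBc.value, numTopics = 10, seed = seed)  // hypothetical trainer
    }.collect()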

Re: problem starting the history server on EC2

2014-06-10 Thread zhen
I checked the permission on root and it is the following: drwxr-xr-x 20 root root 4096 Jun 11 01:05 root So anyway, I changed to use /tmp/spark_log instead and this time I made sure that all permissions are given to /tmp and /tmp/spark_log like below. But it still does not work: drwxrwxrwt 8

Re: problem starting the history server on EC2

2014-06-10 Thread Andrew Or
Can you try file:/root/spark_log?

Re: How to process multiple classification with SVM in MLlib

2014-06-10 Thread littlebird
Someone suggested that I use Mahout, but I'm not familiar with it, and in that case using Mahout would add complexity to my program. I'd like to run the algorithm in Spark. I'm a beginner; can you give me some suggestions?

Question about RDD cache, unpersist, materialization

2014-06-10 Thread innowireless TaeYun Kim
Hi, What I (seem to) know about the RDD persisting API is as follows: - cache() and persist() are not actions. They only do a marking. - unpersist() is also not an action. It only removes a marking. But if the RDD is already in memory, it is unloaded. And there seems to be no API to forcefully
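
A sketch of the usual workaround for forcing materialization: cache() only marks the RDD, so run an action that touches every partition (count() rather than first(), since first() may compute only one partition):

    val cached = rdd.cache()
    cached.count()      // materializes all partitions into the cache
    // ... reuse `cached` across several jobs ...
    cached.unpersist()  // drop the cached blocks when no longer needed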

Re: getting started with mllib.recommendation.ALS

2014-06-10 Thread Sandeep Parikh
Thanks Sean. I realized that I was supplying train() with a very low rank, so I will retry with something higher and then play with lambda as needed.

RE: Question about RDD cache, unpersist, materialization

2014-06-10 Thread innowireless TaeYun Kim
BTW, it is possible that rdd.first() does not compute all of the partitions. So, first() cannot be used for the situation below.

Re: Problem in Spark Streaming

2014-06-10 Thread Ashish Rangole
Have you considered the garbage collection impact and whether it coincides with your latency spikes? You can enable GC logging by changing the Spark configuration for your job. Hi, as I searched for the keyword Total delay in the console log, the delay keeps increasing. I am not sure what this total delay
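
For example (a sketch; these are standard JVM GC flags passed through spark.executor.extraJavaOptions, which can also be supplied via spark-submit --conf):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-job")  // hypothetical app name
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")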

Re: problem starting the history server on EC2

2014-06-10 Thread Andrew Or
No, I meant pass the path to the history server start script. 2014-06-10 19:33 GMT-07:00 zhen z...@latrobe.edu.au: Sure here it is: drwxrwxrwx 2 1000 root 4096 Jun 11 01:05 spark_logs Zhen

Re: How to specify executor memory in EC2 ?

2014-06-10 Thread Matei Zaharia
It might be that conf/spark-env.sh on EC2 is configured to set it to 512, and is overriding the application’s settings. Take a look in there and delete that line if possible. Matei On Jun 10, 2014, at 2:38 PM, Aliaksei Litouka aliaksei.lito...@gmail.com wrote: I am testing my application in
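
The offending line would look something like this (hypothetical; the exact variable name depends on the Spark version):

    # conf/spark-env.sh on the EC2 master
    export SPARK_EXECUTOR_MEMORY=512m   # remove or raise this so the app's setting wins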

Re: problem starting the history server on EC2

2014-06-10 Thread Krishna Sankar
Yep, it gives tons of errors. I was able to make it work with sudo. Looks like an ownership issue. Cheers k/