merge elements in a Spark RDD under custom condition

2014-12-01 Thread Pengcheng YIN
Hi Pro, I want to merge elements in a Spark RDD when two elements satisfy a certain condition. Suppose there is an RDD[Seq[Int]], where some Seq[Int] in this RDD contain overlapping elements. The task is to merge all overlapping Seq[Int] in this RDD and store the result into a new RDD. For
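
One common way to approach this is to treat it as a connected-components problem. A minimal sketch, assuming Spark 1.1 with GraphX available and an input RDD named seqs (all names are illustrative):

    import org.apache.spark.graphx.{Edge, Graph}

    // give every Seq[Int] a stable id
    val indexed = seqs.zipWithIndex().map { case (s, id) => (id, s) }
    // invert: each element points at the ids of the Seqs containing it
    val byElem = indexed.flatMap { case (id, s) => s.map(e => (e, id)) }
    // Seqs sharing an element become edges between their ids
    val edges = byElem.groupByKey().flatMap { case (_, ids) =>
      val l = ids.toSeq
      l.tail.map(other => Edge(l.head, other, 1))
    }
    // connected components group all transitively overlapping Seqs
    val cc = Graph(indexed, edges).connectedComponents().vertices
    val merged = cc.join(indexed)
      .map { case (_, (comp, s)) => (comp, s.toSet) }
      .reduceByKey(_ ++ _)
      .map { case (_, elems) => elems.toSeq }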

Re: java.io.InvalidClassException: org.apache.spark.api.java.JavaUtils$SerializableMapWrapper; no valid constructor

2014-12-01 Thread lokeshkumar
The workaround was to wrap the map returned by the Spark libraries in a HashMap and then broadcast it. Could anyone please let me know if there is an open issue for this?
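
A rough sketch of that workaround, assuming the value returned by the Spark API is a java.util.Map and an existing SparkContext sc (names are illustrative):

    import java.util.HashMap

    // copy the non-serializable wrapper into a plain HashMap before broadcasting
    // wrappedMap: the map returned by the Spark API call
    val plainMap = new HashMap[String, String](wrappedMap)
    val broadcastMap = sc.broadcast(plainMap)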

akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down

2014-12-01 Thread Alexey Romanchuk
Hello Spark users! I found lots of strange messages in the driver log. Here is one: 2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25] ERROR

Re: java.io.InvalidClassException: org.apache.spark.api.java.JavaUtils$SerializableMapWrapper; no valid constructor

2014-12-01 Thread Josh Rosen
SerializableMapWrapper was added in https://issues.apache.org/jira/browse/SPARK-3926; do you mind opening a new JIRA and linking it to that one? On Mon, Dec 1, 2014 at 12:17 AM, lokeshkumar lok...@dataken.net wrote: The workaround was to wrap the map returned by spark libraries into HashMap

Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

2014-12-01 Thread cjdc
Hi Vikas and Simone, thanks for the replies. Yeah, I understand this would be easier with 1.2, but this is completely out of my control; I really have to work with 1.0.0. About Simone's approach, during the imports I get: import org.apache.avro.mapreduce.{ AvroJob, AvroKeyInputFormat,
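
For reference, one commonly used way to read Avro files into an RDD on these versions looks roughly like this (a sketch, assuming the avro-mapred hadoop2 build is on the classpath; the path is illustrative and snappy decompression is handled by the input format):

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    // read the Avro file via the new Hadoop API input format
    val avroRdd = sc.newAPIHadoopFile(
      "hdfs:///data/events.avro",
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable])
    // pull the GenericRecord out of each AvroKey
    val records = avroRdd.map { case (key, _) => key.datum() }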

Re: Creating a SchemaRDD from an existing API

2014-12-01 Thread Niranda Perera
Hi Michael, about this new data source API: what type of data sources would it support? Does it necessarily have to be an RDBMS? Cheers On Sat, Nov 29, 2014 at 12:57 AM, Michael Armbrust mich...@databricks.com wrote: You probably don't need to create a new kind of SchemaRDD. Instead I'd

RE: Unable to compile spark 1.1.0 on windows 8.1

2014-12-01 Thread Ishwardeep Singh
Hi Judy, thank you for your response. When I try to compile using Maven (mvn -Dhadoop.version=1.2.1 -DskipTests clean package) I get the error "Error: Could not find or load main class". I have Maven 3.0.4. And when I run the command sbt package I get the same exception as earlier. I have done the

Kryo exception for CassandraSQLRow

2014-12-01 Thread shahab
I am using the Cassandra-Spark connector to pull data from Cassandra, process it, and write it back to Cassandra. Now I am getting the following exception, and apparently it is a Kryo serialization issue. Does anyone know what the reason is and how this can be solved? I also tried to register
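
For what it's worth, explicitly registering the connector's row class with Kryo usually looks something like this (a sketch; the exact package and class names depend on the connector version and are assumptions here):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class MyCassandraRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // assumed class name for the connector's SQL row type
        kryo.register(classOf[org.apache.spark.sql.cassandra.CassandraSQLRow])
      }
    }

    // then on the SparkConf:
    // .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // .set("spark.kryo.registrator", "com.example.MyCassandraRegistrator")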

Spark 1.1.0: weird spark-shell behavior

2014-12-01 Thread Reinis Vicups
Hello, I have two weird effects when working with spark-shell: 1. This code executed in spark-shell causes the exception below. At the same time, it works perfectly when submitted with spark-submit!: import org.apache.hadoop.hbase.{HConstants, HBaseConfiguration} import

Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread m.sarosh
Hi, I am integrating Kafka and Spark, using spark-streaming. I have created a topic with the Kafka tools: bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test I am publishing messages to Kafka and trying to read them using

Re: Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread Akhil Das
It says: 14/11/27 11:56:05 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory. A quick guess would be that you are giving the wrong master URL (spark://192.168.88.130:7077). Open the

RE: Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread m.sarosh
Hi, the Spark master is working, and I have given the same URL in the code (inline screenshot attached). The warning is gone, and the new log is: --- Time: 141742785 ms --- INFO

RE: Kryo exception for CassandraSQLRow

2014-12-01 Thread Ashic Mahtab
Don't know if this'll solve it, but if you're on Spark 1.1, the Cassandra connector version 1.1.0 final fixed the Guava backwards-compatibility issue. Maybe revisiting the Guava exclusions might help? Date: Mon, 1 Dec 2014 10:48:25 +0100 Subject: Kryo exception for CassandraSQLRow From: shahab.mok...@gmail.com

Re: Kafka+Spark-streaming issue: Stream 0 received 0 blocks

2014-12-01 Thread Akhil Das
I see you have no worker machines to execute the job [image: Inline image 1]. You haven't configured your Spark cluster properly. A quick fix to get it running would be to run it in local mode; for that, change this line: JavaStreamingContext jssc = new JavaStreamingContext(spark://
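
In Scala, the local-mode variant of that suggestion looks roughly like this (a sketch, assuming the spark-streaming-kafka artifact is on the classpath and the topic from the earlier message; at least two local threads are needed so the receiver does not starve the processing tasks):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaTest")
    val ssc = new StreamingContext(conf, Seconds(10))
    // zkQuorum, consumer group, and topic -> receiver-thread map
    val messages = KafkaUtils.createStream(ssc, "localhost:2181", "test-group", Map("test" -> 1))
    messages.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()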

Re: Setting network variables in spark-shell

2014-12-01 Thread Shixiong Zhu
Don't set `spark.akka.frameSize` to 1. The max value of `spark.akka.frameSize` is 2047. The unit is MB. Best Regards, Shixiong Zhu 2014-12-01 0:51 GMT+08:00 Yanbo yanboha...@gmail.com: Try to use spark-shell --conf spark.akka.frameSize=1 On 2014-12-01, at 12:25 AM, Brian Dolan
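
So, in application code, a value such as 128 (MB) would be set like this (a sketch; the value itself is illustrative, and from spark-shell it goes on the --conf flag as in Yanbo's message):

    import org.apache.spark.SparkConf

    // spark.akka.frameSize is specified in MB, capped at 2047
    val conf = new SparkConf().set("spark.akka.frameSize", "128")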

Is Spark the right tool for me?

2014-12-01 Thread Stadin, Benjamin
Hi all, I need some advice on whether Spark is the right tool for my zoo. My requirements share commonalities with "big data", workflow coordination and "reactive" event-driven data processing (as in, for example, Haskell Arrows), which doesn't make it any easier to decide on a tool set. NB: I

Re: Is Spark the right tool for me?

2014-12-01 Thread andy petrella
Not quite sure which geo processing you're doing: is it raster or vector? More info would be appreciated so I can help you further. Meanwhile, I can try to give some hints; for instance, did you consider GeoMesa (http://www.geomesa.org/2014/08/05/spark/)? Since you need a WMS (or alike), did you

ensuring RDD indices remain immutable

2014-12-01 Thread rok
I have an RDD that serves as a feature look-up table downstream in my analysis. I create it using zipWithIndex(), and because I suppose that the elements of the RDD could end up in a different order if it is regenerated at any point, I cache it to try to ensure that the (feature -> index)

Re: Is Spark the right tool for me?

2014-12-01 Thread Stadin, Benjamin
Thank you for mentioning GeoTrellis. I haven't heard of this before. We have many custom tools and steps; I'll check how our tools fit in. The end result is actually a 3D map for native OpenGL-based rendering on iOS / Android [1]. I'm using GeoPackage, which is basically SQLite with R-Tree and

Re: Is Spark the right tool for me?

2014-12-01 Thread Stadin, Benjamin
… Sorry, I forgot to mention why I'm basically bound to SQLite. The workflow involves more data processing steps than I mentioned. There are several tools in the chain which either rely on SQLite as an exchange format, or processing steps like data cleaning that are done orders of magnitude faster / or

How take top N of top M from RDD as RDD

2014-12-01 Thread Xuefeng Wu
Hi, I have a problem: it is easy in Scala code, but I cannot take the top N from an RDD as an RDD. Given student scores, take the top 10 ages, and then take the top 10 from each age; the result is 100 records. The Scala code is here, but how can I do it with an RDD, for RDD.take returns an Array,

Re: Is Spark the right tool for me?

2014-12-01 Thread andy petrella
Indeed. However, I guess the important load and stress is in the processing of the 3D data (DEM or alike) into geometries/shades/whatever. Hence you can use Spark (GeoTrellis can be tricky for 3D, poke @lossyrob for more info) to perform these operations, then keep an RDD of only the resulting

Re: ensuring RDD indices remain immutable

2014-12-01 Thread Sean Owen
I think the robust thing to do is sort the RDD, and then zipWithIndex. Even if the RDD is recomputed, the ordering and thus assignment of IDs should be the same. On Mon, Dec 1, 2014 at 2:36 PM, rok rokros...@gmail.com wrote: I have an RDD that serves as a feature look-up table downstream in my
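
A minimal sketch of that suggestion, assuming the RDD elements have an Ordering (e.g. strings) and the RDD is called features:

    // deterministic ordering first, then assign indices; cache the result
    val featureIndex = features
      .sortBy(identity)
      .zipWithIndex()   // RDD[(feature, index)]
      .cache()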

Time based aggregation in Real time Spark Streaming

2014-12-01 Thread pankaj
Hi, my incoming message has a timestamp as one field and I have to perform aggregation over a 3-minute time slice. Message sample (Item ID, Item Type, timeStamp): (1, X, 1-12-2014:12:01), (1, X, 1-12-2014:12:02), (1, X,

Problem creating EC2 cluster using spark-ec2

2014-12-01 Thread Dave Challis
I've been trying to create a Spark cluster on EC2 using the documentation at https://spark.apache.org/docs/latest/ec2-scripts.html (with Spark 1.1.1). Running the script successfully creates some EC2 instances, HDFS etc., but appears to fail to copy the actual files needed to run Spark across. I

Re: ensuring RDD indices remain immutable

2014-12-01 Thread rok
true though I was hoping to avoid having to sort... maybe there's no way around it. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ensuring-RDD-indices-remain-immutable-tp20094p20104.html Sent from the Apache Spark User List mailing list archive at

Re: Spark SQL 1.0.0 - RDD from snappy compress avro file

2014-12-01 Thread cjdc
By the way, the same error from above also happens on 1.1.0 (just tested). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-1-0-0-RDD-from-snappy-compress-avro-file-tp19998p20106.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How take top N of top M from RDD as RDD

2014-12-01 Thread Ritesh Kumar Singh
For converting an Array or any List to an RDD, we can try using: sc.parallelize(groupedScore) // or whatever the name of the list variable is On Mon, Dec 1, 2014 at 8:14 PM, Xuefeng Wu ben...@gmail.com wrote: Hi, I have a problem, it is easy in Scala code, but I can not take the top N

Re: Spark Job submit

2014-12-01 Thread Matt Narrell
Or setting the HADOOP_CONF_DIR property. Either way, you must have the YARN configuration available to the submitting application to allow for the use of “yarn-client” or “yarn-master” The attached stack trace below doesn’t provide any information as to why the job failed. mn On Nov 27,

Re: Time based aggregation in Real time Spark Streaming

2014-12-01 Thread Bahubali Jain
Hi, you can associate all the messages of a 3-minute interval with a unique key, then group by that key and finally add up. Thanks On Dec 1, 2014 9:02 PM, pankaj pankaje...@gmail.com wrote: Hi, My incoming message has time stamp as one field and i have to perform aggregation over 3 minute of time
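
A minimal sketch of that idea (field names, types and millisecond timestamps are assumptions):

    val threeMinutes = 3 * 60 * 1000L
    // key every record by (item id, 3-minute bucket) and add up the counts
    // messages: assumed RDD[(Int, Long)] of (itemId, timestampMillis)
    val counts = messages
      .map { case (itemId, ts) => ((itemId, ts / threeMinutes), 1L) }
      .reduceByKey(_ + _)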

Re: Mllib native netlib-java/OpenBLAS

2014-12-01 Thread agg212
Thanks for your reply, but I'm still running into issues installing/configuring the native libraries for MLlib. Here are the steps I've taken; please let me know if anything is incorrect. - Download the Spark source - Unzip and compile using `mvn -Pnetlib-lgpl -DskipTests clean package` - Run

Re: Time based aggregation in Real time Spark Streaming

2014-12-01 Thread pankaj
Hi, suppose I keep a batch size of 3 minutes. In one batch there can be incoming records with any timestamp, so it is difficult to keep track of when the 3-minute interval starts and ends. I am doing the output operation on the worker nodes in foreachPartition, not in the driver (foreachRDD), so I cannot use

packaging from source gives protobuf compatibility issues.

2014-12-01 Thread akhandeshi
scala> textFile.count() java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CompleteRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet; I tried ./make-distribution.sh -Dhadoop.version=2.5.0 and

Re: How take top N of top M from RDD as RDD

2014-12-01 Thread Debasish Das
rdd.top collects the result on the master... If you want top-k per key, run map/mapPartitions and use a bounded priority queue, then reduceByKey the queues. I experimented with top-k from Algebird and a bounded priority queue wrapped over a Java priority queue (the Spark default)... the bounded priority queue is faster. A code example is here:
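
A sketch of the per-key top-N idea with a bounded queue, assuming an RDD[(Int, Double)] of (age, score) called scores (names are illustrative):

    import scala.collection.mutable

    val n = 10
    val topPerAge = scores
      .aggregateByKey(mutable.PriorityQueue.empty[Double](Ordering[Double].reverse))(
        // keep at most n scores per key; the reversed ordering evicts the smallest
        (q, v) => { q.enqueue(v); if (q.size > n) q.dequeue(); q },
        (q1, q2) => { q2.foreach { v => q1.enqueue(v); if (q1.size > n) q1.dequeue() }; q1 })
      .mapValues(_.toSeq.sorted(Ordering[Double].reverse))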

RE: Unable to compile spark 1.1.0 on windows 8.1

2014-12-01 Thread Judy Nash
Have you checked out the wiki here? http://spark.apache.org/docs/latest/building-with-maven.html A couple of things I did differently from you: 1) I got the bits directly from GitHub (https://github.com/apache/spark/); use branch 1.1 for Spark 1.1. 2) Execute the Maven command in cmd (PowerShell misses

How to Integrate openNLP with Spark

2014-12-01 Thread Nikhil
Hi, I am using OpenNLP NER (Token Name Finder) for parsing unstructured data. In order to speed up my process (to quickly train models and analyze documents with those models), I want to use Spark, and I saw on the web that it is possible to connect OpenNLP with Spark using UIMAFit, but I
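
One pattern often used for this kind of library is to load the model once per partition and apply it to each document (a sketch only, not a confirmed recipe; the model path and the OpenNLP 1.5-style classes used here are assumptions):

    import java.io.FileInputStream
    import opennlp.tools.namefind.{NameFinderME, TokenNameFinderModel}

    // docs: assumed RDD[Array[String]] of pre-tokenized documents
    val names = docs.mapPartitions { iter =>
      // load the NER model once per partition instead of once per record
      val model = new TokenNameFinderModel(new FileInputStream("/models/en-ner-person.bin"))
      val finder = new NameFinderME(model)
      iter.map(tokens => finder.find(tokens))
    }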

Re: Creating a SchemaRDD from an existing API

2014-12-01 Thread Michael Armbrust
No, it should support any data source that has a schema and can produce rows. On Mon, Dec 1, 2014 at 1:34 AM, Niranda Perera nira...@wso2.com wrote: Hi Michael, About this new data source API, what type of data sources would it support? Does it have to be RDBMS necessarily? Cheers On
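
For example, building a SchemaRDD from arbitrary rows in Spark 1.1 looks roughly like this (a sketch with illustrative field names and data):

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)
    // declare the schema, then apply it to an RDD of Row objects
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))
    val rowRDD = sc.parallelize(Seq(Row(1, "alice"), Row(2, "bob")))
    val schemaRDD = sqlContext.applySchema(rowRDD, schema)
    schemaRDD.registerTempTable("people")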

Minimum cluster size for empirical testing

2014-12-01 Thread Valdes, Pablo
Hi everyone, I'm interested in empirically measuring how much faster Spark works in comparison to Hadoop for certain problems and input corpora I currently work with (I've read Matei Zaharia's "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" paper and I

Re: How to use FlumeInputDStream in spark cluster?

2014-12-01 Thread Ping Tang
Thank you very much for your reply. I have a cluster of 8 nodes: m1, m2, m3, ..., m8. m1 is configured as the Spark master node; the rest of the nodes are all worker nodes. I also configured m3 as the History Server, but the history server fails to start. I ran FlumeEventCount on m1 using the right

Remove added jar from spark context

2014-12-01 Thread ankits
Hi, is there a way to REMOVE a jar (added via addJar) from Spark contexts? We have a long-running context used by the Spark jobserver, but after trying to update versions of classes already on the classpath via addJars, the context still runs the old versions. It would be helpful if I could remove

RE: Inaccurate Estimate of weights model from StreamingLinearRegressionWithSGD

2014-12-01 Thread Bui, Tri
Thanks Yanbo! That works! The only issue is that it won't print the predicted value from lp.features, from the code line below: model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print() It prints the test input data correctly, but it keeps on printing "0.0" as the predicted
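
For reference, a typical setup looks roughly like this (a sketch with an assumed feature dimension; with zero initial weights the predictions start at 0.0 and only move away from it as training batches arrive and update the weights):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    val numFeatures = 3   // assumed dimension of lp.features
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.dense(Array.fill(numFeatures)(0.0)))
      .setStepSize(0.1)
    model.trainOn(trainingData)   // trainingData: DStream[LabeledPoint]
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()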

StreamingLinearRegressionWithSGD

2014-12-01 Thread Joanne Contact
Hi Gurus, I did not look at the code yet. I wonder if StreamingLinearRegressionWithSGD http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/regression/StreamingLinearRegressionWithSGD.html is equivalent to LinearRegressionWithSGD

Spark SQL table Join, one task is taking long

2014-12-01 Thread Venkat Subramanian
Environment: Spark 1.1, 4-node Spark and Hadoop dev cluster (6 cores, 32 GB RAM each), default serialization, standalone, no security. Data was sqooped from a relational DB to HDFS and is partitioned across HDFS uniformly. I am reading a fact table about 8 GB in size and one small dim table

RE: hdfs streaming context

2014-12-01 Thread Bui, Tri
Try (hdfs:///localhost:8020/user/data/*) with three slashes. Thx, Tri -----Original Message----- From: Benjamin Cuthbert [mailto:cuthbert@gmail.com] Sent: Monday, December 01, 2014 4:41 PM To: user@spark.apache.org Subject: hdfs streaming context All, Is it possible to stream on an HDFS directory

Re: hdfs streaming context

2014-12-01 Thread Andy Twigg
Have you tried just passing a path to ssc.textFileStream()? It monitors the path for new files by looking at mtime/atime; all new/touched files in the time window appear as an RDD in the DStream. On 1 December 2014 at 14:41, Benjamin Cuthbert cuthbert@gmail.com wrote: All, Is it possible

Re: hdfs streaming context

2014-12-01 Thread Sean Owen
Yes, in fact, that's the only way it works. You need hdfs://localhost:8020/user/data, I believe. (No it's not correct to write hdfs:///...) On Mon, Dec 1, 2014 at 10:41 PM, Benjamin Cuthbert cuthbert@gmail.com wrote: All, Is it possible to stream on HDFS directory and listen for multiple

Re: hdfs streaming context

2014-12-01 Thread Benjamin Cuthbert
Thanks Sean, that worked; just removing the /* and leaving it as /user/data seems to be streaming in. On 1 Dec 2014, at 22:50, Sean Owen so...@cloudera.com wrote: Yes, in fact, that's the only way it works. You need hdfs://localhost:8020/user/data, I believe. (No it's not correct to
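
A minimal sketch of the working form, assuming an existing StreamingContext ssc (host and port are illustrative):

    // watch the directory itself; no glob pattern on the end
    val lines = ssc.textFileStream("hdfs://localhost:8020/user/data")
    lines.print()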

RE: hdfs streaming context

2014-12-01 Thread Bui, Tri
For the streaming example I am working on, it's accepted (hdfs:///user/data) without the localhost info. Let me dig through my HDFS config. -----Original Message----- From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, December 01, 2014 4:50 PM To: Benjamin Cuthbert Cc:

Re: hdfs streaming context

2014-12-01 Thread Sean Owen
Yes but you can't follow three slashes with host:port. No host probably defaults to whatever is found in your HDFS config. On Mon, Dec 1, 2014 at 11:02 PM, Bui, Tri tri@verizonwireless.com wrote: For the streaming example I am working on, Its accepted (hdfs:///user/data) without the

RE: hdfs streaming context

2014-12-01 Thread Bui, Tri
Yep, no localhost. Usually, I use hdfs:///user/data to indicate I want HDFS, or file:///user/data to indicate a local file directory. -----Original Message----- From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, December 01, 2014 5:06 PM To: Bui, Tri Cc: Benjamin Cuthbert;

Re: How take top N of top M from RDD as RDD

2014-12-01 Thread Xuefeng Wu
Hi Debasish, I found this test code in the map transform; would it collect all products too? + val sortedProducts = products.toArray.sorted(ord.reverse) Yours respectfully, Xuefeng Wu 吴雪峰 On 2014-12-02, at 1:33 AM, Debasish Das debasish.da...@gmail.com wrote: rdd.top collects it on master... If you want

Passing Java Options to Spark AM launching

2014-12-01 Thread Mohammad Islam
Hi, how do I pass Java options (such as -XX:MaxMetaspaceSize=100M) when launching the AM or task containers? This is related to running Spark on YARN (Hadoop 2.3.0). In the MapReduce case, setting a property such as mapreduce.map.java.opts would do the job. Any help would be highly appreciated.

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
"file => transform file into a bunch of records" -- what does this function do exactly? Does it load the file locally? Spark supports RDDs exceeding global RAM (cf. the terasort example), but if your example just loads each file locally, then this may cause problems. Instead, you should load each file

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
Actually, I'm working with a binary format. The API allows reading out a single record at a time, but I'm not sure how to get those records into Spark (without reading everything into memory from a single file at once). On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg andy.tw...@gmail.com wrote:

Re: Passing Java Options to Spark AM launching

2014-12-01 Thread Tobias Pfeiffer
Hi, have a look at the documentation for spark.driver.extraJavaOptions (which seems to have disappeared since I looked it up last week) and spark.executor.extraJavaOptions at http://spark.apache.org/docs/latest/configuration.html#runtime-environment. Tobias
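
A sketch of how those two properties are typically set (the option value is simply the one from the original question):

    import org.apache.spark.SparkConf

    // executor JVM options can go on the conf; driver options in cluster mode
    // generally need to be passed at submit time (e.g. --driver-java-options)
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:MaxMetaspaceSize=100M")
      .set("spark.driver.extraJavaOptions", "-XX:MaxMetaspaceSize=100M")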

Re: Passing Java Options to Spark AM launching

2014-12-01 Thread Mohammad Islam
Thanks Tobias for the answer. Does it work for the driver as well? Regards, Mohammad On Monday, December 1, 2014 5:30 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, have a look at the documentation for spark.driver.extraJavaOptions (which seems to have disappeared since I looked it up

Re: Passing Java Options to Spark AM launching

2014-12-01 Thread Zhan Zhang
Please check whether https://github.com/apache/spark/pull/3409#issuecomment-64045677 solves the problem for launching the AM. Thanks. Zhan Zhang On Dec 1, 2014, at 4:49 PM, Mohammad Islam misla...@yahoo.com.INVALID wrote: Hi, How to pass the Java options (such as -XX:MaxMetaspaceSize=100M) when

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
Could you modify your function so that it streams through the files record by record and outputs them to HDFS, then read them all in as RDDs and take the union? That would only use bounded memory. On 1 December 2014 at 17:19, Keith Simmons ke...@pulse.io wrote: Actually, I'm working with a
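
One rough alternative that keeps memory bounded without the intermediate HDFS step is to parallelize the file paths and read each file's records inside a task (a sketch; readRecords is a hypothetical helper over the binary format's record-at-a-time API, and the paths are illustrative):

    // hypothetical stub: replace with the binary format's record reader
    def readRecords(path: String): Iterator[Array[Byte]] = Iterator.empty

    val paths = Seq("hdfs:///data/part-00000", "hdfs:///data/part-00001")
    // one task per file; each task streams its file record by record
    val records = sc.parallelize(paths, paths.size)
      .flatMap(path => readRecords(path))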

numpy arrays and spark sql

2014-12-01 Thread Joseph Winston
This works as expected in the 1.1 branch: from pyspark.sql import * rdd = sc.parallelize([range(0, 10), range(10, 20), range(20, 30)]) # define the schema schemaString = "value1 value2 value3 value4 value5 value6 value7 value8 value9 value10" fields = [StructField(field_name, IntegerType(), True)

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
Yep, that's definitely possible. It's one of the workarounds I was considering. I was just curious if there was a simpler (and perhaps more efficient) approach. Keith On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg andy.tw...@gmail.com wrote: Could you modify your function so that it streams

Re: ALS failure with size Integer.MAX_VALUE

2014-12-01 Thread Bharath Ravi Kumar
Yes, the issue appears to be due to the 2GB block size limitation. I am hence looking for (user, product) block sizing suggestions to work around the block size limitation. On Sun, Nov 30, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote: (It won't be that, since you see that the error occur
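
For what it's worth, the number of user/product blocks can be raised on the ALS builder so that each serialized block stays well under 2 GB (a sketch; the block count and other parameters are only illustrative, and ratings is an assumed RDD[Rating]):

    import org.apache.spark.mllib.recommendation.ALS

    val model = new ALS()
      .setRank(20)
      .setIterations(10)
      .setBlocks(200)   // more, smaller (user, product) blocks
      .run(ratings)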

Re: Calling spark from a java web application.

2014-12-01 Thread ryaminal
If you are able to use YARN in your hadoop cluster, then the following technique is pretty straightforward: http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/ We use this in our system and it's super easy to execute from our Tomcat application. -- View this message in context:

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Andy Twigg
You may be able to construct RDDs directly from an iterator (not sure; you may have to subclass your own). On 1 December 2014 at 18:40, Keith Simmons ke...@pulse.io wrote: Yep, that's definitely possible. It's one of the workarounds I was considering. I was just curious if there was a

Re: numpy arrays and spark sql

2014-12-01 Thread Davies Liu
applySchema() only accepts an RDD of Row/list/tuple; it does not work with numpy.array. After applySchema(), the Python RDD will be pickled and unpickled in the JVM, so you will not get any benefit from using numpy.array. It will work if you convert the ndarray into a list: schemaRDD =

java.io.IOException: Filesystem closed

2014-12-01 Thread rapelly kartheek
Hi, I face the following exception when I submit a Spark application. The log file shows: 14/12/02 11:52:58 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:689) at

Re: java.io.IOException: Filesystem closed

2014-12-01 Thread Akhil Das
What is the application that you are submitting? It looks like you might have obtained the FileSystem inside the app and then closed it there. Thanks Best Regards On Tue, Dec 2, 2014 at 11:59 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I face the following exception when submit a spark