Re: CUDA in spark, especially in MLlib?

2014-08-26 Thread Antonio Jesus Navarro
Maybe this would interest you: CPU and GPU-accelerated Machine Learning Library: https://github.com/BIDData/BIDMach 2014-08-27 4:08 GMT+02:00 Matei Zaharia : > You should try to find a Java-based library, then you can call it from > Scala. > > Matei > > On August 26, 2014 at 6:58:11 PM, Wei Ta

RE: What is a Block Manager?

2014-08-26 Thread Liu, Raymond
The framework has this information in order to manage cluster status, and it (e.g. the worker count) is also available through the Spark metrics system. From the user application's point of view, though, can you give an example of why you need this information and what you plan to do with it? Best Regards, Raymond

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Victor Tso-Guillen
Ah, thanks. On Tue, Aug 26, 2014 at 7:32 PM, Nan Zhu wrote: > Hi, Victor, > > the issue with having different versions in the driver and the cluster is that > the master will shut down your application due to an inconsistent > serialVersionUID in ExecutorState > > Best, > > -- > Nan Zhu > > On Tu

Re: Spark on Hadoop with Java 8

2014-08-26 Thread Matei Zaharia
You could always run HBase on a different version of Java if it has a problem on 8. We only talk to it through RPCs so I'm fairly sure the HBase client on Java 8 will talk to an HBase server on Java 7 just fine. Matei On August 26, 2014 at 11:06:59 PM, jatinpreet (jatinpr...@gmail.com) wrote:

Spark on Hadoop with Java 8

2014-08-26 Thread jatinpreet
Hi, I am contemplating the use of Hadoop with Java 8 in a production system. I will be using Apache Spark for doing most of the computations on data stored in HBase. Although Hadoop seems to support JDK 8 with some tweaks, the official HBase site states the following for version 0.98, Running wi

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Dibyendu Bhattacharya
Thanks Chris and Bharat for your inputs. I agree, running multiple receivers/DStreams is desirable for scalability and fault tolerance, and this is easily doable. In the present KafkaReceiver I am creating one thread per Kafka topic partition, but I can definitely create multiple KafkaReceive

Re: What is a Block Manager?

2014-08-26 Thread Victor Tso-Guillen
We're a single-app deployment, so we want to launch as many executors as the system has workers. We accomplish this by not configuring a max for the application. However, is there really no way to inspect which machines, executor IDs, number of workers, etc. are available in the context? I'd imagine that th
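
For what it's worth, a sketch of one way to peek at this from the driver (a developer-level API, so treat it as an assumption to verify against your Spark version):

    // one entry per block manager known to the driver, keyed by "host:port"
    val executorMemory = sc.getExecutorMemoryStatus   // Map[String, (maxMem, remainingMem)]
    println(s"${executorMemory.size} block managers: ${executorMemory.keys.mkString(", ")}")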

Re: Spark Streaming Output to DB

2014-08-26 Thread Mayur Rustagi
I would suggest you use a JDBC connection in mapPartitions instead of map, as JDBC connections are costly and can really impact your performance. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue, Aug 26, 2014 at 6:45 PM,
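
For illustration, a minimal sketch of the per-partition JDBC pattern (the wordCounts DStream, JDBC URL, credentials, and table schema are assumptions, not from this thread):

    import java.sql.DriverManager

    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // one connection per partition instead of one per record
        val conn = DriverManager.getConnection(jdbcUrl, dbUser, dbPassword)
        val stmt = conn.prepareStatement("INSERT INTO word_counts (word, count) VALUES (?, ?)")
        try {
          records.foreach { case (word, count) =>
            stmt.setString(1, word)
            stmt.setLong(2, count)
            stmt.executeUpdate()
          }
        } finally {
          stmt.close()
          conn.close()
        }
      }
    }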

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Chris Fregly
great work, Dibyendu. looks like this would be a popular contribution. expanding on bharat's question a bit: what happens if you submit multiple receivers to the cluster by creating and unioning multiple DStreams as in the kinesis example here: https://github.com/apache/spark/blob/ae58aea2d1435
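
For illustration, a sketch of the create-and-union pattern being asked about, using the Kafka receiver (numReceivers, zkQuorum, consumerGroup, and topicMap are placeholders):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, consumerGroup, topicMap)
    }
    val unifiedStream = ssc.union(kafkaStreams)   // one DStream backed by several receivers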

Execute HiveFromSpark ERROR.

2014-08-26 Thread CharlieLin
hi, all: I tried to use Spark SQL in spark-shell, following the Spark example. When I execute: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc); import hiveContext._; hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") it reports an error like below: scala> hiveContext.hql("CRE

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Nan Zhu
Hi, Victor, the issue with having different versions in the driver and the cluster is that the master will shut down your application due to an inconsistent serialVersionUID in ExecutorState Best, -- Nan Zhu On Tuesday, August 26, 2014 at 10:10 PM, Matei Zaharia wrote: > Things will defini

Ask for help, how to integrate Sparkstreaming and IBM MQ

2014-08-26 Thread 35597...@qq.com
Hi, I am now working on a project with the scenario below. We will use Spark Streaming to receive data from IBM MQ. I checked the Streaming API documentation; it only supports ZeroMQ, Kafka, etc. I have some questions: 1. we can use the MQTT protocol to get data in this scenario, right? any other
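
For reference, Spark Streaming does ship an MQTT input stream in the external spark-streaming-mqtt module; a minimal sketch, assuming the MQ broker exposes an MQTT endpoint (the broker URL and topic are placeholders):

    import org.apache.spark.streaming.mqtt.MQTTUtils

    val lines = MQTTUtils.createStream(ssc, "tcp://mq-broker-host:1883", "some/topic")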

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Matei Zaharia
Things will definitely compile, and apps compiled on 1.0.0 should even be able to link against 1.0.2 without recompiling. The only problem is if you run your driver with 1.0.0 on its classpath, but the cluster has 1.0.2 in executors. For Mesos and YARN vs standalone, the difference is that they

Re: CUDA in spark, especially in MLlib?

2014-08-26 Thread Matei Zaharia
You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote: Hi I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no active

RE: What is a Block Manager?

2014-08-26 Thread Liu, Raymond
Basically, a Block Manager manages the storage for most of the data in Spark, to name a few: blocks that represent cached RDD partitions, intermediate shuffle data, broadcast data, etc. It is per executor, and in standalone mode you normally have one executor per worker. You don't control how m

CUDA in spark, especially in MLlib?

2014-08-26 Thread Wei Tan
Hi I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no active projects on Scala+CUDA. Python is supported by CUDA though. Any suggestion on whether this idea makes any sense? Best regards, Wei

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Victor Tso-Guillen
Yes, we are standalone right now. Do you have literature on why one would want to consider Mesos or YARN for Spark deployments? Sounds like I should try upgrading my project and seeing if everything compiles without modification. Then I can connect to an existing 1.0.0 cluster and see what happe

RE: spark.default.parallelism bug?

2014-08-26 Thread Liu, Raymond
Hi Grzegorz From my understanding, for the cogroup operation (which is used by intersection), if spark.default.parallelism is not set by the user, it won't bother to use the default value; it will use the partition number (the max among all the RDDs in the cogroup operation) to build up a parti
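
For illustration, a small sketch of the behaviour described above (the partition counts are arbitrary examples; with spark.default.parallelism unset, the cogroup underneath intersection takes the larger parent's partition count):

    val a = sc.parallelize(1 to 100, 4)
    val b = sc.parallelize(50 to 150, 7)
    a.intersection(b).partitions.length      // 7: the max of the parents' partition counts
    a.intersection(b, 20).partitions.length  // 20: an explicit partition count overrides that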

Re: Parsing Json object definition spanning multiple lines

2014-08-26 Thread Matei Zaharia
You can use sc.wholeTextFiles to read each file as a complete String, though it requires each file to be small enough for one task to process. On August 26, 2014 at 4:01:45 PM, Chris Fregly (ch...@fregly.com) wrote: i've seen this done using mapPartitions() where each partition represents a sin
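
A minimal sketch of that approach (the path is a placeholder, and scala.util.parsing.json is used only for illustration; any parser that takes the whole document as a String works):

    import scala.util.parsing.json.JSON

    val parsed = sc.wholeTextFiles("hdfs:///data/multiline-json/")   // (filename, fullContent) pairs
      .map { case (path, content) => (path, JSON.parseFull(content)) }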

Re: Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Matei Zaharia
Is this a standalone mode cluster? We don't currently make this guarantee, though it will likely work in 1.0.0 to 1.0.2. The problem though is that the standalone mode grabs the executors' version of Spark code from what's installed on the cluster, while your driver might be built against anothe

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-26 Thread Ankur Dave
At 2014-08-26 01:20:09 -0700, BertrandR wrote: > I actually tried without unpersisting, but given the performance I tried to > add these in order to free the memory. After your answer I tried to remove > them again, but without any change in the execution time... This is probably a related issue

Upgrading 1.0.0 to 1.0.2

2014-08-26 Thread Victor Tso-Guillen
I wanted to make sure that there's full compatibility between minor releases. I have a project that has a dependency on spark-core so that it can be a driver program and that I can test locally. However, when connecting to a cluster you don't necessarily know what version you're connecting to. Is a

Specifying classpath

2014-08-26 Thread Ashish Jain
Hello, I'm using the following version of Spark - 1.0.0+cdh5.1.0+41 (1.cdh5.1.0.p0.27). I've tried to specify the libraries Spark uses in the following ways - 1) Adding them to the Spark context 2) Specifying the jar path in a) spark.executor.extraClassPath b) spark.executor.extraLibraryPath 3)
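
For reference, a minimal sketch of option 2 set programmatically (the jar and library paths are placeholders; the files must already exist at those paths on each worker node):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.executor.extraClassPath", "/opt/libs/my-dependency.jar")   // placeholder path
      .set("spark.executor.extraLibraryPath", "/opt/libs/native")            // placeholder path
    val sc = new SparkContext(conf)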

Re: Spark-Streaming collect/take functionality.

2014-08-26 Thread Chris Fregly
good suggestion, td. and i believe the optimization that jon.burns is referring to - from the big data mini course - is a step earlier: the sorting mechanism that produces sortedCounts. you can use mapPartitions() to get a top k locally on each partition, then shuffle only (k * # of partitions)
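
A sketch of that local-top-k idea (the counts RDD of (word, count) pairs and the value of k are assumptions for illustration):

    val k = 10
    val topK = counts                                                          // assumed RDD[(String, Long)]
      .mapPartitions { iter => iter.toList.sortBy(-_._2).take(k).iterator }    // local top k per partition
      .collect()                                                               // only k * numPartitions records reach the driver
      .sortBy(-_._2)
      .take(k)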

Re: Parsing Json object definition spanning multiple lines

2014-08-26 Thread Chris Fregly
I've seen this done using mapPartitions(), where each partition represents a single, multi-line JSON file. You can rip through each partition (JSON file) and parse the JSON doc as a whole. This assumes you use sc.textFile("/*.json") or equivalent to load in multiple files at once. Each JSON file

Re: countByWindow save the count ?

2014-08-26 Thread Josh J
Thanks. I'm just confused about the syntax; I'm not sure which variables hold the value of the count, or where it is stored, so that I can save it. Any examples or tips? On Mon, Aug 25, 2014 at 9:49 PM, Daniil Osipov wrote: > You could try to use foreachRDD on the result of countByWindow with a > function
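
For illustration, a minimal sketch of the foreachRDD approach mentioned in the quoted reply (the input stream, window and slide durations, and what you do with the count are assumptions; countByWindow also needs a checkpoint directory set):

    import org.apache.spark.streaming.Seconds

    ssc.checkpoint("hdfs:///checkpoints")                        // placeholder path
    val counts = stream.countByWindow(Seconds(30), Seconds(10))
    counts.foreachRDD { (rdd, time) =>
      // each windowed RDD holds a single Long: the count for that window
      val count = rdd.take(1).headOption.getOrElse(0L)
      println(s"count for window ending at $time: $count")       // or save it to a file / DB here
    }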

Re: Does HiveContext support Parquet?

2014-08-26 Thread lyc
Hi Silvio, I re-downloaded hive-0.12-bin and reset the related path in spark-env.sh. However, I still got some errors. Do you happen to know which step I did wrong? Thank you! My detailed steps are as follows: #enter spark-shell (successful) /bin/spark-shell --master spark://S4:7077 --jars /home/hdus

Re: Out of memory on large RDDs

2014-08-26 Thread Andrew Ash
Hi Grega, Did you ever get this figured out? I'm observing the same issue in Spark 1.0.2. For me it was after 1.5hr of a large .distinct call, followed by a .saveAsTextFile() 14/08/26 20:57:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18500 14/08/26 20:57:43 INFO executor.

Re: saveAsTextFile hangs with hdfs

2014-08-26 Thread Burak Yavuz
Hi David, Your job is probably hanging on the groupByKey process. Probably GC is kicking in and the process starts to hang or the data is unbalanced and you end up with stragglers (Once GC kicks in you'll start to get the connection errors you shared). If you don't care about the list of value

Re: OutofMemoryError when generating output

2014-08-26 Thread Burak Yavuz
Hi, The error doesn't occur during saveAsTextFile but rather during the groupByKey, as far as I can tell. We strongly urge users not to use groupByKey if they don't have to. What I would suggest is the following work-around: sc.textFile(baseFile).map { line => val fields = line.split("\t") (
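
For illustration, a sketch of what such a work-around typically looks like with reduceByKey instead of groupByKey (the key/value columns and output path are assumptions, since the original snippet is truncated):

    val counts = sc.textFile(baseFile)
      .map { line =>
        val fields = line.split("\t")
        (fields(0), fields(1).toLong)   // assumed: key in column 0, numeric value in column 1
      }
      .reduceByKey(_ + _)               // combines map-side, so full per-key value lists are never materialized
    counts.saveAsTextFile(outputPath)   // outputPath is a placeholder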

Re: disable log4j for spark-shell

2014-08-26 Thread Aaron
If someone doesn't have the access to do that, is there any easy way to specify a different properties file to be used? Patrick Wendell wrote > If you want to customize the logging behavior - the simplest way is to > copy > conf/log4j.properties.template to conf/log4j.properties. Then you can go > and
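
One commonly used approach, sketched here under the assumption that you can at least set Spark configuration properties, is to point log4j at your own file through the executor JVM options (the path is a placeholder; for the driver, the same -D option has to be supplied when its JVM is launched, e.g. via spark-submit, not after the SparkContext exists):

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-Dlog4j.configuration=file:/path/to/custom-log4j.properties")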

Spark 1.1 doesn't work with hive context

2014-08-26 Thread S Malligarjunan
Hello all, I have just checked out branch-1.1 and executed the commands below: ./bin/spark-shell --driver-memory 1G val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT,

OutofMemoryError when generating output

2014-08-26 Thread SK
Hi, I have the following piece of code that I am running on a cluster with 10 nodes with 2GB memory per node. The tasks seem to complete, but at the point where it is generating output (saveAsTextFile), the program freezes after some time and reports an out of memory error (error transcript attach

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Nicholas Chammas
Confirmed. Works now. Thanks Matei. (BTW, on OS X Command + Shift + R also refreshes the page without cache.) On Tue, Aug 26, 2014 at 3:06 PM, Matei Zaharia wrote: > It should be fixed now. Maybe you have a cached version of the page in > your browser. Open DevTools (cmd-shift-I), press the ge

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Matei Zaharia
It should be fixed now. Maybe you have a cached version of the page in your browser. Open DevTools (cmd-shift-I), press the gear icon, and check "disable cache while devtools open", then refresh the page to refresh without cache. Matei On August 26, 2014 at 7:31:18 AM, Nicholas Chammas (nichola

Kinesis receiver & spark streaming partition

2014-08-26 Thread Wei Liu
We are exploring using Kinesis and Spark Streaming together. I took a look at the Kinesis receiver code in 1.1.0. I have a question regarding Kinesis partitions & Spark Streaming partitions. It seems to be pretty difficult to align these partitions. Kinesis partitions a stream of data into shards

Re: Only master is really busy at KMeans training

2014-08-26 Thread durin
Right now, I have issues even at a far earlier point. I'm fetching data from a registered table via var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 2000").map(_.head.toString).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) //persisted because it's used again lat

Re: unable to instantiate HiveMetaStoreClient on LocalHiveContext

2014-08-26 Thread Du Li
Yin, Thanks for your response. It turned out that the problem was caused transitively by the version of Guava. The version I used was 17.0. Switching it back to 14.0.1 fixed the problem. The reason is that some method was public in 14.0.1 but no longer public after 15.0. Hence an IllegalAccess

Re: Only master is really busy at KMeans training

2014-08-26 Thread Xiangrui Meng
How many partitions now? Btw, which Spark version are you using? I checked your code and I don't understand why you want to broadcast vectors2, which is an RDD. var vectors2 = vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) var broadcastVector = sc.bro
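
For reference, a minimal sketch of the usual pattern: collect the (assumed small) data to the driver first and broadcast the local collection, since broadcasting the RDD handle itself does not ship its data to the workers:

    val localVectors = vectors2.collect()            // must fit in driver memory
    val broadcastVectors = sc.broadcast(localVectors)
    // worker-side closures then read broadcastVectors.value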

Re: HiveContext ouput log file

2014-08-26 Thread S Malligarjunan
Hello Michael, I get the following error if I execute the query with the collect method: Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found. at org.apache.hadoop.io.compress.CompressionCod   Thanks and Regards, Sankar S.   On Tuesday, 26 Au

Re: SPARK Hive Context UDF Class Not Found Exception,

2014-08-26 Thread S Malligarjunan
Hello Michael, I have executed git pull now. As per the pom version entry, it is 1.1.0-SNAPSHOT.   Thanks and Regards, Sankar S.   On Tuesday, 26 August 2014, 1:00, Michael Armbrust wrote: Which version of Spark SQL are you using?  Several issues with custom hive UDFs have been fixed in 1.1.

Spark Streaming - Small file in HDFS

2014-08-26 Thread Ravi Sharma
Hi People, I'm using Java, Kafka and Spark Streaming, and saving the result files into HDFS. As per my understanding, Spark Streaming writes every processed message or event to an HDFS file. The reason for creating one file per message or event could be to ensure fault tolerance. Is there any way Spark handles th

Submit to the "Powered By Spark" Page!

2014-08-26 Thread Patrick Wendell
Hi All, I want to invite users to submit to the Spark "Powered By" page. This page is a great way for people to learn about Spark use cases. Since Spark activity has increased a lot in the higher level libraries and people often ask who uses each one, we'll include information about which componen

Re: unable to instantiate HiveMetaStoreClient on LocalHiveContext

2014-08-26 Thread Yin Huai
Hello Du, Can you check if there is a dir "metastore" in the place where you launch your program? If so, can you delete it and try again? Also, can you try HiveContext? LocalHiveContext is deprecated. Thanks, Yin On Mon, Aug 25, 2014 at 6:33 PM, Du Li wrote: > Hi, > > I created an instance o

Re: Spark webUI - application details page

2014-08-26 Thread SK
I have already tried setting the history server and accessing it on :18080 as per the link. But the page does not list any completed applications. As I mentioned in my previous mail, I am running Spark in standalone mode on the cluster (as well as on my local machine). According to the link, it ap

What is a Block Manager?

2014-08-26 Thread Victor Tso-Guillen
I'm curious not only about what they do, but what their relationship is to the rest of the system. I find that I get listener events for n block managers added where n is also the number of workers I have available to the application. Is this a stable constant? Also, are there ways to determine at

Re: Spark SQL Parser error

2014-08-26 Thread Yin Huai
I have not tried it, but I guess you need to add your credentials in the S3 path. Or, can you copy the jar to your driver node and try again? On Sun, Aug 24, 2014 at 9:35 AM, S Malligarjunan wrote: > Hello Yin, > > Additional note: > In ./bin/spark-shell --jars "s3n:/mybucket/myudf.jar" I got

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Nicholas Chammas
On Tue, Aug 26, 2014 at 10:28 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Maybe it's a Chrome add-on I'm running? Hmm, scratch that. Trying in incognito mode (which disables add-ons, I believe) also yields the same behavior. Nick

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Dibyendu Bhattacharya
Hi Bharat, Thanks for your email. If the "Kafka Reader" worker process dies, it will be replaced on a different machine, and it will start consuming from the offset where it left off (for each partition). The same would happen even if I had an individual Receiver for every partition. Regard

Re: Spark Screencast doesn't show in Chrome on OS X

2014-08-26 Thread Nicholas Chammas
For the record, I'm using Chrome 36.0.1985.143 on 10.9.4 as well. Maybe it's a Chrome add-on I'm running? Anyway, as Matei pointed out, if I change the https to http, it works fine. On Tue, Aug 26, 2014 at 1:46 AM, Michael Hausenblas < michael.hausenb...@gmail.com> wrote: > > > https://spark.ap

Re: Low Level Kafka Consumer for Spark

2014-08-26 Thread Dibyendu Bhattacharya
Hi, As I understand it, your problem is similar to this JIRA: https://issues.apache.org/jira/browse/SPARK-1647 The issue in this case is that Kafka cannot replay the message, as the offsets are already committed. I also think the existing KafkaUtils (the default high-level Kafka consumer) has this issue.

Re: Spark Streaming Output to DB

2014-08-26 Thread Akhil Das
Yes, you can open a JDBC connection at the beginning of the map method, close it at the end of map(), and use it in between. Thanks Best Regards On Tue, Aug 26, 2014 at 6:12 PM, Ravi Sharma wrote: > Hello People, >> >> I'm using java spark streaming. I'm ju

Prevent too many partitions

2014-08-26 Thread Grzegorz Białek
Hi, I have in my application many union operations. But union increases number of partitions of following RDDs. And performance on more partitions sometimes is very slow. Is there any cleaner way to prevent increasing number of partitions than adding coalesce(numPartitions) after each union? Than

Re: Spark Streaming Output to DB

2014-08-26 Thread Ravi Sharma
> > Hello People, > > I'm using java spark streaming. I'm just wondering, Can I make simple jdbc > connection in JavaDStream map() method? > > Or > > Do I need to create jdbc connection for each JavaPairDStream, after map > task? > > Kindly give your thoughts. > > > Cheers, > Ravi Sharma >

CoGroupedDStream similar to CoGroupedRDD

2014-08-26 Thread Akhil Das
Hi, I have an ArrayList of RDDs (RDD>) and I'm able to create a CoGroupedRDD like: CoGroupedRDD coGroupedRDD = new CoGroupedRDD( > (Seq ?>>>)(Object)(JavaConversions.asScalaBuffer(rddPairs).toSeq()), > new HashPartitioner(parallelism)); Here rddPairs is my

spark.default.parallelism bug?

2014-08-26 Thread Grzegorz Białek
Hi, consider the following code:

    import org.apache.spark.{SparkContext, SparkConf}

    object ParallelismBug extends App {
      var sConf = new SparkConf()
        .setMaster("spark://hostName:7077") // .setMaster("local[4]")
        .set("spark.default.parallelism", "7") // or without it
      val sc = new SparkCo

Re: Storage Handlers in Spark SQL

2014-08-26 Thread chutium
It seems he means to query an RDBMS or Cassandra using Spark SQL, i.e. multiple data sources for Spark SQL. I looked through the link he posted https://docs.wso2.com/display/BAM241/Creating+Hive+Queries+to+Analyze+Data#CreatingHiveQueriestoAnalyzeData-CreatingHivetablesforvariousdatasources using their stor

Re: Key-Value in PairRDD

2014-08-26 Thread Sean Owen
I'd suggest first reading the scaladoc for RDD and PairRDDFunctions to familiarize yourself with all the operations available: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunc

Key-Value in PairRDD

2014-08-26 Thread Deep Pradhan
I have the following code

    val nodes = lines.map(s => {
      val fields = s.split("\\s+")
      (fields(0), fields(1))
    }).distinct().groupByKey().cache()

and when I print out the nodes RDD I get the following

    (4,ArrayBuffer(1))
    (2,ArrayBuffer(1))
    (3,ArrayBuffer(1))
    (1,ArrayBuffer(3, 2,

Re: Printing the RDDs in SparkPageRank

2014-08-26 Thread Deep Pradhan
println(parts(0)) does not solve the problem. It does not work On Mon, Aug 25, 2014 at 1:30 PM, Sean Owen wrote: > On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan > wrote: > > When I add > > > > parts(0).collect().foreach(println) > > > > parts(1).collect().foreach(println), for printing parts,

Re: Losing Executors on cluster with RDDs of 100GB

2014-08-26 Thread MEETHU MATHEW
Hi, Please give it a try by changing the worker memory such that worker memory > executor memory.   Thanks & Regards, Meethu M On Friday, 22 August 2014 5:18 PM, Yadid Ayzenberg wrote: Hi all, I have a spark cluster of 30 machines, 16GB / 8 cores on each running in standalone mode. Previously my

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-26 Thread BertrandR
I actually tried without unpersisting, but given the performance I tried to add these in order to free the memory. After your answer I tried to remove them again, but without any change in the execution time... Looking at the web interface, I can see that the "mapPartitions at GraphImpl.scala:184"

Re: Trying to run SparkSQL over Spark Streaming

2014-08-26 Thread praveshjain1991
Thanks for the reply. Yes, it doesn't seem doable straight away. Someone suggested this: "For each of your streams, first create an empty RDD that you register as a table, obtaining an empty table. For your example, let's say you call it "allTeenagers". Then, for each of your queries, use SchemaRDD'

Re: Pair RDD

2014-08-26 Thread Yanbo Liang
val node = textFile.map(line => { val fields = line.split("\\s+") (fields(1), fields(2)) }) then you can manipulate the node RDD with PairRDD functions. 2014-08-26 12:55 GMT+08:00 Deep Pradhan : > Hi, > I have an input file of a graph in the format > When I use sc.textFile, it will c

Spark SQL insertInto

2014-08-26 Thread praveshjain1991
I'm using SparkSql for querying. I'm trying something like:

    val sqc = new SQLContext(sc)
    import sqc.createSchemaRDD
    var p1 = Person("Hari",22)
    val rdd1 = sc.parallelize(Array(p1))
    rdd1.registerAsTable("data")
    var p2 = Person("sagar", 22)
    var rdd2 = sc.parallelize(Array(p2))
    rdd2.insertInto(

Re: Running Wordcount on large file stucks and throws OOM exception

2014-08-26 Thread motte1988
Hello, it's me again. Now I've got an explanation for the behaviour. It seems that the driver memory is not large enough to hold the whole result set of saveAsTextFile in memory, and then the OOM occurs. I tested it with a filter step that removes KV pairs with a word count smaller than 100,000. So now the job

My Post Related Query

2014-08-26 Thread Sandeep Vaid
I wrote a post on this forum but it shows the message "This post has NOT been accepted by the mailing list yet." above my post. How long will it take to get it posted? Regards, Sandeep Vaid +91 - 09881710301

Re: Request for Help

2014-08-26 Thread Akhil Das
Hi, Not sure this is the right way of doing it, but if you can create a PairRDDFunctions from that RDD then you can use the following piece of code to access the filenames from the RDD. PairRDDFunctions ds = .; //getting the name and