Re: Measuring Performance in Spark

2014-10-28 Thread Akhil Das
One approach would be to write pure MapReduce and Spark jobs (e.g. word count, filter, join, groupBy, etc.) and benchmark them. Another would be to pick something that runs on top of MapReduce/Spark and benchmark on it (e.g. benchmark Hive against Spark SQL). Thanks Best Regards On Mon, Oct

Re: Spark Worker node accessing Hive metastore

2014-10-28 Thread Akhil
Hi Ken, AFAIK, you can specify the following in the spark-env.sh file across the cluster: export HIVE_HOME=/path/to/hive/ export HIVE_CONF_DIR=/path/to/hive/conf/ And it is not necessary for the worker node to have access to Hive's metastore dir. -- View this message in

Re: Support Hive 0.13 .1 in Spark SQL

2014-10-28 Thread Patrick Wendell
Hey Cheng, Right now we aren't using stable APIs to communicate with the Hive Metastore. We didn't want to drop support for Hive 0.12, so right now we are using a shim layer to support compiling for 0.12 and 0.13. This is very costly to maintain. If Hive has a stable metadata API for talking to

Re: Spark Streaming Applications

2014-10-28 Thread Akhil
You can check this project out: https://github.com/sigmoidanalytics/spork-streaming/ (it is a bit outdated, but works). It is basically the integration of Pig on Spark Streaming. You can write Pig scripts and they are executed underneath as Spark Streaming jobs. To get you started quickly, have a

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-28 Thread Sean Owen
It might kind of work, but you are effectively making all of your workers into mini, separate Spark drivers in their own right. This might cause snags down the line as this isn't the normal thing to do. On Tue, Oct 28, 2014 at 12:11 AM, Localhost shell universal.localh...@gmail.com wrote: Hey

Re: Is Spark the right tool?

2014-10-28 Thread Akhil
You can use Spark Streaming to get the transactions from those TCP connections periodically, and you can push the data into HBase accordingly. Now, regarding the querying part, you can use a database like Redis, which does the key-value storage for you. You can use the RDDs to query

Why RDD is not cached?

2014-10-28 Thread shahab
Hi, I have a standalone Spark cluster where each executor is set to have 6.3 GB of memory; as I am using two workers, there are 12.6 GB of memory and 4 cores in total. I am trying to cache an RDD with an approximate size of 3.2 GB, but apparently it is not cached, as I can neither see BlockManagerMasterActor:

Re: Why RDD is not cached?

2014-10-28 Thread Jagat Singh
What setting are you using for persist() or cache()? http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence On Tue, Oct 28, 2014 at 6:18 PM, shahab shahab.mok...@gmail.com wrote: Hi, I have a standalone spark , where the executor is set to have 6.3 G memory , as I am

Re: sampling in spark

2014-10-28 Thread Chengi Liu
Oops, the reference for the above code: http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945 On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I have three rdds.. X,y and p X is matrix rdd (mXn), y

sampling in spark

2014-10-28 Thread Chengi Liu
Hi, I have three RDDs: X, y and p. X is a matrix RDD (m x n), y is an (m x 1) vector and p is an (m x 1) probability vector. Now, I am trying to sample k rows from X and the corresponding entries in y based on the probability vector p. Here is the Python implementation: import random; from bisect

Re: Why RDD is not cached?

2014-10-28 Thread Sean Owen
Did you just call cache()? By itself it does nothing, but once an action requires the RDD to be computed, it should become cached. On Oct 28, 2014 8:19 AM, shahab shahab.mok...@gmail.com wrote: Hi, I have a standalone spark , where the executor is set to have 6.3 G memory , as I am using two workers

Re: sampling in spark

2014-10-28 Thread Davies Liu
_cumm = [p[0]]
for i in range(1, len(p)):
    _cumm.append(_cumm[-1] + p[i])
index = set([bisect(_cumm, random.random()) for i in range(k)])
chosed_x = X.zipWithIndex().filter(lambda (v, i): i in index).map(lambda (v, i): v)
chosed_y = [v for i, v

Re: sampling in spark

2014-10-28 Thread Chengi Liu
Is there an equivalent way of doing the following: a = [1,2,3,4]; reduce(lambda x, y: x+[x[-1]+y], a, [0])[1:] ?? The issue with the above suggestion is that the population is a hefty data structure :-/ On Tue, Oct 28, 2014 at 12:42 AM, Davies Liu dav...@databricks.com wrote: _cumm = [p[0]]
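
A minimal Scala sketch of the approach Davies outlines above (build the cumulative probabilities on the driver, draw k indices, then keep the matching rows via zipWithIndex). X, y, p and k are assumed to exist as in the question; everything else is illustrative, not a tested implementation:

    import scala.util.Random
    import org.apache.spark.rdd.RDD

    // p is assumed to be a probability vector that sums to 1.
    def sampleByProbability(X: RDD[Array[Double]], y: RDD[Double],
                            p: Array[Double], k: Int): (RDD[Array[Double]], RDD[Double]) = {
      val cumm = p.scanLeft(0.0)(_ + _).tail          // cumulative distribution
      def pick(): Long = {
        val r = Random.nextDouble()
        val i = cumm.indexWhere(r <= _)               // plays the role of bisect() above
        (if (i >= 0) i else cumm.length - 1).toLong
      }
      val chosen = Seq.fill(k)(pick()).toSet          // sampled row indices
      val chosenX = X.zipWithIndex().filter { case (_, i) => chosen(i) }.map(_._1)
      val chosenY = y.zipWithIndex().filter { case (_, i) => chosen(i) }.map(_._1)
      (chosenX, chosenY)
    }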

Singapore Meetup

2014-10-28 Thread Social Marketing
Dear Sir/Madam, This is Songtao, living in Singapore, doing some research on big data projects at NUS. I want to be an organiser for a Singapore Meetup. Thanks. Songtao - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Re: Why RDD is not cached?

2014-10-28 Thread shahab
I used cache() followed by a count on the RDD to ensure that caching is performed. val rdd = srdd.flatMap(mapProfile_To_Sessions).cache val count = rdd.count //so at this point the RDD should be cached, right? On Tue, Oct 28, 2014 at 8:35 AM, Sean Owen so...@cloudera.com wrote: Did you just call
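
A quick, hedged way to confirm whether the RDD really got persisted after the count (rdd and the helpers are the ones from the message above):

    val rdd = srdd.flatMap(mapProfile_To_Sessions).cache()
    rdd.count()                       // the first action materializes the cache
    println(rdd.getStorageLevel)      // cache() defaults to MEMORY_ONLY
    // The Storage tab of the web UI (port 4040) should now list the RDD; if its cached
    // fraction stays at 0%, the executors likely do not have enough storage memory.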

Submiting Spark application through code

2014-10-28 Thread sivarani
Hi, I am submitting a Spark application in the following fashion: bin/spark-submit --class NetworkCount --master spark://abc.test.com:7077 try/simple-project/target/simple-project-1.0-jar-with-dependencies.jar But is there any other way to submit a Spark application through code? Like, for

Re: Spark Streaming Applications

2014-10-28 Thread sivarani
Hi tdas, is it possible to run Spark 24/7? I am using updateStateByKey and I am streaming 3 lakh (300,000) records in half an hour. I am not getting the correct result, and I am also not able to run Spark Streaming 24/7: after a few hours I get an ArrayIndexOutOfBounds exception even if I am not streaming anything. Btw

Re: Spark Streaming - How to remove state for key

2014-10-28 Thread sivarani
I am having the same issue. I am using updateStateByKey, and over a period a set of data will not change; I would like to save it and delete it from the state. Have you found the answer? Please share your views. Thanks for your time -- View this message in context:

Re: Submiting Spark application through code

2014-10-28 Thread Akhil Das
How about directly running it? val ssc = new StreamingContext("local[2]", "Network WordCount", Seconds(5), "/home/akhld/mobi/localclusterxx/spark-1") val lines = ssc.socketTextStream("localhost", 12345) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x,
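
A completed version of the sketch above (the sparkHome path is a placeholder; this is illustrative, not Akhil's exact code):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits in Spark 1.x

    val ssc = new StreamingContext("local[2]", "Network WordCount", Seconds(5), "/path/to/spark")
    val lines = ssc.socketTextStream("localhost", 12345)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()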

Spark SQL reduce number of java threads

2014-10-28 Thread Wanda Hawk
Hello, I am trying to reduce the number of Java threads (about 80 on my system) to as few as possible. What settings can be done in spark-1.1.0/conf/spark-env.sh? (or other places as well) I am also using Hadoop for storing data on HDFS. Thank you, Wanda

Re: How to avoid use snappy compression when saveAsSequenceFile?

2014-10-28 Thread Shixiong Zhu
I mean updating the spark conf not only in the driver, but also in the Spark Workers. Because the driver configurations cannot be read by the Executors, they still use the default spark.io.compression.codec to deserialize the tasks. Best Regards, Shixiong Zhu 2014-10-28 16:39 GMT+08:00 buring
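
As an illustration of Shixiong's point, the codec can be pinned explicitly, and the very same value must reach the executors (e.g. via conf/spark-defaults.conf on every node, not only the driver's SparkConf). The choice of LZF below is an assumption for the sketch, not a recommendation:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("no-snappy")
      .set("spark.io.compression.codec", "org.apache.spark.io.LZFCompressionCodec")
    val sc = new SparkContext(conf)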

How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Stephen Boesch
I seem to recall there were some specific requirements on how to import the implicits. Here is the issue: scala> import org.apache.spark.mllib.rdd.RDDFunctions._ <console>:10: error: object RDDFunctions in package rdd cannot be accessed in package org.apache.spark.mllib.rdd import

Re: what classes are needed to register in KryoRegistrator, e.g. Row?

2014-10-28 Thread Fengyun RAO
Although nobody answered, as I tested, Row, MutableValue and their subclasses are not registered by default, which I think they should be, since they would absolutely show up in Spark SQL. 2014-10-26 23:43 GMT+08:00 Fengyun RAO raofeng...@gmail.com: In Tuning Spark

Re: Submiting Spark application through code

2014-10-28 Thread sivarani
Hi, I know we can create a Spark context with new JavaStreamingContext(master, appName, batchDuration, sparkHome, jarFile), but to run the application we will have to use spark-home/spark-submit --class NetworkCount. I want to skip submitting manually; I wanted to invoke this Spark app when a

Re: Spark SQL reduce number of java threads

2014-10-28 Thread Prashant Sharma
What is the motivation behind this ? You can start with master as local[NO_OF_THREADS]. Reducing the threads at all other places can have unexpected results. Take a look at this. http://spark.apache.org/docs/latest/configuration.html. Prashant Sharma On Tue, Oct 28, 2014 at 2:08 PM, Wanda

Re: Spark SQL reduce number of java threads

2014-10-28 Thread Wanda Hawk
I am trying to get a software trace and I need to get the number of active threads as low as I can in order to inspect the active part of the workload From: Prashant Sharma scrapco...@gmail.com To: Wanda Hawk wanda_haw...@yahoo.com Cc: user@spark.apache.org

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Yanbo Liang
Because org.apache.spark.mllib.rdd.RDDFunctions is a private[mllib] class, it can only be called by functions in mllib. 2014-10-28 17:09 GMT+08:00 Stephen Boesch java...@gmail.com: I seem to recall there were some specific requirements on how to import the implicits. Here is the issue:

How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread jan.zikes
Hi, I am currently struggling with how to properly set Spark to perform only one map, flatMap, etc at once. In other words my map uses multi core algorithm so I would like to have only one map running to be able to use all the machine cores. Thank you in advance for advices and replies.  Jan 

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Stephen Boesch
Hi Yanbo, that is not the issue: notice that importing the object is fine: scala> import org.apache.spark.mllib.rdd.RDDFunctions import org.apache.spark.mllib.rdd.RDDFunctions scala> import org.apache.spark.mllib.rdd.RDDFunctions._ <console>:11: error: object RDDFunctions in package rdd cannot be

Re: NoClassDefFoundError on ThreadFactoryBuilder in Intellij

2014-10-28 Thread Stephen Boesch
I had an offline with Akhil, but this issue is still not resolved. 2014-10-24 0:18 GMT-07:00 Akhil Das ak...@sigmoidanalytics.com: Make sure the guava jar http://mvnrepository.com/artifact/com.google.guava/guava/12.0 is present in the classpath. Thanks Best Regards On Thu, Oct 23, 2014

SparkSql OutOfMemoryError

2014-10-28 Thread Zhanfeng Huo
Hi friends: I use Spark SQL (Spark 1.1) to operate on data in Hive 0.12, and the job fails when the data is large. So how do I tune it? spark-defaults.conf: spark.shuffle.consolidateFiles true spark.shuffle.manager SORT spark.akka.threads 4 spark.sql.inMemoryColumnarStorage.compressed

Is There Any Benchmarks Comparing Spark SQL and Hive.

2014-10-28 Thread Mars Max
Currently we are using Hive in some products; however, it seems Spark SQL may be a better choice. Is there any official comparison between them? Thanks a lot! -- View this message in context:

Re: Spark Shell strange worker Exception

2014-10-28 Thread Saket Kumar
Hi Paolo, The custom classes and jars are distributed across the Spark cluster via an HTTP server on the master when the absolute path of the application fat jar is specified in the spark-submit script. The Advanced Dependency Management section on

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Cheng Lian
Which version of Spark and Hadoop are you using? Could you please provide the full stack trace of the exception? On Tue, Oct 28, 2014 at 5:48 AM, Du Li l...@yahoo-inc.com.invalid wrote: Hi, I was trying to set up Spark SQL on a private cluster. I configured a hive-site.xml under

How many executor process does an application receives?

2014-10-28 Thread shahab
Hi, I am running a standalone Spark cluster, 2 workers each with 2 cores. I submit one Spark application to the cluster, and I monitor the execution process via the UI (both worker-ip:8081 and master-ip:4040). There I can see that the application is handled by many executors; in my case one worker

Re: Is There Any Benchmarks Comparing Spark SQL and Hive.

2014-10-28 Thread Yanbo Liang
You can refer to comparisons between different SQL-on-Hadoop solutions such as Hive, Spark SQL, Shark, Impala and so on. There are two main works, which may not be very objective, for your reference: Cloudera benchmark:

Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
Hi, I got the following exceptions when using Spray client to write to OpenTSDB using its REST API. Exception in thread pool-10-thread-2 java.lang.NoSuchMethodError: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext; It worked locally in my Intellij but failed when I

newbie question quickstart example sbt issue

2014-10-28 Thread nl19856
Hi, I have downloaded the binary spark distribution. When building the package with sbt package I get the following: [root@nlvora157 ~]# sbt package [info] Set current project to Simple Project (in build file:/root/) [info] Updating {file:/root/}root... [info] Resolving

Re: newbie question quickstart example sbt issue

2014-10-28 Thread Yanbo Liang
Maybe you have a wrong sbt proxy configuration. 2014-10-28 18:27 GMT+08:00 nl19856 hanspeter.sl...@gmail.com: Hi, I have downloaded the binary spark distribution. When building the package with sbt package I get the following: [root@nlvora157 ~]# sbt package [info] Set current project to

Re: newbie question quickstart example sbt issue

2014-10-28 Thread nl19856
Sigh! Sorry I did not read the error message properly. 2014-10-28 11:39 GMT+01:00 Yanbo Liang [via Apache Spark User List] ml-node+s1001560n17478...@n3.nabble.com: Maybe you had wrong configuration of sbt proxy. 2014-10-28 18:27 GMT+08:00 nl19856 [hidden email]

Re: newbie question quickstart example sbt issue

2014-10-28 Thread Akhil Das
Your proxy/dns could be blocking it. Thanks Best Regards On Tue, Oct 28, 2014 at 4:06 PM, Yanbo Liang yanboha...@gmail.com wrote: Maybe you had wrong configuration of sbt proxy. 2014-10-28 18:27 GMT+08:00 nl19856 hanspeter.sl...@gmail.com: Hi, I have downloaded the binary spark

Re: NoSuchMethodError: cassandra.thrift.ITransportFactory.openTransport()

2014-10-28 Thread Sasi
Add my message. On Tue, Oct 28, 2014 at 3:22 PM, Sasi [via Apache Spark User List] ml-node+s1001560n17471...@n3.nabble.com wrote: Thank you Akhil. You are correct it's about overlapped thrift libraries. We have taken reference from

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread Yanbo Liang
The number of tasks is decided by the number of input partitions. If you want only one map or flatMap at once, just call coalesce() or repartition() to bring the data into one partition, as in the sketch below. However, this is not recommended because it will not execute efficiently in parallel. 2014-10-28 17:27 GMT+08:00
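
A minimal illustration of that suggestion (the RDD name and the heavy multi-core function are hypothetical):

    // Squeeze the data into a single partition so only one map task runs at a time.
    val singleTask = myRdd.coalesce(1).map(heavyMultiCoreFunction)
    // repartition(1) achieves the same partition count but always shuffles the data first.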

Re: SparkSql OutOfMemoryError

2014-10-28 Thread Yanbo Liang
Try to increase the driver memory. 2014-10-28 17:33 GMT+08:00 Zhanfeng Huo huozhanf...@gmail.com: Hi,friends: I use spark(spark 1.1) sql operate data in hive-0.12, and the job fails when data is large. So how to tune it ? spark-defaults.conf: spark.shuffle.consolidateFiles true

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Yifan LI
Hi Arpit, Try this: val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions = numPartitions, edgeStorageLevel = StorageLevel.MEMORY_AND_DISK, vertexStorageLevel = StorageLevel.MEMORY_AND_DISK) Best, Yifan LI On 28 Oct 2014, at 11:17, Arpit Kumar arp8...@gmail.com

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread jan.zikes
But I guess that this makes only one task over all the cluster's nodes. I would like to run several tasks, but I would like Spark to not run more than one map on each of my nodes at a time. That is, I would like to have, say, 4 different tasks and 2 nodes where each node has 2 cores.

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Arpit Kumar
Hi Yifan LI, I am currently working on Spark 1.0 in which we can't pass edgeStorageLevel as parameter. It implicitly caches the edges. So I am looking for a workaround. http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.graphx.GraphLoader$ Regards, Arpit On Tue, Oct 28,

sbt error building spark : [FATAL] Non-resolvable parent POM:

2014-10-28 Thread nl19856
Hi, I have cloned sparked as: git clone g...@github.com:apache/spark.git cd spark sbt/sbt compile Everything seems to go smooth until : [info] downloading https://repo1.maven.org/maven2/org/ow2/asm/asm-tree/5.0.3/asm-tree-5.0.3.jar ... [info] [SUCCESSFUL ]

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-28 Thread Yanbo Liang
It's not very difficult to achieve by properly setting the application's parameters. Some basic knowledge you should know: an application can have only one executor on each machine or container (YARN). So if you just set executor-cores to 1, each executor will run only one task at a time. 2014-10-28

Suitability for spark for master worker distributed patterns...

2014-10-28 Thread Sasha Kacanski
Hi, Did anyone try to replace a GigaSpaces implementation of master/worker with a Spark standalone or Hadoop-driven implementation? I guess I am looking to find out what the pros and cons are and whether people have tried it on the production side (grid or Hadoop). Regards, -- Aleksandar Kacanski

Re: How many executor process does an application receives?

2014-10-28 Thread Yanbo Liang
An application can have only one executor on each machine or container (YARN). How many threads each executor has is determined by the parameter executor-cores. There is also another way to set this: you can specify total-executor-cores, and each executor's cores will be determined

Re: How to import mllib.rdd.RDDFunctions into the spark-shell

2014-10-28 Thread Yanbo Liang
Yes, you can import org.apache.spark.mllib.rdd.RDDFunctions, but you cannot use any method of this class or even instantiate an object of it. So I infer that when you import org.apache.spark.mllib.rdd.RDDFunctions._, it may call some method of that object. 2014-10-28 17:29 GMT+08:00 Stephen Boesch

Re: Batch of updates

2014-10-28 Thread Kamal Banga
Hi Flavio, Doing batch += ... shouldn't work. It will create a new batch for each element in myRDD (also, val initializes an immutable variable; var is for mutable variables). You can use something like accumulators http://spark.apache.org/docs/latest/programming-guide.html#accumulators. val
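
A hedged sketch of the accumulator idea (Spark 1.x API; the RDD name and the per-element handling are illustrative):

    val processed = sc.accumulator(0)      // numeric accumulator owned by the driver
    myRDD.foreach { element =>
      // ... handle the element ...
      processed += 1                       // safe: += on an accumulator, not on a driver-side val
    }
    println(s"processed ${processed.value} elements")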

Re: How to set persistence level of graph in GraphX in spark 1.0.0

2014-10-28 Thread Yifan LI
I am not sure if it can work on Spark 1.0, but give it a try. Or maybe you can try: 1) constructing the edges and vertices RDDs respectively with the desired storage level; 2) then obtaining a graph by using Graph(verticesRDD, edgesRDD). Best, Yifan LI On 28 Oct 2014, at 12:10, Arpit Kumar

GraphX StackOverflowError

2014-10-28 Thread Zuhair Khayyat
Dear All, I am using the connected components function of GraphX (on Spark 1.0.2) on a graph. However, for some reason it fails with a StackOverflowError. The graph is not too big; it contains 1 vertices and 50 edges. Can anyone help me avoid this error? Below is the output of Spark:

Re: How can number of partitions be set in spark-env.sh?

2014-10-28 Thread Wanda Hawk
Is this what you are looking for? In Shark, default reducer number is 1 and is controlled by the property mapred.reduce.tasks. Spark SQL deprecates this property in favor of spark.sql.shuffle.partitions, whose default value is 200. Users may customize this property via SET: SET
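
For completeness, the property can be adjusted at runtime, for example (sqlContext is assumed to exist):

    sqlContext.sql("SET spark.sql.shuffle.partitions=10")
    // or through the configuration API:
    sqlContext.setConf("spark.sql.shuffle.partitions", "10")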

Deploying Spark on Stand alone cluster

2014-10-28 Thread TravisJ
I am trying to setup Apache-Spark on a small standalone cluster (1 Master Node and 8 Slave Nodes). I have installed the pre-built version of spark 1.1.0 built on top of Hadoop 2.4. I have set up the passwordless ssh between nodes and exported a few necessary environment variables. One of these

Re: Is Spark the right tool?

2014-10-28 Thread Koert Kuipers
Spark can definitely very quickly answer queries like "give me all transactions with property x", and you can put an HTTP query server in front of it and run queries concurrently. But Spark does not support inserts, updates, or fast random-access lookups. This is because RDDs are immutable and

Re: How can number of partitions be set in spark-env.sh?

2014-10-28 Thread shahab
Thanks for the useful comment. But I guess this setting applies only when I use Spark SQL, right? Is there any similar setting for Spark? best, /Shahab On Tue, Oct 28, 2014 at 2:38 PM, Wanda Hawk wanda_haw...@yahoo.com wrote: Is this what are you looking for ? In Shark, default reducer

Re: What executes on worker and what executes on driver side

2014-10-28 Thread Kamal Banga
Can you please elaborate? I didn't get what you intended for me to read in that link. Regards. On Mon, Oct 20, 2014 at 7:03 PM, Saurabh Wadhawan saurabh.wadha...@guavus.com wrote: What about:

Streaming window operations not producing output

2014-10-28 Thread diogo
Hi there, I'm trying to use window operations on streaming, but every time I perform a windowed computation, I stop getting results. For example: val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() will print the output to stdout at the 'batch duration' interval. Now if I replace it
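
A hedged sketch of the windowed variant, reusing `pairs` from the message above. Note that both the window length and the slide interval must be multiples of the batch duration, and the inverse-function variant additionally requires checkpointing:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits in Spark 1.x

    val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedCounts.print()
    // More efficient incremental form; needs ssc.checkpoint("/some/dir") first:
    // pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))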

Ending a job early

2014-10-28 Thread Jim Carroll
We have some very large datasets where the calculation converges on a result. Our current implementation allows us to track how quickly the calculations are converging and end the processing early. This can significantly speed up some of our processing. Is there a way to do the same thing in

pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Csaba Ragany
Dear Spark Community, Is it possible to convert text files (.log or .txt files) into sequencefiles in Python? Using PySpark I can create a parallelized file with rdd=sc.parallelize([('key1', 1.0)]) and I can save it as a sequencefile with rdd.saveAsSequenceFile(). But how can I put the whole

Re: Measuring Performance in Spark

2014-10-28 Thread mahsa
Thanks Akhil. So there is no tool that I can use, right? My program is overloading some operators for some operations on images. I need the result to be accurate. I will try the approach you offered. Thanks. -- View this message in context:

install sbt

2014-10-28 Thread Pagliari, Roberto
Is there a repo or some kind of instruction about how to install sbt for centos? Thanks,

java.lang.IllegalArgumentException: requirement failed: sizeInBytes was negative: -9223372036842471144

2014-10-28 Thread Ruebenacker, Oliver A
Hello, I have a Spark app which I run with master local[3]. When running without any persist calls, it seems to work fine, but as soon as I add persist calls (at default storage level), it fails at the first persist call with the message below. Unfortunately, I can't post the code.

Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Brett Antonides
Hello, Given the following example customers.json file: { "name": "Sherlock Holmes", "customerNumber": 12345, "address": { "street": "221b Baker Street", "city": "London", "zipcode": "NW1 6XE", "country": "United Kingdom" } }, { "name": "Big Bird", "customerNumber": 10001, "address": { "street": "123 Sesame Street", "city":
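
One way to select the nested fields after loading the JSON, sketched with plain SQL rather than the language-integrated DSL asked about here (this assumes customers.json holds one JSON object per line, as sqlContext.jsonFile expects):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val customers = sqlContext.jsonFile("customers.json")
    customers.registerTempTable("customers")
    // Nested fields are addressed with dot notation.
    val londoners = sqlContext.sql(
      "SELECT name, address.street FROM customers WHERE address.city = 'London'")
    londoners.collect().foreach(println)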

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-28 Thread Ilya Ganelin
Hi all - I've simplified the code so now I'm literally feeding in 200 million ratings directly to ALS.train. Nothing else is happening in the program. I've also tried with both the regular serializer and the KryoSerializer. With Kryo, I get the same ArrayIndex exceptions. With the regular

Re: Batch of updates

2014-10-28 Thread Sean Owen
You should use foreachPartition, and take care to open and close your connection following the pattern described in: http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E Within a partition, you iterate over
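
A minimal sketch of that pattern; createConnection and writeRecord stand in for whatever driver/DAO code the application actually uses:

    myRDD.foreachPartition { partition =>
      val conn = createConnection()                  // one connection per partition
      try {
        partition.grouped(1000).foreach { batch =>   // send the updates in batches
          batch.foreach(record => writeRecord(conn, record))
        }
      } finally {
        conn.close()                                 // always release the connection
      }
    }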

Re: install sbt

2014-10-28 Thread Ted Yu
Have you read this ? http://lancegatlin.org/tech/centos-6-install-sbt On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto rpagli...@appcomsci.com wrote: Is there a repo or some kind of instruction about how to install sbt for centos? Thanks,

Re: install sbt

2014-10-28 Thread Nicholas Chammas
If you're just calling sbt from within the spark/sbt folder, it should download and install automatically. Nick On Tuesday, October 28, 2014, Ted Yu yuzhih...@gmail.com wrote: Have you read this ? http://lancegatlin.org/tech/centos-6-install-sbt On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto

Re: install sbt

2014-10-28 Thread Soumya Simanta
sbt is just a jar file. So you really don't need to install anything. Once you run the jar file (sbt-launch.jar) it can download the required dependencies. I use an executable script called sbt that has the following contents. SBT_OPTS=-Xms1024M -Xmx2048M -Xss1M -XX:+CMSClassUnloadingEnabled

Saving to Cassandra from Spark Streaming

2014-10-28 Thread Harold Nguyen
Hi all, I'm having trouble troubleshooting this particular block of code for Spark Streaming and saving to Cassandra: val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x,

JdbcRDD in Java

2014-10-28 Thread Ron Ayoub
The following line of code is indicating that the constructor is not defined. The only examples I can find of JdbcRDD usage are Scala examples. Does this work in Java? Are there any examples? Thanks. JdbcRDD<Integer> rdd = new JdbcRDD<Integer>(sp, () -> ods.getConnection(), sql,

Re: Saving to Cassandra from Spark Streaming

2014-10-28 Thread Gerard Maas
Looks like you're having some classpath issues. Are you providing your spark-cassandra-driver classes to your job? sparkConf.setJars(Seq(jars...)) ? On Tue, Oct 28, 2014 at 5:34 PM, Harold Nguyen har...@nexgate.com wrote: Hi all, I'm having trouble troubleshooting this particular block of
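
A hedged example of what that looks like; the jar path below is a placeholder for wherever the connector assembly actually lives:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("stream-to-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .setJars(Seq("/path/to/spark-cassandra-connector-assembly.jar"))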

Re: pySpark - convert log/txt files into sequenceFile

2014-10-28 Thread Holden Karau
Hi Csaba, It sounds like the API you are looking for is sc.wholeTextFiles :) Cheers, Holden :) On Tuesday, October 28, 2014, Csaba Ragany rag...@gmail.com wrote: Dear Spark Community, Is it possible to convert text files (.log or .txt files) into sequencefiles in Python? Using PySpark I

Re: How can number of partitions be set in spark-env.sh?

2014-10-28 Thread Ilya Ganelin
In Spark, certain functions have an optional parameter to determine the number of partitions (distinct, textFile, etc.). You can also use the coalesce() or repartition() functions to change the number of partitions of your RDD. Thanks. On Oct 28, 2014 9:58 AM, shahab shahab.mok...@gmail.com
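
A few concrete examples of those parameters (paths and names are hypothetical):

    val lines   = sc.textFile("hdfs:///data/input", 64)   // ask for at least 64 partitions on load
    val uniques = lines.distinct(32)                       // shuffle into 32 partitions
    val fewer   = uniques.coalesce(8)                      // shrink without a full shuffle
    val more    = uniques.repartition(128)                 // full shuffle up to 128 partitions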

Re: Scala Spark IDE help

2014-10-28 Thread Matt Narrell
So, I'm using IntelliJ 13.x and Scala Spark jobs. Make sure you have singletons (objects, not classes), then simply debug the main function. You’ll need to set your master to some derivation of “local”, but that’s it. Spark Streaming is kinda wonky when debugging, but data-at-rest behaves

Re: Scala Spark IDE help

2014-10-28 Thread andy petrella
Also, I'm following two master's students at the University of Liège (one computing conditional probability densities on massive data and the other implementing a Markov chain method on georasters). I suggested they use the Spark-Notebook to learn the framework, and they're quite happy with it (so far at

Re: Keep state inside map function

2014-10-28 Thread Koert Kuipers
Doing cleanup in an iterator like that assumes the iterator always gets fully read, which is not necessarily the case (for example, RDD.take does not). Instead I would use mapPartitionsWithContext, in which case you can write a function of the form f: (TaskContext, Iterator[T]) => Iterator[U]. Now
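
A sketch of that suggestion (the resource and the transformation are hypothetical; in Spark 1.x this API is marked @DeveloperApi):

    import org.apache.spark.TaskContext

    val result = myRDD.mapPartitionsWithContext { (context: TaskContext, iter: Iterator[String]) =>
      val resource = openResource()                          // assumed per-partition resource
      context.addOnCompleteCallback(() => resource.close())  // runs even if iter is never fully read
      iter.map(record => transform(resource, record))        // assumed transformation
    }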

Re: Spark to eliminate full-table scan latency

2014-10-28 Thread Matt Narrell
I’ve been puzzled by this lately. I too would like to use the thrift server to provide JDBC style access to datasets via SparkSQL. Is this possible? The examples show temp tables created during the lifetime of a SparkContext. I assume I can use SparkSQL to query those tables while the

Re: Submiting Spark application through code

2014-10-28 Thread Matt Narrell
Can this be done? Can I just spin up a SparkContext programmatically, point this to my yarn-cluster and this works like spark-submit?? Doesn’t (at least) the application JAR need to be distributed to the workers via HDFS or the like for the jobs to run? mn On Oct 28, 2014, at 2:29 AM,

real-time streaming

2014-10-28 Thread ll
The Spark tutorial shows that we can create a stream that reads new files from a directory. That seems to have some lag time, as we have to write the data to a file first and then wait until the Spark stream picks it up. What is the best way to implement real 'real-time' streaming for analysis in

Re: real-time streaming

2014-10-28 Thread jay vyas
A real-time stream, by definition, delivers data every X seconds. You can easily do this with Spark. Roughly, here is the way to create a stream gobbler and attach a Spark app to read its data every X seconds (see the sketch below): - Write a Runnable thread which reads data from a source. Test that it works
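
The "stream gobbler" idea can also be expressed as a custom receiver; a hedged sketch (the source-reading helper is hypothetical):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class GobblerReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {
      override def onStart(): Unit = {
        new Thread("gobbler") {
          override def run(): Unit = {
            while (!isStopped()) {
              store(readOneRecordFromSource())   // hand each record to Spark Streaming
            }
          }
        }.start()
      }
      override def onStop(): Unit = { /* release the source here */ }
    }
    // val stream = ssc.receiverStream(new GobblerReceiver)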

Re: real-time streaming

2014-10-28 Thread ll
Thanks Jay. Do you think Spark is a good fit for streaming and analyzing videos in real time? In this case, we're streaming 30 frames per second, and each frame is an image (size: roughly 500K - 1MB). We need to analyze every frame and return the analysis result back instantly in real

Re: JdbcRDD in Java

2014-10-28 Thread Sean Owen
That declaration looks OK for Java 8, at least when I tried it just now vs master. The only thing I see wrong here is getInt throws an exception which means the lambda has to be more complicated than this. This is Java code here calling the constructor so yes it can work fine from Java (8). On

Re: Spark Streaming and Storm

2014-10-28 Thread critikaled
http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-Storm-tp9118p17530.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
Here's the answer I got from Akka's user ML. This looks like a binary incompatibility issue. As far as I know Spark is using a custom built Akka and Scala for various reasons. You should ask this on the Spark mailing list, Akka is binary compatible between major versions (2.3.6 is compatible

Including jars in Spark-shell vs Spark-submit

2014-10-28 Thread Harold Nguyen
Hi all, The following works fine when submitting dependency jars through Spark-Shell: ./bin/spark-shell --master spark://ip-172-31-38-112:7077 --jars

Re: Ending a job early

2014-10-28 Thread Patrick Wendell
Hey Jim, There are some experimental (unstable) APIs that support running jobs which might short-circuit: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1126 This can be used for doing online aggregations like you are describing. And in one

Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I haven't learned Scala yet, so as you might imagine I'm having challenges working with Spark from the Java API. For one thing, it seems very limited in comparison to Scala. I ran into a problem really quickly. I need to hydrate an RDD from JDBC/Oracle, and so I wanted to use the JdbcRDD. But that

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
I am using Hadoop 2.5.0.3 and Spark 1.1. My local Hive version is 0.12.3, whose hcatalog.jar is included in the path. The stack trace is as follows: 14/10/28 18:24:24 WARN ipc.Client: Exception encountered while connecting to the server :

Re: RDD to Multiple Tables SparkSQL

2014-10-28 Thread critikaled
Hi Oliver, thanks for the answer. I don't have information about all keys beforehand. The reason I want to have multiple tables is that, based on my information about a known key, I will apply different queries and get the results for that particular key. I don't want to touch the unknown ones; I'll save that

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
If I put all the jar files from my local hive in the front of the spark class path, a different error was reported, as follows: 14/10/28 18:29:40 ERROR transport.TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: PLAIN auth failed: null at

Re: Is Spark in Java a bad idea?

2014-10-28 Thread critikaled
Hi Ron, whatever API you have in Scala, you can possibly use it from Java. Scala is interoperable with Java and vice versa. Scala, being both object-oriented and functional, will make your job easier on the JVM, and it is more concise than Java. Take it as an opportunity and start learning Scala ;).

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
Any suggestions guys?? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-JavaSchemaRDD-inherit-the-Hive-partitioning-of-data-tp17410p17539.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Including jars in Spark-shell vs Spark-submit

2014-10-28 Thread Helena Edelson
Hi Harold, It seems like, based on your previous post, you are using one version of the connector as a dependency yet building the assembly jar from master? You were using 1.1.0-alpha3 (you can upgrade to alpha4, beta coming this week) yet your assembly is

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
A pretty large fraction of users use Java, but a few features are still not available in it. JdbcRDD is one of them -- this functionality will likely be superseded by Spark SQL when we add JDBC as a data source. In the meantime, to use it, I'd recommend writing a class in Scala that has
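
A hedged sketch of the kind of Scala helper Matei describes: hide JdbcRDD's Scala function arguments behind a plain method that Java code can call. The URL, query and column mapping are placeholders:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.{JdbcRDD, RDD}

    object JdbcHelper {
      // sql must contain two '?' placeholders bound to the lower/upper range.
      def intColumn(sc: SparkContext, url: String, sql: String,
                    lower: Long, upper: Long, partitions: Int): RDD[Int] =
        new JdbcRDD(sc, () => DriverManager.getConnection(url), sql,
          lower, upper, partitions, (rs: ResultSet) => rs.getInt(1))
    }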

RE: Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I interpret this to mean you have to learn Scala in order to work with Spark in Scala (goes without saying) and also to work with Spark in Java (since you have to jump through some hoops for basic functionality). The best path here is to take this as a learning opportunity and sit down and

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
The overridable methods of RDD are marked as @DeveloperApi, which means that these are internal APIs used by people that might want to extend Spark, but are not guaranteed to remain stable across Spark versions (unlike Spark's public APIs). BTW, if you want a way to do this that does not

  1   2   >