Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Xiangrui Meng
You can try LinearRegression with sparse input. It converges to the least-squares solution if the linear system is over-determined, while the convergence rate depends on the condition number. Applying standard scaling is a popular heuristic to reduce the condition number. If you are interested in
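
A minimal sketch of that suggestion, assuming a recent MLlib (StandardScaler appears in 1.1); the input format "b i1:v1 i2:v2 ...", the feature dimension (1000), and the iteration count are illustrative assumptions, not part of the original reply:

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // Each equation a_i . x = b_i becomes LabeledPoint(b_i, a_i) with a sparse feature vector.
    val equations = sc.textFile("hdfs:///path/to/system.txt").map { line =>
      val parts = line.split(' ')
      val b = parts.head.toDouble
      val (indices, values) = parts.tail.map { t =>
        val Array(i, v) = t.split(':'); (i.toInt, v.toDouble)
      }.unzip
      LabeledPoint(b, Vectors.sparse(1000, indices, values)) // 1000 = assumed dimension
    }

    // Standard scaling of the features (withMean = false keeps vectors sparse) is the
    // heuristic mentioned above for reducing the condition number.
    val scaler = new StandardScaler(withMean = false, withStd = true).fit(equations.map(_.features))
    val scaled = equations.map(p => LabeledPoint(p.label, scaler.transform(p.features)))

    val model = LinearRegressionWithSGD.train(scaled, 100) // 100 iterations, assumed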

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Xiangrui Meng
There is an undocumented configuration to put user jars in front of the Spark jar. But I'm not very certain that it works as expected (and this is why it is undocumented). Please try turning on spark.yarn.user.classpath.first. -Xiangrui On Sat, Sep 6, 2014 at 5:13 PM, Victor Tso-Guillen

Re: error: type mismatch while Union

2014-09-08 Thread Dhimant
Thank you Aaron for pointing out the problem. This only happens when I run this code in spark-shell but not when I submit the job. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-type-mismatch-while-Union-tp13547p13677.html Sent from the Apache Spark User

Re: How to list all registered tables in a sql context?

2014-09-08 Thread Jianshi Huang
Thanks Tobias, I also found this: https://issues.apache.org/jira/browse/SPARK-3299 Looks like it's being worked on. Jianshi On Mon, Sep 8, 2014 at 9:28 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Err...

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error when trying to run distcp: ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Frank Austin Nothaft
Tomer, To use distcp, you need to have a Hadoop compute cluster up. start-dfs just restarts HDFS. I don’t have a Spark 1.0.2 cluster up right now, but there should be a start-mapred*.sh or start-all.sh script that will launch the Hadoop MapReduce cluster that you will need for distcp.

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-08 Thread Sean Owen
That should be OK, since the iterator is definitely consumed, and therefore the connection is actually done with, at the end of a 'foreach' method. You might put the close in a finally block. On Mon, Sep 8, 2014 at 12:29 AM, Soumitra Kumar kumar.soumi...@gmail.com wrote: I have the following code:
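
A sketch of the pattern being discussed: one connection per partition, closed in a finally block. The stream name, JDBC URL, credentials, and insert statement are placeholders, not taken from the original code:

    import java.sql.DriverManager

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { iterator =>
        // One connection per partition, not per record.
        val conn = DriverManager.getConnection("jdbc:mysql://host/db", "user", "password")
        try {
          val stmt = conn.prepareStatement("INSERT INTO events (value) VALUES (?)")
          iterator.foreach { record =>
            stmt.setString(1, record.toString)
            stmt.executeUpdate()
          }
        } finally {
          conn.close() // runs even if processing the iterator throws
        }
      }
    }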

Re: Spark Streaming and database access (e.g. MySQL)

2014-09-08 Thread Tobias Pfeiffer
Hi, On Mon, Sep 8, 2014 at 4:39 PM, Sean Owen so...@cloudera.com wrote: if (rdd.take(1).size == 1) { rdd foreachPartition { iterator => I was wondering: Since take() is an output operation, isn't it computed twice (once for the take(1), once during the

sharing off_heap rdds

2014-09-08 Thread Manku Timma
I see that the tachyon url constructed for an rdd partition has executor id in it. So if the same partition is being processed by a different executor on a reexecution of the same computation, it cannot really use the earlier result. Is this a correct assessment? Will removing the executor id from

How to profile a spark application

2014-09-08 Thread rapelly kartheek
Hi, Can someone tell me how to profile a spark application. -Karthik

Re: Spark SQL check if query is completed (pyspark)

2014-09-08 Thread jamborta
thank you for the replies. I am running an insert on a join (INSERT OVERWRITE TABLE new_table select * from table1 as a join table2 as b on (a.key = b.key), The process does not have the right permission to write to that folder, so I get the following error printed: chgrp: `/user/x/y': No such

Re: How to profile a spark application

2014-09-08 Thread Ted Yu
See https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit On Sep 8, 2014, at 2:48 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, Can someone tell me how to profile a spark application. -Karthik

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-08 Thread Ognen Duzlevski
Solved. The problem is the following: the underlying Akka driver uses the INTERNAL interface address on the Amazon instance (the ones that start with 10.x.y.z) to present itself to the master, it does not use the external (public) IP! Ognen On 9/7/2014 3:21 PM, Sean Owen wrote: Also keep

Re: Standalone spark cluster. Can't submit job programmatically - java.io.InvalidClassException

2014-09-08 Thread DrKhu
After wasting a lot of time, I've found the problem. Although I haven't used Hadoop/HDFS in my application, the Hadoop client matters. The problem was the hadoop-client version: it was different from the version of Hadoop that Spark was built for. Spark's Hadoop version is 1.2.1, but in my application that was

spark application in cluster mode doesn't run correctly

2014-09-08 Thread 남윤민
Hello, I tried to execute a simple spark application using sparkSQL. At first try, it worked as I expected, but after that it doesn't run and shows stderr like below: Spark Executor Command: java -cp

Error while running sparkSQL application in the cluster-mode environment

2014-09-08 Thread 남윤민
Hello, I tried to execute a simple spark application using sparkSQL. At first try, it worked as I expected, but after that it doesn't run and shows stderr like below: Spark Executor Command: java -cp

How to scale large kafka topic

2014-09-08 Thread richiesgr
Hi, I'm building an application that reads from a Kafka event stream. In production we have 5 consumers that share 10 partitions. But with Spark Streaming's Kafka integration the master acts as a consumer and then distributes the tasks to workers, so I can have only 1 master acting as consumer, but I need more because only 1

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
Durin, I have integrated ecos with spark, which uses suitesparse under the hood for linear equation solves... I have exposed only the qp solver api in spark since I was comparing ip with proximal algorithms, but we can expose suitesparse api as well...jni is used to load up ldl amd and ecos

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
Xiangrui, Should I open up a JIRA for this? Distributed lp/socp solver through ecos/ldl/amd? I can open source it with gpl license in spark code as that's what our legal cleared (apache + gpl becomes gpl) and figure out the right way to call it...ecos is gpl but we can definitely use the jni

Cannot run SimpleApp as regular Java app

2014-09-08 Thread ericacm
Dear all: I am a brand new Spark user trying out the SimpleApp from the Quick Start page. Here is the code: object SimpleApp { def main(args: Array[String]) { val logFile = /dev/spark-1.0.2-bin-hadoop2/README.md // Should be some file on your system val conf = new SparkConf()
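
For reference, the Quick Start example that the preview truncates continues roughly as follows (counting lines containing "a" and "b" in the README). When run as a regular Java/Scala app rather than through spark-submit, a master URL such as .setMaster("local[*]") also has to be set on the conf:

    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleApp {
      def main(args: Array[String]) {
        val logFile = "/dev/spark-1.0.2-bin-hadoop2/README.md" // should be some file on your system
        val conf = new SparkConf().setAppName("Simple Application") // add .setMaster("local[*]") if not using spark-submit
        val sc = new SparkContext(conf)
        val logData = sc.textFile(logFile, 2).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }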

Spark SQL on Cassandra

2014-09-08 Thread gtinside
Hi , I am reading data from Cassandra through datastax spark-cassandra connector converting it into JSON and then running spark-sql on it. Refer to the code snippet below : step 1 val o_rdd = sc.cassandraTable[CassandraRDDWrapper]( 'keyspace', 'column_family') step 2 val tempObjectRDD =

A problem for running MLLIB in amazon clound

2014-09-08 Thread Hui Li
I am running a very simple example using SVMWithSGD on Amazon EMR. I haven't gotten any result after a full hour. My instance-type is: m3.large instance-count is: 3 Dataset is the data provided by MLLIB in apache: sample_svm_data The number of iterations is: 2 and all other options
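
For context, a minimal version of that MLlib example, following the parsing shown in the MLlib docs of that era; the HDFS path is an assumption:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // sample_svm_data.txt: label followed by space-separated feature values
    val data = sc.textFile("hdfs:///data/sample_svm_data.txt")
    val parsed = data.map { line =>
      val parts = line.split(' ').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }.cache()

    val model = SVMWithSGD.train(parsed, 2) // 2 iterations, as in the report above
    println("Weights: " + model.weights)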

groupBy gives non deterministic results

2014-09-08 Thread redocpot
Hi, I have a key-value RDD called rdd below. After a groupBy, I tried to count rows. But the result is not unique, somehow non deterministic. Here is the test code: val step1 = ligneReceipt_cleTable.persist val step2 = step1.groupByKey val s1size = step1.count val s2size =

Re: How to profile a spark application

2014-09-08 Thread rapelly kartheek
Thank you Ted. regards Karthik On Mon, Sep 8, 2014 at 3:33 PM, Ted Yu yuzhih...@gmail.com wrote: See https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit On Sep 8, 2014, at 2:48 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, Can someone

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Nicholas Chammas
Tomer, Did you try start-all.sh? It worked for me the last time I tried using distcp, and it worked for this guy too http://stackoverflow.com/a/18083790/877069. Nick ​ On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote: ~/ephemeral-hdfs/sbin/start-mapred.sh does not

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
Still no luck, even when running stop-all.sh followed by start-all.sh. On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Tomer, Did you try start-all.sh? It worked for me the last time I tried using distcp, and it worked for this guy too. Nick On Mon, Sep

Re: groupBy gives non deterministic results

2014-09-08 Thread redocpot
Update: Just tested with HashPartitioner(8) and counted rows per partition: List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657591*), (*6,658327*), (*7,658434*)), List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), *(5,657594)*, (6,658326), (*7,658434*)),
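
A sketch of how such per-partition counts can be produced, assuming a key-value RDD like the one described in the thread; differing counts across runs would show the reported non-determinism:

    import org.apache.spark.HashPartitioner

    // Count records per partition after hash-partitioning by key.
    val counts = rdd.partitionBy(new HashPartitioner(8))
      .mapPartitionsWithIndex((i, iter) => Iterator((i, iter.size)))
      .collect()
      .toList
    println(counts)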

How do you perform blocking IO in apache spark job?

2014-09-08 Thread DrKhu
What if, when I traverse an RDD, I need to calculate values in the dataset by calling an external (blocking) service? How do you think that could be achieved? val values: Future[RDD[Double]] = Future sequence tasks I've tried to create a list of Futures, but as RDD is not Traversable, Future.sequence is
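
One common alternative to driver-side Futures (not necessarily what the list settled on) is to let the executors make the blocking calls inside mapPartitions; parallelism is then bounded by the number of partitions and cores. The service stub and input type below are hypothetical stand-ins:

    // Hypothetical stand-in for the blocking native/external call described in the thread.
    def callExternalService(record: String): Double = record.length.toDouble

    val values: org.apache.spark.rdd.RDD[Double] = inputRdd.mapPartitions { records =>
      // The blocking call runs on the executors, one partition's records at a time.
      records.map(callExternalService)
    }
    values.count()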

Re: How do you perform blocking IO in apache spark job?

2014-09-08 Thread Jörn Franke
Hi, What does the external service provide? Data? Calculations? Can the service push data to you via Kafka and Spark Streaming? Can you fetch the necessary data beforehand from the service? The solution to your question depends on your answers. I would not recommend connecting to a blocking

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
What did you see in the log? Was there anything related to MapReduce? Can you log into your HDFS (data) node, use jps to list all Java processes and confirm whether there is a tasktracker process (or nodemanager) running with the datanode process -- Ye Xianjin Sent with Sparrow

Re: How do you perform blocking IO in apache spark job?

2014-09-08 Thread DrKhu
Hi, Jörn, first of all, thanks for your intent to help. This one external service is a native component, that is stateless and that performs the calculation based on the data I provide. The data is in an RDD. That one component I have on each worker node and I would like to get as much parallelism

Re: How do you perform blocking IO in apache spark job?

2014-09-08 Thread Sean Owen
What is the driver-side Future for? Are you trying to make the remote Spark workers execute more requests to your service concurrently? it's not clear from your messages whether it's something like a web service, or just local native code. So the time spent in your processing -- whatever returns

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
No tasktracker or nodemanager. This is what I see: On the master: org.apache.hadoop.yarn.server.resourcemanager.ResourceManager org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode org.apache.hadoop.hdfs.server.namenode.NameNode On the data node (slave):

If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Dimension Data, LLC.
Hello friends: It was mentioned in another (Y.A.R.N.-centric) email thread that 'SPARK_JAR' was deprecated, and to use the 'spark.yarn.jar' property instead for YARN submission. For example: user$ pyspark [some-options] --driver-java-options

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 9:35 AM, Dimension Data, LLC. subscripti...@didata.us wrote: user$ pyspark [some-options] --driver-java-options spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar This command line does not look correct. spark.yarn.jar is not a JVM command line option.

Re: How do you perform blocking IO in apache spark job?

2014-09-08 Thread DrKhu
Thanks, Sean, I'll try to explain what I'm trying to do. The native component that I'm talking about is native code that I call using JNI. I've written a small test. Here, I traverse through the collection to call the native component N (1000) times. Then I have a result; it means that

Re: Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

2014-09-08 Thread Matt Narrell
I came across this: https://github.com/xerial/sbt-pack Until I found this, I was simply using the sbt-assembly plugin (sbt clean assembly) mn On Sep 4, 2014, at 2:46 PM, Aris arisofala...@gmail.com wrote: Thanks for answering Daniil - I have SBT version 0.13.5, is that an old version?

Spark-submit ClassNotFoundException with JAR!

2014-09-08 Thread Peter Aberline
Hi, I'm having problems with a ClassNotFoundException using this simple example: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import java.net.URLClassLoader import scala.util.Marshal class ClassToRoundTrip(val id: Int) extends

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 10:00 AM, Dimension Data, LLC. subscripti...@didata.us wrote: user$ export MASTER=local[nn] # Run spark shell on LOCAL CPU threads. user$ pyspark [someOptions] --driver-java-options -Dspark.*XYZ*.jar=' /usr/lib/spark/assembly/lib/spark-assembly-*.jar' My question is,

RE: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Penny Espinoza
I don't understand what you mean. Can you be more specific? From: Victor Tso-Guillen v...@paxata.com Sent: Saturday, September 06, 2014 5:13 PM To: Penny Espinoza Cc: Spark Subject: Re: prepending jars to the driver class path for spark-submit on YARN I ran

Re: How do you perform blocking IO in apache spark job?

2014-09-08 Thread Jörn Franke
Hi, So the external service itself creates threads and blocks until they finish execution? In this case you should not do threading but include it via jni directly in spark - it will take care of threading for you. Best regards Hi, Jörn, first of all, thanks for your intent to help. This

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
Well, this means you didn't start a compute cluster. Most likely the wrong value of mapreduce.jobtracker.address caused the slave node to fail to start the node manager. (I am not familiar with the ec2 script, so I don't know whether the slave node has the node manager installed or not.) Can

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Xiangrui Meng
When you submit the job to yarn with spark-submit, set --conf spark.yarn.user.classpath.first=true . On Mon, Sep 8, 2014 at 10:46 AM, Penny Espinoza pesp...@societyconsulting.com wrote: I don't understand what you mean. Can you be more specific? From: Victor

Re: A problem for running MLLIB in amazon clound

2014-09-08 Thread Xiangrui Meng
Could you attach the driver log? -Xiangrui On Mon, Sep 8, 2014 at 7:23 AM, Hui Li hli161...@gmail.com wrote: I am running a very simple example using the SVMWithSGD on Amazon EMR. I haven't got any result after one hour long. My instance-type is: m3.large instance-count is: 3 Dataset

RE: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Penny Espinoza
I have tried using the spark.files.userClassPathFirst option (which, incidentally, is documented now, but marked as experimental), but it just causes different errors. I am using spark-streaming-kafka. If I mark spark-core and spark-streaming as provided and also exclude them from the

Re: Crawler and Scraper with different priorities

2014-09-08 Thread Daniil Osipov
Depending on what you want to do with the result of the scraping, Spark may not be the best framework for your use case. Take a look at a general Akka application. On Sun, Sep 7, 2014 at 12:15 AM, Sandeep Singh sand...@techaddict.me wrote: Hi all, I am Implementing a Crawler, Scraper. The It

Input Field in Spark 1.1 Web UI

2014-09-08 Thread Arun Ahuja
Is there more information on what the Input column on the Spark UI means? How is this computed? I am processing a fairly small (but zipped) file and see the value as [image: Inline image 1] This does not seem correct? Thanks, Arun

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Xiangrui Meng
I asked Tim whether he would change the license of SuiteSparse to an Apache-friendly license a couple of months ago, but the answer was no. So I don't think we can use SuiteSparse in MLlib through JNI. Please feel free to create JIRAs for distributed linear programming and SOCP solvers and run the

RE: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Penny Espinoza
Victor - Not sure what you mean. Can you provide more detail about what you did? From: Victor Tso-Guillen v...@paxata.com Sent: Saturday, September 06, 2014 5:13 PM To: Penny Espinoza Cc: Spark Subject: Re: prepending jars to the driver class path for

saveAsHadoopFile into avro format

2014-09-08 Thread Dariusz Kobylarz
What is the right way of saving any PairRDD into Avro output format? GraphArray extends SpecificRecord etc. I have the following Java RDD: JavaPairRDD<GraphArray, NullWritable> pairRDD = ... and want to save it to Avro format: org.apache.hadoop.mapred.JobConf jc = new
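
A sketch of the usual old-API (mapred) pattern, written in Scala for brevity and assuming GraphArray is an Avro SpecificRecord with a generated SCHEMA$ field and that pairRDD holds (GraphArray, NullWritable) pairs; the output path is a placeholder:

    import org.apache.avro.mapred.{AvroJob, AvroOutputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.JobConf

    val jc = new JobConf(sc.hadoopConfiguration)
    AvroJob.setOutputSchema(jc, GraphArray.SCHEMA$)

    // AvroOutputFormat expects (AvroWrapper[T], NullWritable) pairs.
    pairRDD
      .map { case (record, _) => (new AvroWrapper(record), NullWritable.get()) }
      .saveAsHadoopFile(
        "hdfs:///path/to/output",
        classOf[AvroWrapper[GraphArray]],
        classOf[NullWritable],
        classOf[AvroOutputFormat[GraphArray]],
        jc)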

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 11:52 AM, Dimension Data, LLC. subscripti...@didata.us wrote: So just to clarify for me: When specifying 'spark.yarn.jar' as I did above, even if I don't use HDFS to create a RDD (e.g. do something simple like: 'sc.parallelize(range(100))'), it is still necessary to

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
Yup...this can be a spark community project...I saw a PR for that...interested users fine with lgpl/gpl code can make use of it... On Mon, Sep 8, 2014 at 12:37 PM, Xiangrui Meng men...@gmail.com wrote: I asked Tim whether he would change the license of SuiteSparse to an Apache-friendly license

Records - Input Byte

2014-09-08 Thread danilopds
Hi, I was reading the paper on Spark Streaming: Discretized Streams: Fault-Tolerant Streaming Computation at Scale. I read that the performance evaluation used 100-byte input records in the Grep and WordCount tests. I don't have much experience and I'd like to know how I can control this value in my

Recommendations for performance

2014-09-08 Thread Manu Mukerji
Hi, Let me start with: I am new to Spark (be gentle). I have a large data set in Parquet (~1.5B rows, 900 columns). Currently Impala takes ~1-2 seconds for the queries while SparkSQL is taking ~30 seconds. Here is what I am currently doing: I launch with SPARK_MEM=6g spark-shell val
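
Not a full answer, but one thing commonly suggested on the list is caching the Parquet table in memory before comparing against Impala. A minimal sketch, assuming Spark 1.0/1.1 APIs (registerTempTable is registerAsTable on older 1.0.x builds) and a placeholder path and query:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val parquet = sqlContext.parquetFile("hdfs:///path/to/table.parquet")
    parquet.registerTempTable("events")   // registerAsTable("events") on Spark 1.0.x
    sqlContext.cacheTable("events")       // columnar in-memory cache

    // The first query pays the caching cost; subsequent queries should be much faster.
    sqlContext.sql("SELECT col1, count(*) FROM events GROUP BY col1").collect()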

Re: Spark SQL check if query is completed (pyspark)

2014-09-08 Thread Michael Armbrust
You are probably not getting an error because the exception is happening inside of Hive. I'd still consider this a bug if you'd like to open a JIRA. On Mon, Sep 8, 2014 at 3:02 AM, jamborta jambo...@gmail.com wrote: thank you for the replies. I am running an insert on a join (INSERT

Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Jim Carroll
Hello all, I've been wrestling with this problem all day and any suggestions would be greatly appreciated. I'm trying to test reading a parquet file that's stored in s3 using a spark cluster deployed on ec2. The following works in the spark shell when run completely locally on my own machine

Re: Spark SQL on Cassandra

2014-09-08 Thread Michael Armbrust
I believe DataStax is working on better integration here, but until that is ready you can use the applySchema API. Basically you will convert the CassandraTable into an RDD of Row objects using a .map() and then you can call applySchema (provided by SQLContext) to get a SchemaRDD. More details
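
A rough sketch of that applySchema approach against the table from the original post; the column names and types (id, payload) are placeholders and depend on what CassandraRDDWrapper actually holds:

    import com.datastax.spark.connector._
    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)

    // Assumed columns: (id: Int, payload: String) -- adapt to CassandraRDDWrapper's fields.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("payload", StringType, nullable = true)))

    val rowRDD = sc.cassandraTable[CassandraRDDWrapper]("keyspace", "column_family")
      .map(w => Row(w.id, w.payload))

    val schemaRDD = sqlContext.applySchema(rowRDD, schema)
    schemaRDD.registerTempTable("my_table")
    sqlContext.sql("SELECT id FROM my_table LIMIT 10").collect()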

Re: Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Manu Mukerji
How big is the data set? Does it work when you copy it to hdfs? -Manu On Mon, Sep 8, 2014 at 2:58 PM, Jim Carroll jimfcarr...@gmail.com wrote: Hello all, I've been wrestling with this problem all day and any suggestions would be greatly appreciated. I'm trying to test reading a parquet

RE: SchemaRDD - Parquet - insertInto makes many files

2014-09-08 Thread chutium
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-NumberofTasks It would be great if something like hive.exec.reducers.bytes.per.reducer could be implemented. One idea is to get the total size of all target blocks, then set the number of partitions. -- View this message in
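
Until something like that exists, a common stopgap is to coalesce to a computed partition count before insertInto, assuming your Spark version's SchemaRDD preserves its schema through coalesce (recent 1.x releases override it to do so). The input-size value and the 256 MB target below are arbitrary placeholders:

    // Hypothetical: totalBytes would come from e.g. FileSystem.getContentSummary on the input paths.
    val totalBytes: Long = 64L * 1024 * 1024 * 1024     // placeholder: 64 GB of input
    val targetBytesPerFile = 256L * 1024 * 1024          // ~256 MB per output file (arbitrary)
    val numPartitions = math.max(1, (totalBytes / targetBytesPerFile).toInt)

    schemaRDD.coalesce(numPartitions).insertInto("target_table")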

Re: Low Level Kafka Consumer for Spark

2014-09-08 Thread Tim Smith
Thanks TD. Someone already pointed out to me that repartition(...) isn't the right way. You have to use val partedStream = repartition(...). Would be nice to have it fixed in the docs. On Fri, Sep 5, 2014 at 10:44 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Some thoughts on this
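
In other words, repartition is a transformation that returns a new DStream rather than mutating the one it is called on; the stream name and partition count below are only illustrative:

    // Wrong: the result is discarded, so the original stream keeps its partitioning.
    stream.repartition(16)

    // Right: keep the returned DStream and use it downstream.
    val repartitioned = stream.repartition(16)
    repartitioned.foreachRDD(rdd => println(rdd.partitions.length))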

Re: Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Ian O'Connell
Mmm how many days' worth of data / how deep is your data nesting? I suspect you're running into a current issue with parquet (a fix is in master but I don't believe released yet..). It reads all the metadata to the submitter node as part of scheduling the job. This can cause long start times (timeouts

[Spark Streaming] java.lang.OutOfMemoryError: GC overhead limit exceeded

2014-09-08 Thread Yan Fang
Hi guys, My Spark Streaming application has this java.lang.OutOfMemoryError: GC overhead limit exceeded error in the SparkStreaming driver program. I have done the following to debug it: 1. increased the driver memory from 1GB to 2GB; this error came after 22 hrs. When the memory was 1GB, it

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

2014-09-08 Thread Marcelo Vanzin
On Mon, Sep 8, 2014 at 3:54 PM, Dimension Data, LLC. subscripti...@didata.us wrote: You're probably right about the above because, as seen *below* for pyspark (but probably for other Spark applications too), once '-Dspark.master=[yarn-client|yarn-cluster]' is specified, the app invocation

Spark Web UI in Mesos mode

2014-09-08 Thread SK
Hi, I am running Spark 1.0.2 on a cluster in Mesos mode. I am not able to access the Spark master Web UI at port 8080 but am able to access it at port 5050. Is 5050 the standard port? Also, in the the standalone mode, there is a link to the Application detail UI directly from the master UI. I

Re: Spark Web UI in Mesos mode

2014-09-08 Thread Wonha Ryu
Hi, Spark master web UI is only for standalone clusters, where cluster resources are managed by Spark, not other resource managers. Mesos master's default port is 5050. Within Mesos, a Spark application is considered as one of many frameworks, so there's no Spark-specific support like accessing

Executor address issue: CANNOT FIND ADDRESS (Spark 0.9.1)

2014-09-08 Thread Nicolas Mai
Hi, One of the executors in my spark cluster shows a CANNOT FIND ADDRESS address, for one of the stages which failed. After that stage, I got cascading failures for all my stages :/ (stages that seem complete but still appear as active stages in the dashboard; incomplete or failed stages that are

Is the structure for a jar file for running Spark applications the same as that for Hadoop

2014-09-08 Thread Steve Lewis
In a Hadoop jar there is a directory called lib, and all non-provided third-party jars go there and are included in the class path of the code. Do jars for Spark have the same structure? Another way to ask the question: if I have code to execute Spark and a jar built for Hadoop, can I simply use

Re: Multi-tenancy for Spark (Streaming) Applications

2014-09-08 Thread Tobias Pfeiffer
Hi, On Thu, Sep 4, 2014 at 10:33 AM, Tathagata Das tathagata.das1...@gmail.com wrote: In the current state of Spark Streaming, creating separate Java processes each having a streaming context is probably the best approach to dynamically adding and removing of input sources. All of these

Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Ron, On Tue, Sep 9, 2014 at 11:27 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid wrote: I’m trying to figure out how I can run Spark Streaming like an API. The goal is to have a synchronous REST API that runs the spark data flow on YARN. I guess I *may* develop something similar in the

RE: Setting Kafka parameters in Spark Streaming

2014-09-08 Thread Shao, Saisai
Hi Hemanth, I think there is a bug in this API in Spark 0.8.1, so you will meet this exception when using Java code with this API, this bug is fixed in latest version, as you can see the patch (https://github.com/apache/spark/pull/1508). But it’s only for Kafka 0.8+, as you still use kafka

Re: Spark streaming for synchronous API

2014-09-08 Thread Ron's Yahoo!
Tobias, Let me explain a little more. I want to create a synchronous REST API that will process some data that is passed in as some request. I would imagine that the Spark Streaming Job on YARN is a long running job that waits on requests from something. What that something is is still not

Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Hi, On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! zlgonza...@yahoo.com wrote: I want to create a synchronous REST API that will process some data that is passed in as some request. I would imagine that the Spark Streaming Job on YARN is a long running job that waits on requests from

Re: How to profile a spark application

2014-09-08 Thread rapelly kartheek
hi Ted, Where do I find the licence keys that I need to copy to the licences directory. Thank you!! On Mon, Sep 8, 2014 at 8:25 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Thank you Ted. regards Karthik On Mon, Sep 8, 2014 at 3:33 PM, Ted Yu yuzhih...@gmail.com wrote: See

Re: Spark streaming for synchronous API

2014-09-08 Thread Ron's Yahoo!
Hi Tobias, So I guess where I was coming from was the assumption that starting up a new job to be listening on a particular queue topic could be done asynchronously. For example, let’s say there’s a particular topic T1 in a Kafka queue. If I have a new set of requests coming from a

Re: Setting Kafka parameters in Spark Streaming

2014-09-08 Thread Hemanth Yamijala
Thanks, Shao, for providing the necessary information. Hemanth On Tue, Sep 9, 2014 at 8:21 AM, Shao, Saisai saisai.s...@intel.com wrote: Hi Hemanth, I think there is a bug in this API in Spark 0.8.1, so you will meet this exception when using Java code with this API, this bug is fixed in

Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Hi, On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! zlgonza...@yahoo.com wrote: So I guess where I was coming from was the assumption that starting up a new job to be listening on a particular queue topic could be done asynchronously. No, with the current state of Spark Streaming, all data

Re: Crawler and Scraper with different priorities

2014-09-08 Thread Sandeep Singh
Hi Daniil, I have to do some processing of the results, as well as pushing the data to the front end. Currently I'm using akka for this application, but I was thinking maybe spark streaming would be a better thing to do, as well as I can use mllib for processing the results. Any specific reasons

RE: Setting Kafka parameters in Spark Streaming

2014-09-08 Thread Shao, Saisai
As you mentioned you hope to transplant the latest version of Spark onto Kafka 0.7 in another mail, there are some notes you should take care of: 1. Kafka 0.7+ can only be compiled with Scala 2.8, while now Spark is compiled with Scala 2.10; there is no binary compatibility between these two Scala