Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-06 Thread Davies Liu
There is a PR to fix this: https://github.com/apache/spark/pull/1802 On Tue, Aug 5, 2014 at 10:11 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: I concur that printSchema works; the trouble only seems to happen with operations that actually use the data. Thanks for posting the bug. -Brad

Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
Hi, I'm running Spark in an EMR cluster and I'm able to read from S3 using REPL without problems: val input_file = "s3://bucket-name/test_data.txt" val rawdata = sc.textFile(input_file) val test = rawdata.collect but when I try to run a simple standalone application reading the same data, I

Re: Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
I'm getting the same Input path does not exist error also after setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using the format s3://bucket-name/test_data.txt for the input file. -- View this message in context:

Re: Problem reading from S3 in standalone application

2014-08-06 Thread Evan Sparks
Try s3n:// On Aug 6, 2014, at 12:22 AM, sparkuser2345 hm.spark.u...@gmail.com wrote: I'm getting the same Input path does not exist error also after setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and using the format s3://bucket-name/test_data.txt for the
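For reference, a minimal sketch of the standalone-app case using the suggested s3n:// scheme (the fs.s3n.* keys are the standard Hadoop property names; the app skeleton is illustrative, not the original poster's code):

    import org.apache.spark.{SparkConf, SparkContext}

    object S3ReadExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("S3ReadExample"))
        // Standalone apps don't always pick up credentials the way the REPL does,
        // so pass the keys to the Hadoop configuration explicitly (here from env vars).
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
        // Note the s3n:// scheme rather than s3://
        val rawdata = sc.textFile("s3n://bucket-name/test_data.txt")
        println(rawdata.count())
        sc.stop()
      }
    }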

Re: Setting spark.executor.memory problem

2014-08-06 Thread Grzegorz Białek
Hi Andrew, Thank you very much for your solution, it works like a charm, and for very clear explanation. Grzegorz

fail to run LBFGS in 5G KDD data in spark 1.0.1?

2014-08-06 Thread Lizhengbing (bing, BIPA)
1 I don't use spark-submit to run my program; I use the spark context directly: val conf = new SparkConf() .setMaster("spark://123d101suse11sp3:7077") .setAppName("LBFGS") .set("spark.executor.memory", "30g") .set("spark.akka.frameSize", "20") val sc = new

Spark GraphX remembering old message

2014-08-06 Thread Arun Kumar
Hi, I am trying to find belief for a graph using the GraphX Pregel implementation. My use case: vertices 2, 3, 4 send messages m2, m3, m4 to vertex 6. At vertex 6 I multiply all the messages (m2*m3*m4) = m6, and then from vertex 6 the message (m6/m2) will be sent to vertex 2, m6/m3 to

can't submit my application on standalone spark cluster

2014-08-06 Thread Andres Gomez Ferrer
Hi all, My name is Andres and I'm starting to use Apache Spark. I try to submit my spark.jar to my cluster using this: spark-submit --class net.redborder.spark.RedBorderApplication --master spark://pablo02:7077 redborder-spark-selfcontained.jar But when I did it, my worker died, and my

Re: can't submit my application on standalone spark cluster

2014-08-06 Thread Akhil Das
Looks like a netty conflict there; most likely you are having multiple versions of netty jars (e.g. netty-3.6.6.Final.jar, netty-3.2.2.Final.jar, netty-all-4.0.13.Final.jar). You only require 3.6.6, I believe. A quick fix would be to remove the rest of them. Thanks Best Regards On Wed, Aug 6, 2014

Re: Problem reading from S3 in standalone application

2014-08-06 Thread sparkuser2345
Evan R. Sparks wrote: Try s3n:// Thanks, that works! In the REPL, I can successfully load the data using both s3:// and s3n:// -- why the difference? -- View this message in context:

Spark SQL (version 1.1.0-SNAPSHOT) should allow SELECT with duplicated columns

2014-08-06 Thread Jianshi Huang
Spark reported error java.lang.IllegalArgumentException with messages: java.lang.IllegalArgumentException: requirement failed: Found fields with the same name. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.catalyst.types.StructType.<init>(dataTypes.scala:317)

Re: Spark stream data from kafka topics and output as parquet file on HDFS

2014-08-06 Thread Mahebub Sayyed
Hello, I have referred to https://github.com/dibbhatt/kafka-spark-consumer and I have successfully consumed tuples from kafka. Tuples are JSON objects and I want to store those objects in HDFS in Parquet format. Please suggest me any sample example for that. Thanks in advance. On Tue, Aug

Re: Save an RDD to a SQL Database

2014-08-06 Thread Ron Gonzalez
Hi Vida, It's possible to save an RDD as a hadoop file using hadoop output formats. It might be worthwhile to investigate using DBOutputFormat and see if this will work for you. I haven't personally written to a db, but I'd imagine this would be one way to do it. Thanks, Ron Sent from my

Re: Problem reading from S3 in standalone application

2014-08-06 Thread Nicholas Chammas
See here: https://wiki.apache.org/hadoop/AmazonS3 s3:// refers to a block storage system and is deprecated (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html). Use s3n:// for regular files you can see in the S3 web console. Nick On Wed, Aug 6, 2014 at

[Streaming] updateStateByKey trouble

2014-08-06 Thread Yana Kadiyska
Hi folks, hoping someone who works with Streaming can help me out. I have the following snippet: val stateDstream = data.map(x => (x, 1)) .updateStateByKey[State](updateFunc) stateDstream.saveAsTextFiles(checkpointDirectory, partitions_test) where data is an RDD of case class
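For anyone hitting this, a self-contained sketch of updateStateByKey with the pieces it needs to run (the socket source, the running-count logic, and the paths are placeholders, not the poster's code; note that a checkpoint directory must be set for the state DStream):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object RunningCount {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("RunningCount"), Seconds(10))
        // Required by updateStateByKey: the state DStream is checkpointed here.
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

        val words = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))

        // Merge the new values for a key with its previous state (a running count).
        val updateFunc = (newValues: Seq[Int], state: Option[Long]) =>
          Some(state.getOrElse(0L) + newValues.sum)

        val counts = words.updateStateByKey[Long](updateFunc)
        counts.saveAsTextFiles("hdfs:///tmp/running-counts", "txt")

        ssc.start()
        ssc.awaitTermination()
      }
    }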

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-06 Thread Nicholas Chammas
Nice catch Brad and thanks to Yin and Davies for getting on it so quickly. On Wed, Aug 6, 2014 at 2:45 AM, Davies Liu dav...@databricks.com wrote: There is a PR to fix this: https://github.com/apache/spark/pull/1802 On Tue, Aug 5, 2014 at 10:11 PM, Brad Miller bmill...@eecs.berkeley.edu

Re: fail to run LBFGS in 5G KDD data in spark 1.0.1?

2014-08-06 Thread Xiangrui Meng
Do you mind testing 1.1-SNAPSHOT and allocating more memory to the driver? I think the problem is with the feature dimension. KDD data has more than 20M features and in v1.0.1, the driver collects the partial gradients one by one, sums them up, does the update, and then sends the new weights back

Re: Save an RDD to a SQL Database

2014-08-06 Thread Yana
Hi Vida, I am writing to a DB -- or trying to :). I believe the best practice for this (you can search the mailing list archives) is to do a combination of mapPartitions and use a grouped iterator. Look at this thread, esp. the comment from A. Boisvert and Matei's comment above it:
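A rough sketch of the pattern Yana describes -- one connection per partition plus a grouped iterator for batched inserts (the table name, JDBC URL, and record type are assumptions for illustration; the JDBC driver jar must be available on the executors):

    import java.sql.DriverManager
    import org.apache.spark.rdd.RDD

    def saveToDb(records: RDD[(Int, String)], jdbcUrl: String): Unit = {
      records.foreachPartition { partition =>
        // Open one connection per partition, not one per record.
        val conn = DriverManager.getConnection(jdbcUrl)
        val stmt = conn.prepareStatement("INSERT INTO my_table (id, value) VALUES (?, ?)")
        try {
          // Grouped iterator: push batches of 1000 rows at a time.
          partition.grouped(1000).foreach { batch =>
            batch.foreach { case (id, value) =>
              stmt.setInt(1, id)
              stmt.setString(2, value)
              stmt.addBatch()
            }
            stmt.executeBatch()
          }
        } finally {
          stmt.close()
          conn.close()
        }
      }
    }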

PySpark, numpy arrays and binary data

2014-08-06 Thread Rok Roskar
Hello, I'm interested in getting started with Spark to scale our scientific analysis package (http://pynbody.github.io) to larger data sets. The package is written in Python and makes heavy use of numpy/scipy and related frameworks. I've got a couple of questions that I have not been able to

Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
My company is leaning towards moving much of their analytics work from our own Spark/Mesos/HDFS/Cassandra set up to RedShift. To date, I have been the internal advocate for using Spark for analytics, but a number of good points have been brought up to me. The reasons being pushed are: -

Re: Spark shell creating a local SparkContext instead of connecting to connecting to Spark Master

2014-08-06 Thread Aniket Bhatnagar
Thanks. This worked :). I am thinking I should add this in spark-env.sh so that spark-shell always connects to the master by default. On Aug 6, 2014 12:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can always start your spark-shell by specifying the master as

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Nicholas Chammas
1) We get tooling out of the box from RedShift (specifically, stable JDBC access) - Spark we often are waiting for devops to get the right combo of tools working or for libraries to support sequence files. The arguments about JDBC access and simpler setup definitely make sense. My first

Re: PySpark, numpy arrays and binary data

2014-08-06 Thread Davies Liu
numpy arrays can only support basic types, so we cannot use them during collect() by default. Could you give a short example of how numpy arrays are used in your project? On Wed, Aug 6, 2014 at 8:41 AM, Rok Roskar rokros...@gmail.com wrote: Hello, I'm interested in getting started with Spark

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread durin
Update: I can get it to work by disabling iptables temporarily. However, I cannot figure out on which port I have to accept traffic. 4040 and any of the Master or Worker ports mentioned in the previous post don't work. Can it be one of the randomly assigned ones in the 30k to 60k range? Those

RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Just to point out that the benchmark you point to has Redshift running on HDD machines instead of SSD, and it is still faster than Shark in all but one case. Like Gary, I'm also interested in replacing something we have on Redshift with Spark SQL, as it will give me much greater capability to

Re: can't submit my application on standalone spark cluster

2014-08-06 Thread Andrew Or
Hi Andres, If you're using the EC2 scripts to start your standalone cluster, you can use ~/spark-ec2/copy-dir --delete ~/spark to sync your jars across the cluster. Note that you will need to restart the Master and the Workers afterwards through sbin/start-all.sh and sbin/stop-all.sh. If you're

Re: Submitting to a cluster behind a VPN, configuring different IP address

2014-08-06 Thread nunarob
Hi, I'm having the exact same problem - I'm on a VPN and I'm trying to set the properties spark.httpBroadcast.uri and spark.fileserver.uri so that they bind to my VPN IP instead of my regular network IP. Were you ever able to get this working? Cheers, -Rob -- View this message in

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread Andrew Or
Hi Simon, The drivers and executors currently choose random ports to talk to each other, so the Spark nodes will have to have full TCP access to each other. This is changed in a very recent commit, where all of these random ports will become configurable:

GraphX Pagerank application

2014-08-06 Thread AlexanderRiggers
I want to use pagerank on a 3GB textfile, which contains a bipartite list with variables id and brand. Example: id,brand 86246,15343 86246,27873 86246,14647 86246,55172 86246,3293 86246,2820 86246,3830 86246,2820 86246,5603 86246,72482 To perform the page rank I have to create a graph object,
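A minimal sketch of one way to do this (it assumes the ids and brands live in disjoint numeric ranges so they can share GraphX's single vertex-id space, and that the header line is literally "id,brand"; the path and top-10 output are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object BrandPageRank {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("BrandPageRank"))
        // Drop the header and turn each "id,brand" line into an edge id -> brand.
        val edges = sc.textFile("hdfs:///path/to/bipartite.csv")
          .filter(_ != "id,brand")
          .map { line =>
            val Array(id, brand) = line.split(",")
            Edge(id.toLong, brand.toLong, 1)
          }
        // Vertices are created implicitly from the edge endpoints.
        val graph = Graph.fromEdges(edges, defaultValue = 1)
        val ranks = graph.pageRank(tol = 0.0001).vertices
        // Highest-ranked vertices first.
        ranks.map(_.swap).top(10).foreach(println)
        sc.stop()
      }
    }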

Re: Runnning a Spark Shell locally against EC2

2014-08-06 Thread Andrew Or
Hi Gary, This has indeed been a limitation of Spark, in that drivers and executors use random ephemeral ports to talk to each other. If you are submitting a Spark job from your local machine in client mode (meaning, the driver runs on your machine), you will need to open up all TCP ports from

Re: Runnning a Spark Shell locally against EC2

2014-08-06 Thread Gary Malouf
This will be awesome - it's been one of the major issues for our analytics team as they hope to use their own python libraries. On Wed, Aug 6, 2014 at 2:40 PM, Andrew Or and...@databricks.com wrote: Hi Gary, This has indeed been a limitation of Spark, in that drivers and executors use

Re: issue with spark and bson input

2014-08-06 Thread Dmitriy Selivanov
Finally I made it work. The trick was in the asSubclass method: val mongoRDD = sc.newAPIHadoopFile("file:///root/jobs/dump/input.bson", classOf[BSONFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, BSONObject]]), classOf[Object], classOf[BSONObject],

Re: Spark Streaming fails - where is the problem?

2014-08-06 Thread durin
Hi Andrew, for this test I only have one machine, which runs the master and the only worker. So all I'd need is communication to the Internet to access the Twitter API. I've tried assigning a specific port to the driver and creating iptables rules for this port, but that didn't work. Best

heterogeneous cluster hardware

2014-08-06 Thread anthonyjschu...@gmail.com
I'm sure this must be a fairly common use-case for spark, yet I have not found a satisfactory discussion of it on the spark website or forum: I work at a company with a lot of previous-generation server hardware sitting idle-- I want to add this hardware to my spark cluster to increase

Spark memory management

2014-08-06 Thread Gary Malouf
I have a few questions about managing Spark memory: 1) In a standalone setup, is there any CPU prioritization across users running jobs? If so, what is the behavior here? 2) With Spark 1.1, users will more easily be able to run drivers/shells from remote locations that do not cause firewall

Spark build error

2014-08-06 Thread Priya Ch
Hi, I am trying to build jars using the command : mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package Execution of the above command is throwing the following error: [INFO] Spark Project Core . FAILURE [ 0.295 s] [INFO] Spark Project Bagel

Re: Unit Test for Spark Streaming

2014-08-06 Thread JiajiaJing
Thank you TD, I have worked around that problem and now the test compiles. However, I don't actually see that test running, as when I do mvn test, it just says BUILD SUCCESS, without any TEST section on stdout. Are we supposed to use mvn test to run the test? Are there any other methods that can be

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
Forgot to cc the mailing list :) On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote: Agreed. Being able to use SQL to make a table, pass it to a graph algorithm, pass that output to a machine learning algorithm, being able to invoke user defined python

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Nicholas Chammas
On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote: Mostly I was just objecting to "Redshift does very well, but Shark is on par or better than it in most of the tests" when that was not how I read the results, and Redshift was on HDDs. My bad. You are

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
Also, regarding something like redshift not having MLlib built in, much of that could be done on the derived results. On Aug 6, 2014 4:07 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote: Mostly I was

UpdateStateByKey - How to improve performance?

2014-08-06 Thread Venkat Subramanian
The method def updateStateByKey[S: ClassTag] ( updateFunc: (Seq[V], Option[S]) => Option[S] ): DStream[(K, S)] takes a DStream (K,V) and produces a DStream (K,S) in Spark Streaming. We have an input DStream(K,V) that has 40,000 elements. We update on average 1000 of them in every 3

RE: Regarding tooling/performance vs RedShift

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Well yes, MLlib-like routines or pretty much anything else could be run on the derived results, but you have to unload the results from Redshift and then load them into some other tool. So it's nicer to leave them in memory and operate on them there. Major architectural advantage to Spark. Ron

Re: Writing to RabbitMQ

2014-08-06 Thread Tathagata Das
Yeah, I have observed this common problem in the design a number of times in the mailing list. Tobias, what you linked to is also an additional problem, that occurs with mapPartitions, but not with foreachPartitions (which is relevant here). But I do get your point. I think there was an attempt

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Nicholas Chammas
On Wed, Aug 6, 2014 at 4:30 PM, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote: Major architectural advantage to Spark. Amen to that. For a really cool and succinct demonstration of this, check out Aaron's demo http://youtu.be/sPhyePwo7FA?t=10m16s at the Hadoop Summit earlier this year

Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-06 Thread Shivaram Venkataraman
The output of lapply and lapplyPartition should be the same by design -- the only difference is that in lapply the user-defined function returns a row, while it returns a list in lapplyPartition. Could you give an example of a small input and output that you expect to see for the above program ?

Re: Spark streaming at-least once guarantee

2014-08-06 Thread Tathagata Das
Why can't you keep a persistent queue of S3 files to process? The process that you are running has two threads. Thread 1: Continuously gets SQS messages and writes to the queue. This queue is persisted to reliable storage, like HDFS / S3. Thread 2: Peeks into the queue, and whenever there is any

Re: UpdateStateByKey - How to improve performance?

2014-08-06 Thread Michael Malak
Depending on the density of your keys, the alternative signature def updateStateByKey[S](updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)], partitioner: Partitioner, rememberPartitioner: Boolean)(implicit arg0: ClassTag[S]): DStream[(K, S)] at least iterates by key rather than
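A self-contained sketch of that variant, with assumed key/state types (String keys, Long counts) and a socket source standing in for the real input; the function sees one (key, newValues, oldState) triple per key and can drop a key's state simply by not emitting it:

    import org.apache.spark.{HashPartitioner, SparkConf}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object IteratorStateUpdate {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("IteratorStateUpdate"), Seconds(3))
        ssc.checkpoint("hdfs:///tmp/state-checkpoint")

        val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

        // Called once per key that has new values or existing state in this batch.
        val updatePerKey = (iter: Iterator[(String, Seq[Long], Option[Long])]) =>
          iter.map { case (key, newValues, oldState) =>
            (key, oldState.getOrElse(0L) + newValues.sum)
          }

        val state = pairs.updateStateByKey(
          updatePerKey,
          new HashPartitioner(ssc.sparkContext.defaultParallelism),
          rememberPartitioner = true)
        state.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }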

Re: UpdateStateByKey - How to improve performance?

2014-08-06 Thread Tathagata Das
Hello Venkat, Your thoughts are quite spot on. The current implementation was designed to allow the functionality of timing out a state. For this to be possible, the update function needs to be called for each key even if there is no new data, so that the function can check things like last update

Re: Spark memory management

2014-08-06 Thread Andrew Or
Hey Gary, The answer to both of your questions is that much of it is up to the application. For (1), the standalone master can set spark.deploy.defaultCores to limit the number of cores each application can grab. However, the application can override this with the application-specific
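As a rough illustration of the application-side knobs being referred to (property names as of Spark 1.0; the values are arbitrary examples):

    import org.apache.spark.{SparkConf, SparkContext}

    // Set per-application limits before the SparkContext is created.
    val conf = new SparkConf()
      .setAppName("capped-app")
      .set("spark.cores.max", "8")        // cap on the total cores this app may take
      .set("spark.executor.memory", "4g") // memory requested per executor
    val sc = new SparkContext(conf)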

Re: Unit Test for Spark Streaming

2014-08-06 Thread Tathagata Das
Does it not show the name of the test suite on stdout, showing that it has passed? Can you try writing a small unit test, in the same way as your Kafka unit test, and with print statements on stdout ... to see whether it works? I believe it is some configuration issue in maven, which is hard

Re: Stopping StreamingContext does not kill receiver

2014-08-06 Thread lbustelo
I'm running on spark 1.0.0 and I see a similar problem when using the socketTextStream receiver. The ReceiverTracker task sticks around after a ssc.stop(false). -- View this message in context:

Re: Spark stream data from kafka topics and output as parquet file on HDFS

2014-08-06 Thread Tathagata Das
You can use Spark SQL for that very easily. You can take the RDDs you get from the Kafka input stream, convert them to RDDs of case classes, and save them as Parquet files. More information here: https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files On Wed, Aug 6, 2014 at 5:23
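A sketch along the lines TD describes, using the 1.0/1.1-era SchemaRDD API (the Event case class, its JSON fields, the topic, the ZooKeeper host, and the output path are all assumptions for illustration):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    case class Event(id: String, value: Double)

    object KafkaToParquet {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("KafkaToParquet"), Seconds(10))
        val sqlContext = new SQLContext(ssc.sparkContext)
        import sqlContext.createSchemaRDD  // implicit: RDD[case class] -> SchemaRDD

        val messages = KafkaUtils.createStream(ssc, "zkhost:2181", "parquet-writer", Map("topic" -> 1))

        // Parse each JSON message into an Event using the Scala 2.10 built-in parser
        // (assumes flat objects with "id" and "value" fields; bad records are dropped).
        val events = messages.map(_._2).flatMap { json =>
          scala.util.parsing.json.JSON.parseFull(json).collect {
            case fields: Map[String, Any] @unchecked =>
              Event(fields("id").toString, fields("value").toString.toDouble)
          }
        }

        // Write each non-empty batch as a separate Parquet directory on HDFS.
        events.foreachRDD { (rdd, time) =>
          if (rdd.count() > 0) {
            rdd.saveAsParquetFile(s"hdfs:///events/batch-${time.milliseconds}")
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }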

Re: Spark Streaming multiple streams problem

2014-08-06 Thread Tathagata Das
You probably have only 10 cores in your cluster on which you are executing your job. Since each DStream / receiver takes one core, the system is not able to start all of them and so everything is blocked. On Wed, Aug 6, 2014 at 3:08 AM, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote: Hi,

Re: Stopping StreamingContext does not kill receiver

2014-08-06 Thread Tathagata Das
Can you give the stack trace? This was the fix for the twitter stream. https://github.com/apache/spark/pull/1577/files You could try doing the same. TD On Wed, Aug 6, 2014 at 2:41 PM, lbustelo g...@bustelos.com wrote: I'm running on spark 1.0.0 and I see a similar problem when using the

Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-06 Thread salemi
Hi, I have a DStream called eventData and it contains a set of Data objects defined as follows: case class Data(startDate: Long, endDate: Long, className: String, id: String, state: String) What would the reducer and inverse reducer functions look like if I would like to add the data for current
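One hedged sketch of what such a pair of functions could look like, assuming the goal is a per-state count over the last 15 minutes (eventData and the Data case class are from the post above; checkpointing must be enabled for the inverse-reduce form):

    import org.apache.spark.streaming.{Minutes, Seconds}
    import org.apache.spark.streaming.StreamingContext._

    // eventData: DStream[Data] as defined above; ssc.checkpoint(...) must be set.
    val countsByState = eventData
      .map(d => (d.state, 1L))
      .reduceByKeyAndWindow(
        (a: Long, b: Long) => a + b,  // fold in batches entering the window
        (a: Long, b: Long) => a - b,  // subtract batches leaving the window
        Minutes(15),
        Seconds(3))
    countsByState.print()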

Naive Bayes parameters

2014-08-06 Thread SK
1) How is the minPartitions parameter in the NaiveBayes example used? What is the default value? 2) Why is numFeatures specified as a parameter? Can this not be obtained from the data? This parameter is not specified for the other MLlib algorithms. thanks -- View this message in context:
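A small sketch of where those parameters live, assuming the standard LIBSVM-format example data (API as of Spark 1.0/1.1): minPartitions and numFeatures belong to the data-loading step rather than to NaiveBayes itself.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.util.MLUtils

    object NaiveBayesParams {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("NaiveBayesParams"))
        // When numFeatures is omitted, loadLibSVMFile infers it with an extra pass
        // over the data; minPartitions falls back to the context's default.
        val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

        val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
        val model = NaiveBayes.train(training, lambda = 1.0)
        val accuracy = test.map(p => (model.predict(p.features), p.label))
          .filter { case (prediction, label) => prediction == label }
          .count().toDouble / test.count()
        println(s"accuracy: $accuracy")
        sc.stop()
      }
    }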

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-06 Thread Tathagata Das
Why isn't a simple window function sufficient? eventData.window(Minutes(15), Seconds(3)) will keep generating RDDs every 3 seconds, each containing the last 15 minutes of data. TD On Wed, Aug 6, 2014 at 3:43 PM, salemi alireza.sal...@udo.edu wrote: Hi, I have a DStream called eventData and it

Trying to make sense of the actual executed code

2014-08-06 Thread Tom
Hi, I am trying to understand, for instance, the following SQL query in Spark 1.1: SELECT table.key, table.value, table2.value FROM table2 JOIN table WHERE table2.key = table.key When I look at the output, I see that there are several stages, and several tasks per stage. The tasks have a TID; I do not

Using Python IDE for Spark Application Development

2014-08-06 Thread Sathish Kumaran Vairavelu
Hello, I am trying to use the python IDE PyCharm for Spark application development. How can I use pyspark with Python IDE? Can anyone help me with this? Thanks Sathish

Re: Using Python IDE for Spark Application Development

2014-08-06 Thread Mohit Singh
My naive set up: adding os.environ['SPARK_HOME'] = "/path/to/spark" and sys.path.append("/path/to/spark/python") on top of my script, then from pyspark import SparkContext and from pyspark import SparkConf. Execution works from within PyCharm... Though my next step is to figure out autocompletion and I bet there

Hive 0.11 / CDH 4.6 / Spark 0.9.1 dilemma

2014-08-06 Thread Anurag Tangri
I posted this in the cdh-user mailing list yesterday and think this should have been the right audience for this: = Hi All, Not sure if anyone else faced this same issue or not. We installed CDH 4.6, which uses Hive 0.10. And we have Spark 0.9.1, which comes with Hive 0.11. Now our hive jobs that

Re: Hive 0.11 / CDH 4.6 / Spark 0.9.1 dilemma

2014-08-06 Thread Sean Owen
I haven't tried any of this, mind you, but my guess is that your options, from least painful and most likely to work onwards, are: - Get Spark / Shark to compile against Hive 0.10 - Shade Hive 0.11 into Spark - Update to CDH 5.0+ I don't think there will be more updated releases of Shark or

Re: RDD to DStream

2014-08-06 Thread Tathagata Das
Hey Aniket, Great thoughts! I understand the use case. But as you have realized yourself, it is not trivial to cleanly stream an RDD as a DStream. Since RDD operations are defined to be scan based, it is not efficient to define an RDD based on slices of data within a partition of another RDD, using

Re: Trying to make sense of the actual executed code

2014-08-06 Thread Michael Armbrust
This is maybe not exactly what you are asking for, but you might consider looking at the queryExecution (a developer API that shows how the query is analyzed / executed) sql(...).queryExecution On Wed, Aug 6, 2014 at 3:55 PM, Tom thubregt...@gmail.com wrote: Hi, I am trying to look at for

Fwd: Trying to make sense of the actual executed code

2014-08-06 Thread Tobias Pfeiffer
(Forgot to include the mailing list in my reply. Here it is.) Hi, On Thu, Aug 7, 2014 at 7:55 AM, Tom thubregt...@gmail.com wrote: When I look at the output, I see that there are several stages, and several tasks per stage. The tasks have a TID, I do not see such a thing for a stage. They

Re: Stopping StreamingContext does not kill receiver

2014-08-06 Thread Tathagata Das
I narrowed down the error. Unfortunately this is not a quick fix. I have opened a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2892 On Wed, Aug 6, 2014 at 3:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Okay let me give it a shot. On Wed, Aug 6, 2014 at 3:57 PM,

Re: Regularization parameters

2014-08-06 Thread Burak Yavuz
Hi, That is interesting. Would you please share some code on how you are setting the regularization type, regularization parameters and running Logistic Regression? Thanks, Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Wednesday,
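For reference (not SK's actual code), a minimal sketch of how the regularization type and parameter are typically set on the 1.0-era API, via the optimizer of LogisticRegressionWithSGD; the data path and values are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.optimization.L1Updater
    import org.apache.spark.mllib.util.MLUtils

    object RegularizedLR {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RegularizedLR"))
        val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

        val lr = new LogisticRegressionWithSGD()
        lr.optimizer
          .setNumIterations(100)
          .setStepSize(1.0)
          .setRegParam(0.1)          // regularization strength
          .setUpdater(new L1Updater) // L1 penalty; SquaredL2Updater gives L2
        val model = lr.run(data)
        println(s"weights: ${model.weights}")
        sc.stop()
      }
    }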

PySpark + executor lost

2014-08-06 Thread Avishek Saha
Hi, I get a lot of executor lost errors for saveAsTextFile with PySpark and Hadoop 2.4. For small datasets this error occurs, but since the dataset is small it eventually gets written to the file. For large datasets, it takes forever to write the final output. Any help is appreciated. Avishek

Re: Using Python IDE for Spark Application Development

2014-08-06 Thread Sathish Kumaran Vairavelu
Mohit, This doesn't seem to be working; can you please provide more details? When I use from pyspark import SparkContext, it is disabled in PyCharm. I use PyCharm Community Edition. Where should I set the environment variables, in the same python script or a different python script? Also, should I run

memory issue on standalone master

2014-08-06 Thread BQ
Hi There, I'm starting to use Spark and got a rookie problem. I used standalone mode with the master only, and here is what I did: ./sbin/start-master.sh ./bin/pyspark When I tried the wordcount.py example, whose input file is a bit big, I got the out of memory error, which I excerpted

Re: Regularization parameters

2014-08-06 Thread Mohit Singh
One possible straightforward explanation might be that your solution(s) might be stuck in a local minimum? And depending on your weights initialization, you are getting different parameters? Maybe use the same initial weights for both runs... or I would probably test the execution with synthetic

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-06 Thread salemi
Hi, The reason I am looking to do it differently is because the latency and batch processing times are bad, about 40 sec. I took the times from the Streaming UI. As you suggested I tried the window as below and still the times are bad. val dStream = KafkaUtils.createStream(ssc, zkQuorum, group,

Column width limits?

2014-08-06 Thread Daniel, Ronald (ELS-SDG)
Assume I want to make a PairRDD whose keys are S3 URLs and whose values are Strings holding the contents of those (UTF-8) files, but NOT split into lines. Are there length limits on those files/Strings? 1 MB? 16 MB? 4 GB? 1 TB? Similarly, can such a thing be registered as a table so that I can
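A sketch of one way to build such a pair RDD and expose it to SQL, assuming a 1.1-era SQLContext (the bucket path and table name are illustrative): sc.wholeTextFiles returns (path, contents) pairs without splitting into lines, so the practical limit is executor memory rather than a fixed column width.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Doc(url: String, body: String)

    object WholeFiles {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WholeFiles"))
        // Each value is the entire (UTF-8) file as one String, not split into lines.
        val files = sc.wholeTextFiles("s3n://bucket-name/docs/")

        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD
        val docs = files.map { case (url, body) => Doc(url, body) }
        docs.registerTempTable("docs")  // registerAsTable in pre-1.1 releases
        sqlContext.sql("SELECT url FROM docs").collect().foreach(println)
        sc.stop()
      }
    }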

Re: fail to run LBFGS in 5G KDD data in spark 1.0.1?

2014-08-06 Thread Lizhengbing (bing, BIPA)
I have tested it in spark-1.1.0-SNAPSHOT. It is ok now. From: Xiangrui Meng [mailto:men...@gmail.com] Sent: August 6, 2014 23:12 To: Lizhengbing (bing, BIPA) Cc: user@spark.apache.org Subject: Re: fail to run LBFGS in 5G KDD data in spark 1.0.1? Do you mind testing 1.1-SNAPSHOT and allocating more memory to

spark-cassandra-connector issue

2014-08-06 Thread Gary Zhao
Hello, I'm trying to modify the Spark sample app to integrate with Cassandra; however, I saw an exception when submitting the app. Does anyone know why this happens? Exception in thread main java.lang.NoClassDefFoundError: com/datastax/spark/connector/rdd/reader/RowReaderFactory at