Can the Kafka 0.10 Streams API read a topic from a Kafka 0.9 cluster?
Hello, I have a program which requires the 0.10.1.0 Streams API. The jar is packaged by Maven with all dependencies. I tried to consume a Kafka topic from a Kafka 0.9 cluster and got this error: org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'topic_metadata': Error reading array of size 1768180577, only 167 bytes available. I wonder if there is any workaround? Thanks, J
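For context, a minimal sketch (Scala calling the 0.10.1.0 Java API) of the kind of Streams app in question; the app id, broker address, and topic are placeholders. As far as I know the 0.10.x Streams client only speaks the 0.10 wire protocol, which is why pointing it at a 0.9 broker garbles topic_metadata as above:

    import java.util.Properties
    import org.apache.kafka.common.serialization.Serdes
    import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
    import org.apache.kafka.streams.kstream.KStreamBuilder

    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app")
    // The 0.10.x client expects a 0.10+ broker here; a 0.9 broker
    // answers with a metadata format the client cannot parse
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
    props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)
    props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass.getName)

    val builder = new KStreamBuilder()
    builder.stream[String, String]("my-topic").print()
    new KafkaStreams(builder, props).start()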
Why do Spark and Kafka always crash?
How can I prevent it?
Suggested configuration for debugging Spark Streaming and Kafka?
Hi, I have an Ubuntu box with 4GB memory and two cores. Do you think that won't be enough to run Spark Streaming and Kafka? I am trying to install standalone-mode Spark and Kafka so I can debug them in an IDE. Do I need to install Hadoop? Thanks! J
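For anyone in the same spot, a minimal sketch of the kind of local-mode setup that should be debuggable straight from an IDE, with no cluster and no Hadoop install; the app name and batch interval are arbitrary, and 4GB with two cores should be workable for small test streams:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // local[2]: at least two threads, one for a receiver and one for
    // processing; nothing outside the JVM is required in this mode
    val conf = new SparkConf().setMaster("local[2]").setAppName("debug-streaming")
    val ssc = new StreamingContext(conf, Seconds(5))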
Global variable in Spark Streaming with no dependency on key
Hi gurus, Please help. But please don't tell me to use updateStateByKey, because I need a global variable (something like the clock time) across the micro-batches that does not depend on key. For my case, it is not acceptable to maintain a state for each key, since each key arrives at a different time. Yes, my global variable is related to time, but I cannot use the machine clock. Any hint? Or is this lack of a global variable by design? Thanks!
Re: Global variable in Spark Streaming with no dependency on key
Thanks. I tried. The problem is that I have to use updateStateByKey to maintain other states related to keys, and I am not sure how to pass this accumulator variable into updateStateByKey. On Tue, Aug 18, 2015 at 2:17 AM, Hemant Bhanawat hemant9...@gmail.com wrote: See if SparkContext.accumulator helps. On Tue, Aug 18, 2015 at 2:27 PM, Joanne Contact joannenetw...@gmail.com wrote: Hi gurus, Please help. But please don't tell me to use updateStateByKey, because I need a global variable (something like the clock time) across the micro-batches that does not depend on key. For my case, it is not acceptable to maintain a state for each key, since each key arrives at a different time. Yes, my global variable is related to time, but I cannot use the machine clock. Any hint? Or is this lack of a global variable by design? Thanks!
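On the accumulator point: as far as I know an accumulator's value can only be read on the driver, not inside the update function. One workaround sketch (Spark 1.x, with stream and state types invented for illustration) is to keep the global as a plain driver-side variable and attach it to each record in transform, whose closure is re-evaluated on the driver once per micro-batch:

    import org.apache.spark.streaming.dstream.DStream

    // driver-side "global"; update it once per batch (e.g. in foreachRDD)
    var globalClock: Long = 0L

    def withGlobal(pairs: DStream[(String, Int)]): DStream[(String, Long)] = {
      // transform's closure runs on the driver for every micro-batch,
      // so it captures the current value of globalClock each time
      val tagged = pairs.transform { rdd =>
        val clock = globalClock
        rdd.mapValues(v => (v, clock))
      }
      // updateStateByKey now sees the global value alongside each record
      // (a checkpoint directory must be set for stateful operations)
      tagged.updateStateByKey[Long] { (values: Seq[(Int, Long)], state: Option[Long]) =>
        val newest = (state.getOrElse(0L) +: values.map(_._2)).max
        Some(newest)
      }
    }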
How to convert a sequence of Timestamps to a DataFrame
Hi guys, I have struggled for a while on this seemingly simple thing: I have a sequence of timestamps and want to create a DataFrame with one column.

    //import collection.breakOut
    var seqTimestamp = scala.collection.Seq(listTs: _*)
    seqTimestamp: Seq[java.sql.Timestamp] = List(2015-07-22 16:52:00.0, 2015-07-22 16:53:00.0, ...)

I tried a lot of ways to create a DataFrame, and below is another failed way:

    import sqlContext.implicits._
    var rddTs = sc.parallelize(seqTimestamp)
    rddTs.toDF("minInterval")
    <console>:108: error: value toDF is not a member of org.apache.spark.rdd.RDD[java.sql.Timestamp]

So, could any guru please tell me how to do this? I am not familiar with Scala or Spark, and I wonder if learning Scala would help at all; it just sounds like a lot of trial and error and googling. Docs like
https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/sql/DataFrame.html
https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/sql/SQLContext.html#createDataFrame(scala.collection.Seq, scala.reflect.api.TypeTags.TypeTag)
do not help. Btw, I am using Spark 1.4. Thanks in advance, J
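For the record, a minimal sketch of one way this should work on Spark 1.4, reusing seqTimestamp from above: a bare java.sql.Timestamp has no toDF implicit, but a one-field tuple (a Product) does, so wrapping each element in Tuple1 is enough:

    import sqlContext.implicits._

    // Tuple1 makes each row a Product, which the toDF implicit accepts
    val df = sc.parallelize(seqTimestamp.map(Tuple1(_))).toDF("minInterval")
    df.printSchema()  // minInterval: timestamp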
java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.regression.LabeledPoint
Hello Sparkers, I kept getting this error: java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.regression.LabeledPoint. I have tried the following to convert v._1 to a double:

Method 1:

    (if (v._1 > 0) 1d else 0d)

Method 2:

    def bool2Double(b: Boolean): Double = { if (b) 1.0 else 0.0 }
    bool2Double(v._1 > 0)

Method 3:

    implicit def bool2Double(b: Boolean): Double = { if (b) 1.0 else 0.0 }

None of them works. Any advice would be appreciated. Thanks! J
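A cast error like this usually means the RDD still holds Tuple2s rather than LabeledPoints, so converting the label alone is not enough. A minimal sketch of mapping each tuple into an explicit LabeledPoint; the tuple layout here (a Boolean flag plus one feature) is a guess at the real data:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Hypothetical stand-in for the real pair RDD
    val pairs = sc.parallelize(Seq((true, 1.5), (false, 2.0)))

    // Build LabeledPoints explicitly instead of casting the Tuple2s
    val labeled = pairs.map { case (flag, x) =>
      LabeledPoint(if (flag) 1.0 else 0.0, Vectors.dense(x))
    }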
Unable to read Avro file
Hi, I am following the instructions on this website: http://www.infoobjects.com/spark-with-avro/. I installed the spark-avro library from https://github.com/databricks/spark-avro on a machine which only has the Hive gateway client role on a Hadoop cluster. Somehow I got an error reading the Avro file:

    scala> val ufos = sqlContext.avroFile("hdfs://sss/x/iii/iiiAaaaSs/hh//00/22/22/iiiAaaaSs.6.11.469465.2099222123.142749720.avro")
    <console>:20: error: erroneous or inaccessible type

Any advice please? Thank you! J
Re: Unable to read Avro file
Never mind. I found my Spark is still 1.2 but the Avro library requires 1.3. Will try again. On Fri, Mar 27, 2015 at 9:38 PM, Joanne Contact joannenetw...@gmail.com wrote: Hi, I am following the instructions on this website: http://www.infoobjects.com/spark-with-avro/. I installed the spark-avro library from https://github.com/databricks/spark-avro on a machine which only has the Hive gateway client role on a Hadoop cluster. Somehow I got an error reading the Avro file: scala> val ufos = sqlContext.avroFile("hdfs://sss/x/iii/iiiAaaaSs/hh//00/22/22/iiiAaaaSs.6.11.469465.2099222123.142749720.avro") <console>:20: error: erroneous or inaccessible type Any advice please? Thank you! J
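Once the versions match, a minimal sketch of the read, assuming Spark 1.3+ with the corresponding spark-avro build; the hdfs path is a placeholder for the real one above:

    // avroFile comes from the spark-avro implicits on SQLContext
    import com.databricks.spark.avro._

    val ufos = sqlContext.avroFile("hdfs://namenode/path/to/file.avro")
    ufos.printSchema()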
Is Ubuntu Server or Desktop better for a Spark cluster?
Hi gurus, I am trying to set up a real Linux machine (not a VM) where I will install Spark and also Hadoop. I plan on learning about clusters. I found that Ubuntu has desktop and server versions. Does it matter? Thanks!! J
Any API to load large data from the web into Cassandra?
Hello, I am new. I did not seem to find the answer after a brief search. Please help. Thanks! J
Any code to load large data from the web into Cassandra?
Thank you. I did not express my question clearly. I wonder if there is sample code to load any website's data into Cassandra? Say, this webpage http://datatomix.com/?p=84 seems to use Python with tweepy to call the Twitter API, get data in JSON format, and then load the data into Cassandra. So it seems tweepy is specific to the Twitter API. Is there code that works for any website? Btw, I am not familiar with Python yet, so the answer need not be limited to Python. Thanks! On Fri, Dec 26, 2014 at 12:46 PM, Keith Sterling keith.sterl...@first-utility.com wrote: Take a look at sstableloader. We use it to load 30+m rows into Cassandra. The Datastax documentation is a good start. -- Keith Sterling, Head of Software, first-utility.com On Fri, Dec 26, 2014 at 7:59 PM, Joanne Contact joannenetw...@gmail.com wrote: Hello, I am new. I did not seem to find the answer after a brief search. Please help. Thanks! J
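Since the question isn't Twitter-specific, one generic sketch: fetch a page over HTTP and insert it with the DataStax Java driver (usable from any JVM language). The keyspace, table, and contact point are placeholders and the table must already exist; this is just the shape of the approach, not the bulk-load path sstableloader covers:

    import com.datastax.driver.core.Cluster
    import scala.io.Source

    object WebToCassandra {
      def main(args: Array[String]): Unit = {
        val url = "http://datatomix.com/?p=84"
        // Pull the raw page body; for JSON APIs this would be the JSON text
        val body = Source.fromURL(url).mkString

        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
        val session = cluster.connect("my_keyspace")
        // Parameterized insert; column types assumed: text, timestamp, text
        session.execute(
          "INSERT INTO pages (url, fetched_at, body) VALUES (?, ?, ?)",
          url, new java.util.Date(), body)
        cluster.close()
      }
    }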
StreamingLinearRegressionWithSGD
Hi gurus, I have not looked at the code yet. I wonder if StreamingLinearRegressionWithSGD (http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/regression/StreamingLinearRegressionWithSGD.html) is equivalent to LinearRegressionWithSGD (http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/regression/LinearRegressionWithSGD.html) with the starting weights of the current batch set to the ending weights of the last batch? Since RidgeRegressionModel (http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/regression/RidgeRegressionModel.html) does not seem to have a streaming version, I just wonder if this approach will suffice. Thanks! J
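My reading of the docs is that yes: trainOn keeps refining one weight vector, each micro-batch running SGD from the weights the previous batch ended with. A minimal usage sketch (MLlib, Spark 1.x); trainingStream, testStream, and numFeatures are placeholders:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
    import org.apache.spark.streaming.dstream.DStream

    def run(trainingStream: DStream[LabeledPoint],
            testStream: DStream[LabeledPoint],
            numFeatures: Int): Unit = {
      val model = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(Vectors.zeros(numFeatures))
      // each micro-batch runs SGD starting from the current weights,
      // i.e. the warm-start behavior asked about
      model.trainOn(trainingStream)
      model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()
    }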
Is Spark Streaming + MLlib for online learning?
Hi gurus, Sorry for my naive question, I am new. I seem to remember reading somewhere that Spark itself is still batch learning, but Spark Streaming could allow online learning. I could not find this on the website now: http://spark.apache.org/docs/latest/streaming-programming-guide.html. I know MLlib uses incremental or iterative algorithms; I wonder if this is also true between batches of Spark Streaming. So the question is: say, when I call MLlib linear regression, does the training use one batch of data as training data? If yes, is the model update between batches already taken care of? That is, will the model eventually use all data that has arrived from the beginning until the current scoring time as training data, or only the data from the past limited number of batches? Many thanks! J
Re: Is Spark Streaming + MLlib for online learning?
Thank you Tobias! On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact joannenetw...@gmail.com wrote: I seemed to read somewhere that spark is still batch learning, but spark streaming could allow online learning. Spark doesn't do Machine Learning itself, but MLlib does. MLlib currently can do online learning only for linear regression https://spark.apache.org/docs/1.1.0/mllib-linear-methods.html#streaming-linear-regression, as far as I know. Tobias
Persist Kafka streams to a text file
Hello, I am trying to read a Kafka stream into a text file by running Spark from my IDE (IntelliJ IDEA). The code is similar to a previous thread on persisting a stream to a text file. I am new to Spark and Scala. I believe Spark is in local mode, as the console shows:

    14/11/21 14:17:11 INFO spark.SparkContext: Spark configuration: spark.app.name=local-mode

I got the following errors. They are related to Tachyon, but I don't know whether I have Tachyon or not.

    14/11/21 14:17:54 WARN storage.TachyonBlockManager: Attempt 1 to create tachyon dir null failed
    java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998 after 5 attempts
        at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
        at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
        at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
        at org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
        at org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
        at org.apache.spark.storage.TachyonBlockManager.<init>(TachyonBlockManager.scala:57)
        at org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:88)
        at org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:82)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:729)
        at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:594)
        at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: tachyon.org.apache.thrift.TException: Failed to connect to master localhost/127.0.0.1:19998 after 5 attempts
        at tachyon.master.MasterClient.connect(MasterClient.java:178)
        at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
        ... 28 more
    Caused by: tachyon.org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
        at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:185)
        at tachyon.org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
        at tachyon.master.MasterClient.connect(MasterClient.java:156)
        ... 29 more
    Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:180)
        ... 31 more
    14/11/21 14:17:54 ERROR storage.TachyonBlockManager: Failed 10 attempts to create tachyon dir in /tmp_spark_tachyon/spark-3dbec68b-f5b8-45e1-bb68-370439839d4a/driver

I looked at the code. It has the following part. Is that a problem?

    .persist(StorageLevel.OFF_HEAP)

Any advice? Thank you! J
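For reference: StorageLevel.OFF_HEAP is the Tachyon-backed storage level in this Spark version, so with no Tachyon master running on localhost:19998 it fails exactly as above. A minimal sketch of the same pipeline using a plain in-memory level instead; the ZooKeeper address, consumer group, topic, and output path are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setMaster("local[2]").setAppName("kafka-to-text")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Receiver-based Kafka stream: (ZooKeeper quorum, group id, topic -> threads)
    val lines = KafkaUtils.createStream(ssc, "localhost:2181", "my-group", Map("my-topic" -> 1))
      .map(_._2)
      .persist(StorageLevel.MEMORY_ONLY)  // MEMORY_ONLY needs no Tachyon, unlike OFF_HEAP

    lines.saveAsTextFiles("file:///tmp/kafka-out")  // one output directory per batch
    ssc.start()
    ssc.awaitTermination()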
Persist Kafka streams to a text file, Tachyon error?
Resending to the right email list. -- Forwarded message -- From: Joanne Contact joannenetw...@gmail.com Date: Fri, Nov 21, 2014 at 2:32 PM Subject: Persist Kafka streams to a text file To: u...@spark.incubator.apache.org

[Forwarded body identical to the previous post, "Persist Kafka streams to a text file", above, including the same Tachyon stack trace and the .persist(StorageLevel.OFF_HEAP) question.]