Can the Kafka 0.10 Streams API read a topic from a Kafka 0.9 cluster?

2016-12-22 Thread Joanne Contact
Hello, I have a program which requires the 0.10.1.0 Streams API. The jar is
packaged by Maven with all dependencies. I tried to consume a Kafka
topic from a Kafka 0.9 cluster.

It fails with this error:
 org.apache.kafka.common.protocol.types.SchemaException: Error reading
field 'topic_metadata': Error reading array of size 1768180577, only
167 bytes available

Is there any workaround?
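
For reference: as far as I know, 0.10.x clients (including the Streams API) cannot talk to 0.9 brokers, so the usual workaround is either upgrading the cluster or consuming with a 0.9.x client. A minimal sketch of the latter, assuming a kafka-clients 0.9.0.x dependency; the broker address and topic name are illustrative:

import java.util.{Arrays, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object Kafka09ConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker09:9092") // hypothetical 0.9 broker
    props.put("group.id", "compat-test")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("my-topic")) // hypothetical topic
    try {
      // Poll once and print whatever arrived; a real job would loop.
      val records = consumer.poll(1000)
      records.asScala.foreach(r => println(s"${r.offset}: ${r.value}"))
    } finally {
      consumer.close()
    }
  }
}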

Thanks,

J


Why do Spark and Kafka always crash?

2015-09-14 Thread Joanne Contact
How to prevent it?




Suggested configuration for debugging Spark Streaming and Kafka

2015-08-26 Thread Joanne Contact
Hi, I have an Ubuntu box with 4 GB of memory and dual cores. Will that be
enough to run Spark Streaming and Kafka? I want to install Spark and Kafka
in standalone mode so I can debug them in an IDE. Do I need to
install Hadoop?
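
For what it's worth, 4 GB and two cores are generally enough for small local-mode experiments, and Spark in local or standalone mode does not need a Hadoop installation. A minimal sketch of a streaming job runnable straight from the IDE; the socket source, host, and port are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LocalDebugSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing; no cluster or Hadoop needed.
    val conf = new SparkConf().setAppName("local-debug").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999) // simple source for stepping through in the IDE
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}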

Thanks!

J




global variable in spark streaming with no dependency on key

2015-08-18 Thread Joanne Contact
Hi Gurus,

Please help.

But please don't tell me to use updateStateByKey, because I need a
global variable (something like a clock time) across the micro-batches
that does not depend on key. In my case it is not acceptable to maintain
a state for each key, since each key arrives at a different time. My
global variable is related to time, but I cannot use the machine clock.

Any hint? Or is the lack of a global variable by design?

Thanks!




Re: global variable in spark streaming with no dependency on key

2015-08-18 Thread Joanne Contact
Thanks. I tried. The problem is that I have to use updateStateByKey to
maintain other state related to keys.

I am not sure how to pass this accumulator variable into updateStateByKey.
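
For reference, a minimal sketch of keeping a driver-side accumulator alongside updateStateByKey; the socket source, checkpoint path, and update logic are illustrative assumptions, not the actual job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("acc-sketch").setMaster("local[2]"), Seconds(5))
    ssc.checkpoint("/tmp/acc-sketch-checkpoint") // required by updateStateByKey
    val batchClock = ssc.sparkContext.accumulator(0, "batchClock")

    val pairs = ssc.socketTextStream("localhost", 9999).map(w => (w, 1))

    // Per-key state is kept by updateStateByKey as usual; the accumulator's value
    // is only reliably readable on the driver, e.g. inside foreachRDD.
    val counts = pairs.updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + values.sum)
    }
    counts.foreachRDD { rdd =>
      batchClock += 1 // driver-side update, once per micro-batch
      println(s"batch ${batchClock.value}: ${rdd.count()} keys")
    }
    ssc.start()
    ssc.awaitTermination()
  }
}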


On Tue, Aug 18, 2015 at 2:17 AM, Hemant Bhanawat hemant9...@gmail.com wrote:
 See if SparkContext.accumulator helps.

 On Tue, Aug 18, 2015 at 2:27 PM, Joanne Contact joannenetw...@gmail.com
 wrote:

 Hi Gurus,

 Please help.

 But please don't tell me to use updateStateByKey, because I need a
 global variable (something like a clock time) across the micro-batches
 that does not depend on key. In my case it is not acceptable to maintain
 a state for each key, since each key arrives at a different time. My
 global variable is related to time, but I cannot use the machine clock.

 Any hint? Or is the lack of a global variable by design?

 Thanks!







How to convert a sequence of Timestamps to a DataFrame

2015-07-31 Thread Joanne Contact
Hi Guys,

I have struggled for a while with this seemingly simple thing:

I have a sequence of timestamps and want to create a DataFrame with one column.

Seq[java.sql.Timestamp]

//import collection.breakOut

var seqTimestamp = scala.collection.Seq(listTs:_*)

seqTimestamp: Seq[java.sql.Timestamp] = List(2015-07-22 16:52:00.0,
2015-07-22 16:53:00.0, ...)

I tried a lot of ways to create a DataFrame, and below is another failed attempt:

import sqlContext.implicits._
var rddTs = sc.parallelize(seqTimestamp)
rddTs.toDF("minInterval")

<console>:108: error: value toDF is not a member of
org.apache.spark.rdd.RDD[java.sql.Timestamp]
       rddTs.toDF("minInterval")

So, could any guru please tell me how to do this?

I am not familiar with Scala or Spark. I wonder if learning Scala would
help at all; it just sounds like a lot of trial and error and googling.

Docs like
https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/sql/DataFrame.html
https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/sql/SQLContext.html#createDataFrame(scala.collection.Seq,
scala.reflect.api.TypeTags.TypeTag)
do not help.

Btw, I am using Spark 1.4.
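
For reference, a sketch of one approach that should work on Spark 1.4: wrap each timestamp in a Tuple1 so the implicit toDF conversion (which expects a Product) applies; the sample values and column name are illustrative:

import java.sql.Timestamp
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TimestampDfSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ts-df").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val listTs = Seq(Timestamp.valueOf("2015-07-22 16:52:00"), Timestamp.valueOf("2015-07-22 16:53:00"))
    // toDF needs a case class or tuple, not a bare Timestamp, so wrap each value in Tuple1.
    val df = sc.parallelize(listTs.map(Tuple1(_))).toDF("minInterval")
    df.show()
    sc.stop()
  }
}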

Thanks in advance,

J




java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.regression.LabeledPoint

2015-04-06 Thread Joanne Contact
Hello Sparkers,

I kept getting this error:

java.lang.ClassCastException: scala.Tuple2 cannot be cast to
org.apache.spark.mllib.regression.LabeledPoint

I have tried the following to convert v._1 to double:

Method 1:

(if(v._10) 1d else 0d)

Method 2:

def bool2Double(b:Boolean): Double = {
  if (b) 1.0
  else 0.0
}

bool2Double(v._10)

Method 3:
implicit def bool2Double(b:Boolean): Double = {
  if (b) 1.0
  else 0.0
}


None of them works.
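
For what it's worth, this ClassCastException usually means an RDD of tuples is being passed where MLlib expects RDD[LabeledPoint], so converting v._1 alone is not enough. A minimal sketch of that conversion, with hypothetical (label, features) pairs standing in for the real data:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LabeledPointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("labeled-point-sketch").setMaster("local[2]"))
    // Hypothetical (label, features) pairs standing in for the real data.
    val pairs = sc.parallelize(Seq((1.0, Array(0.5, 1.2)), (0.0, Array(2.0, 0.1))))
    // MLlib regression APIs expect RDD[LabeledPoint], not RDD of tuples, so convert explicitly.
    val labeled = pairs.map { case (label, features) => LabeledPoint(label, Vectors.dense(features)) }
    labeled.take(2).foreach(println)
    sc.stop()
  }
}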

Any advice would be appreciated.

Thanks!

J




unable to read avro file

2015-03-27 Thread Joanne Contact
Hi, I am following the instructions on this website:
http://www.infoobjects.com/spark-with-avro/

I installed the spark-avro library from https://github.com/databricks/spark-avro
on a machine which only has the Hive gateway client role on a Hadoop cluster.

Somehow I got an error when reading the avro file.

scala> val ufos =
sqlContext.avroFile("hdfs://sss/x/iii/iiiAaaaSs/hh//00/22/22/iiiAaaaSs.6.11.469465.2099222123.142749720.avro")

<console>:20: error: erroneous or inaccessible type
   val ufos =
sqlContext.avroFile("hdfs://sss/x/iii/iiiAaaaSs/hh//00/22/22/iiiAaaaSs.6.11.469465.2099222123.142749720.avro")

Any advice please?

Thank you!

J


Re: unable to read avro file

2015-03-27 Thread Joanne Contact
Never mind. I found my Spark is still 1.2 but the avro library requires 1.3.
I will try again.
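
For reference, once on Spark 1.3+, reading might look like the sketch below, assuming the spark-avro 1.0.0 artifact is on the classpath and its import adds avroFile to SQLContext; the path is illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._

object AvroReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-read").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    // avroFile comes from the spark-avro implicits and needs a Spark 1.3+ SQLContext.
    val ufos = sqlContext.avroFile("hdfs:///path/to/file.avro") // illustrative path
    ufos.printSchema()
    sc.stop()
  }
}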

On Fri, Mar 27, 2015 at 9:38 PM, Joanne Contact joannenetw...@gmail.com
wrote:

 Hi, I am following the instructions on this website:
 http://www.infoobjects.com/spark-with-avro/

 I installed the spark-avro library from https://github.com/databricks/spark-avro
 on a machine which only has the Hive gateway client role on a Hadoop cluster.

 Somehow I got an error when reading the avro file.

 scala> val ufos =
 sqlContext.avroFile("hdfs://sss/x/iii/iiiAaaaSs/hh//00/22/22/iiiAaaaSs.6.11.469465.2099222123.142749720.avro")

 <console>:20: error: erroneous or inaccessible type
    val ufos =
 sqlContext.avroFile("hdfs://sss/x/iii/iiiAaaaSs/hh//00/22/22/iiiAaaaSs.6.11.469465.2099222123.142749720.avro")

 Any advice please?

 Thank you!

 J



Is Ubuntu server or desktop better for a Spark cluster?

2015-02-14 Thread Joanne Contact
Hi gurus,

I am trying to set up a real Linux machine (not a VM) where I will install Spark
and also Hadoop. I plan on learning about clusters.

I found Ubuntu has desktop and server versions. Does it matter which one I use?

Thanks!!

J



any API to load large data from web into Cassandra

2014-12-26 Thread Joanne Contact
Hello, I am new here and did not seem to find the answer after a brief search.
Please help.

Thanks!

J


any code to load large data from web into Cassandra

2014-12-26 Thread Joanne Contact
Thank you. I did not express my question clearly.

I wonder if there is sample code to load data from any website into Cassandra?

Say, this webpage http://datatomix.com/?p=84 seems to use Python and tweepy
to call the Twitter API to get data in JSON format and then load the data
into Cassandra.

So it seems tweepy is specific to the Twitter API. Is there code that works
for any website?
By the way, I am not familiar with Python yet, so the answer need not be
limited to Python.
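
For what it's worth, a minimal sketch of pulling data from a URL and writing it into Cassandra with the DataStax Java driver; the endpoint, keyspace, and table are illustrative assumptions:

import scala.io.Source
import com.datastax.driver.core.Cluster

object WebToCassandraSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical endpoint returning one JSON document per line.
    val lines = Source.fromURL("http://example.com/data.json").getLines()

    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("demo") // keyspace "demo" is assumed to exist
    val insert = session.prepare("INSERT INTO raw_docs (id, body) VALUES (?, ?)")

    // Insert each line as-is; a real loader would parse and batch.
    lines.zipWithIndex.foreach { case (line, i) =>
      session.execute(insert.bind(Int.box(i), line))
    }
    session.close()
    cluster.close()
  }
}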

Thanks!

On Fri, Dec 26, 2014 at 12:46 PM, Keith Sterling 
keith.sterl...@first-utility.com wrote:

 Take a look at sstableloader. We use it to load 30+m rows into Cassandra

 The DataStax documentation is a good start.

 --
 *Keith Sterling*
 *Head of Software*

  *E:* keith.sterl...@first-utility.com
  *P:* +44 7771 597 630
  *W:* first-utility.com
  *A:* Opus 40 Business Park,
 Haywood Road, Warwick CV34 5AH



 On Fri, Dec 26, 2014 at 7:59 PM, Joanne Contact joannenetw...@gmail.com
 wrote:

  Hello, I am new here and did not seem to find the answer after a brief search.
 Please help.

 Thanks!

 J





StreamingLinearRegressionWithSGD

2014-12-01 Thread Joanne Contact
Hi Gurus,

I did not look at the code yet. I wonder if StreamingLinearRegressionWithSGD
http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/regression/StreamingLinearRegressionWithSGD.html
is equivalent to LinearRegressionWithSGD
http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/regression/LinearRegressionWithSGD.html
with the starting weights of each batch set to the ending weights of the
previous batch?

Since RidgeRegressionModel
http://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/regression/RidgeRegressionModel.html
does not seem to have a streaming version, I just wonder if this approach
will suffice.
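
For reference, a minimal sketch of StreamingLinearRegressionWithSGD, which, as far as I understand, carries its weights forward from one micro-batch to the next; the input directory and feature count are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingLrSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("streaming-lr").setMaster("local[2]"), Seconds(10))
    // Hypothetical directory of text files, each line in the "(label,[f1,f2,f3])" format.
    val training = ssc.textFileStream("/tmp/training").map(LabeledPoint.parse)
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(3)) // weights are updated on each batch, starting from here
    model.trainOn(training)
    ssc.start()
    ssc.awaitTermination()
  }
}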


Thanks!

J


Is spark streaming +MlLib for online learning?

2014-11-24 Thread Joanne Contact
Hi Gurus,

Sorry for my naive question. I am new.

I seem to have read somewhere that Spark itself is still batch learning, but
Spark Streaming could allow online learning.

I could not find this on the website now.

http://spark.apache.org/docs/latest/streaming-programming-guide.html

I know MLlib uses incremental or iterative algorithms; I wonder if this is
also true across batches of Spark Streaming.

So the question is: say I call MLlib linear regression. Does the training use
one batch of data as training data, and if so, is the model update between
batches already taken care of? That is, will the model eventually use all data
that has arrived from the beginning up to the current time of scoring as
training data, or does it only use data from a limited number of recent
batches as training data?


Many thanks!

J


Re: Is spark streaming +MlLib for online learning?

2014-11-24 Thread Joanne Contact
Thank you Tobias!

On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact joannenetw...@gmail.com
 wrote:

 I seem to have read somewhere that Spark itself is still batch learning, but
 Spark Streaming could allow online learning.


 Spark doesn't do Machine Learning itself, but MLlib does. MLlib currently
 can do online learning only for linear regression
 (https://spark.apache.org/docs/1.1.0/mllib-linear-methods.html#streaming-linear-regression),
 as far as I know.

 Tobias




Persist kafka streams to text file

2014-11-21 Thread Joanne Contact
Hello, I am trying to write a Kafka stream to a text file by running Spark from
my IDE (IntelliJ IDEA). The code is similar to a previous thread on
persisting a stream to a text file.

I am new to Spark and Scala. I believe Spark is running in local mode, as the
console shows:
14/11/21 14:17:11 INFO spark.SparkContext: Spark configuration:
spark.app.name=local-mode

I got the following errors. They are related to Tachyon, but I don't know
whether I have Tachyon installed or not.

14/11/21 14:17:54 WARN storage.TachyonBlockManager: Attempt 1 to create
tachyon dir null failed
java.io.IOException: Failed to connect to master localhost/127.0.0.1:19998
after 5 attempts
at tachyon.client.TachyonFS.connect(TachyonFS.java:293)
at tachyon.client.TachyonFS.getFileId(TachyonFS.java:1011)
at tachyon.client.TachyonFS.exist(TachyonFS.java:633)
at
org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:117)
at
org.apache.spark.storage.TachyonBlockManager$$anonfun$createTachyonDirs$2.apply(TachyonBlockManager.scala:106)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at
org.apache.spark.storage.TachyonBlockManager.createTachyonDirs(TachyonBlockManager.scala:106)
at
org.apache.spark.storage.TachyonBlockManager.init(TachyonBlockManager.scala:57)
at
org.apache.spark.storage.BlockManager.tachyonStore$lzycompute(BlockManager.scala:88)
at org.apache.spark.storage.BlockManager.tachyonStore(BlockManager.scala:82)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:729)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:594)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:145)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: tachyon.org.apache.thrift.TException: Failed to connect to
master localhost/127.0.0.1:19998 after 5 attempts
at tachyon.master.MasterClient.connect(MasterClient.java:178)
at tachyon.client.TachyonFS.connect(TachyonFS.java:290)
... 28 more
Caused by: tachyon.org.apache.thrift.transport.TTransportException:
java.net.ConnectException: Connection refused
at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:185)
at
tachyon.org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
at tachyon.master.MasterClient.connect(MasterClient.java:156)
... 29 more
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at tachyon.org.apache.thrift.transport.TSocket.open(TSocket.java:180)
... 31 more
14/11/21 14:17:54 ERROR storage.TachyonBlockManager: Failed 10 attempts to
create tachyon dir in
/tmp_spark_tachyon/spark-3dbec68b-f5b8-45e1-bb68-370439839d4a/driver

I looked at the code. It has the following part. Is that a problem?

.persist(StorageLevel.OFF_HEAP)

Any advice?
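
For reference, OFF_HEAP in this Spark version stores blocks in Tachyon and expects a Tachyon master at localhost:19998, which would explain the connection errors. A minimal sketch without Tachyon, persisting on heap/disk instead and assuming the spark-streaming-kafka artifact is on the classpath; the ZooKeeper address, group id, topic, and output path are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToTextSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-text").setMaster("local[2]"), Seconds(10))
    // Hypothetical ZooKeeper quorum, group id and topic.
    val lines = KafkaUtils.createStream(ssc, "localhost:2181", "text-sink", Map("my-topic" -> 1))
      .map(_._2)
      .persist(StorageLevel.MEMORY_AND_DISK_SER) // heap/disk instead of OFF_HEAP, so no Tachyon master is needed
    lines.saveAsTextFiles("/tmp/kafka-out/batch")
    ssc.start()
    ssc.awaitTermination()
  }
}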

Thank you!

J


Persist kafka streams to text file, tachyon error?

2014-11-21 Thread Joanne Contact
Resending to the right email list.
-- Forwarded message --
From: Joanne Contact joannenetw...@gmail.com
Date: Fri, Nov 21, 2014 at 2:32 PM
Subject: Persist kafka streams to text file
To: u...@spark.incubator.apache.org

