Hi,
I am getting an error in the Query Plan when I use the SQL statement
exactly as you have suggested. Is that the exact SQL statement I should be
using (I am not very familiar with SQL syntax)?
I also tried using the SchemaRDD's subtract method to perform this query.
Hi,
I have installed a 2-node Hadoop cluster (for example, on Unix machines A and
B: A is the master node and a data node, B is a data node).
I am submitting my driver programs to Spark 1.1.0 with bin/spark-submit
from a PuTTY client on my Windows machine.
I want to debug my program from Eclipse.
It's called remote debugging. You can read this article
http://www.eclipse.org/jetty/documentation/current/enable-remote-debugging.html
for more information. You will have to make sure that your cluster and your
Windows machine can reach each other over the network.
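A common way to do this (just a sketch; the JDWP string below is a standard
JVM agent option, not anything Spark-specific) is to launch the driver JVM with
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
for example via spark-submit's --driver-java-options flag, and then attach an
Eclipse Remote Java Application debug configuration to that host and port.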
Thanks
Best Regards
Can you be more specific about numbers?
I am not sure that splitting helps so much in the end, since it has
the same effect as executing, a few at a time, the large number of tasks
that the full Cartesian join would generate.
The full join is probably intractable no matter what in
I've tried:
./bin/spark-shell --jars algebird-core_2.10-0.8.1.jar
scala> import com.twitter.algebird._
import com.twitter.algebird._
scala> import HyperLogLog._
import HyperLogLog._
scala> import com.twitter.algebird.HyperLogLogMonoid
import com.twitter.algebird.HyperLogLogMonoid
scala> val
Hi all,
In Spark Streaming, when I do foreachRDD on my DStreams, I get a
NonSerializable exception when I try to do something like:
DStream.foreachRDD( rdd => {
  sc.parallelize(Seq(("test", "blah")))
})
Is there any way around that ?
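For what it's worth, the usual pattern for pushing DStream data to an external
store looks roughly like the sketch below (the names are illustrative, not a
definitive fix): keeping the per-record work inside foreachPartition means the
closure only has to capture the data, not driver-side objects such as the
SparkContext.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // open a connection to the external store here, once per partition
    records.foreach { record =>
      // write the record using that connection
    }
    // close the connection
  }
}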
Thanks,
Harold
Very coincidentally I ran into something equally puzzling yesterday, where
something was bizarrely null when it couldn't have been, in a Spark program
that extends App. I also changed it to use main() and it works fine. So
there is definitely some issue here. If nobody files a JIRA before I get home,
I'll do it.
On
I'm processing some data using PySpark and I'd like to save the RDDs to disk
(they are (k,v) RDDs of strings and SparseVector types) and read them in
using Scala to run them through some other analysis. Is this possible?
Thanks,
Rok
Thanks
Best Regards
On Thu, Oct 30, 2014 at 1:43 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Awesome.
Thanks
Best Regards
On Thu, Oct 30, 2014 at 1:30 PM, Naveen Kumar Pokala
npok...@spcapitaliq.com wrote:
Thanks Akhil, It is working.
Regards,
Naveen
From: Akhil Das
I'm testing the beta driver from Databricks for Tableau.
And unfortunately I've encountered some issues.
While the beeline connection works without problems, Tableau can't connect to
the Spark thrift server.
Error from the driver (Tableau):
Unable to connect to the ODBC Data Source. Check that the necessary drivers
Hi,
I'm new to MLlib and Spark. I'm trying to use TF-IDF and use those values
for term ranking.
I'm getting the TF values in vector format, but how can I get the values out
of the vector?
val sc = new SparkContext(conf)
val documents: RDD[Seq[String]] =
The split is something like 30 million into 2 million partitions. The reason
that it becomes tractable is that after I perform the Cartesian on the
split data and operate on it I don't keep the full results - I actually
only keep a tiny fraction of that generated dataset - making the overall
As a template for creating a broadcast variable, the following code snippet
within mllib was used:
val bcIdf = dataset.context.broadcast(idf)
dataset.mapPartitions { iter =>
val thisIdf = bcIdf.value
The new code follows that model:
import org.apache.spark.mllib.linalg.{Vector => MVector}
What ODBC driver are you using? We recently got the Hortonworks JODBC drivers
working on a Windows box but were having issues with Mac.
On Oct 30, 2014, at 4:23 AM, Bojan Kostic blood9ra...@gmail.com wrote:
I'm testing the beta driver from Databricks for Tableau.
And
I use the beta SQL ODBC driver from Databricks.
Hi Ladies and Gents,
I would like to know what options I have if I would like to
leverage Spark code I have already written to use a DB (Vertica) as its
store/datasource.
The data is of tabular nature. So any relational DB can essentially be used.
Do I need to develop a context? If yes,
Hi,
Previously we applied the SVM algorithm in MLlib to 5 million records (600
MB); it takes more than 25 minutes to finish.
The Spark version we are using is 1.0 and we were running this program on a
4-node cluster. Each node has 4 CPU cores and 11 GB RAM.
The 5 million records only have two
What's the error with the 2.10 version of algebird?
On Thu, Oct 30, 2014 at 12:49 AM, thadude ohpre...@yahoo.com wrote:
I've tried:
./bin/spark-shell --jars algebird-core_2.10-0.8.1.jar
scala> import com.twitter.algebird._
import com.twitter.algebird._
scala> import HyperLogLog._
import
Roberto, I don't think shark is an issue -- I have shark server running on
a node that also acts as a worker. What you can do is turn off the shark
server and just run start-all to start your Spark cluster. Then you can try
bin/spark-shell --master yourmasterip and see if you can successfully run
some
I also didn’t realize I was trying to bring up the 2ndNameNode as a slave..
that might be an issue as well..
Thanks,
From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Thursday, October 30, 2014 11:27 AM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: problem with
When you are starting the thrift server service - are you connecting to it
locally or is this on a remote server when you use beeline and/or Tableau?
On Thu, Oct 30, 2014 at 8:00 AM, Bojan Kostic blood9ra...@gmail.com wrote:
I use the beta SQL ODBC driver from Databricks.
Watch the app manager; it should tell you what's running and taking a while...
My guess is it's a distinct function on the data.
J
On Oct 30, 2014, at 8:22 AM, peng xia toxiap...@gmail.com wrote:
Hi,
Previously we applied the SVM algorithm in MLlib to 5 million records
Hi.
I am running an application in Spark which first loads data from
Cassandra and then performs some map/reduce jobs.
val srdd = sqlContext.sql("select * from mydb.mytable")
I noticed that the srdd only has one partition, no matter how big the
data loaded from Cassandra is.
So I perform
I'm connecting to it remotely with Tableau/beeline.
On Thu Oct 30 16:51:13 2014 GMT+0100, Denny Lee [via Apache Spark User List]
wrote:
When you are starting the thrift server service - are you connecting to it
locally or is this on a remote server when you use beeline and/or Tableau?
GC overhead limit exceeded is usually a sign of either an inadequate heap size
(not the case here) or an application producing garbage (temp objects) faster
than the garbage collector can collect it - GC consumes most CPU cycles. 17G of
Java heap is quite large for many applications and is above safe and
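If it helps, a minimal sketch for confirming that the collector is thrashing is
to turn on GC logging on the executors (standard HotSpot flags, nothing
Spark-specific beyond the config key):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")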
Hi,
Sorry, there's a typo there:
val arr = rdd.toArray
Harold
On Thu, Oct 30, 2014 at 9:58 AM, Harold Nguyen har...@nexgate.com wrote:
Hi all,
I'd like to be able to modify values in a DStream, and then send it off to
an external source like Cassandra, but I keep getting Serialization
Hi all,
I'd like to be able to modify values in a DStream, and then send it off to
an external source like Cassandra, but I keep getting Serialization errors
and am not sure how to use the correct design pattern. I was wondering if
you could help me.
I'd like to be able to do the following:
Thanks.. I was using Scala 2.11.1 and was able to
use algebird-core_2.10-0.1.11.jar with spark-shell.
On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell i...@ianoconnell.com wrote:
Whats the error with the 2.10 version of algebird?
On Thu, Oct 30, 2014 at 12:49 AM, thadude ohpre...@yahoo.com
Hi Helena,
Well... I am just running a toy example, I have one Cassandra node
co-located with the Spark Master and one of Spark Workers, all in one
machine. I have another node which runs the second Spark worker.
/Shahab,
On Thu, Oct 30, 2014 at 6:12 PM, Helena Edelson
Did you cache the data and check the load balancing? How many
features? Which API are you using, Scala, Java, or Python? -Xiangrui
On Thu, Oct 30, 2014 at 9:13 AM, Jimmy ji...@sellpoints.com wrote:
Watch the app manager; it should tell you what's running and taking a while...
My guess is it's a
Hi,
I noticed that the count (of an RDD) in many of my queries is the most time
consuming one, as it runs in the driver process rather than being done by
the worker nodes in parallel.
Is there any way to perform count in parallel, or at least parallelize
it as much as possible?
best,
/Shahab
Shahab,
Regardless, WRT Cassandra and Spark, when using the spark cassandra connector,
‘spark.cassandra.input.split.size’ passed into the SparkConf configures the
approximate number of Cassandra partitions in a Spark partition (default 10).
No repartitioning should be necessary with what you
You can remove 0.5 from all non-zeros. -Xiangrui
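If it helps, a minimal sketch of that transformation might look like the
following (assuming examples is an RDD[LabeledPoint] with sparse features; the
code is illustrative, not the original poster's):

import org.apache.spark.mllib.linalg.{SparseVector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

val shifted = examples.map { lp =>
  val sv = lp.features.asInstanceOf[SparseVector]
  // subtract 0.5 from every stored (non-zero) value, keeping the sparse layout
  LabeledPoint(lp.label,
    Vectors.sparse(sv.size, sv.indices, sv.values.map(_ - 0.5)))
}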
On Wed, Oct 29, 2014 at 9:20 PM, Sameer Tilak ssti...@live.com wrote:
Hi All,
I have my sparse data in libsvm format.
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
"mllib/data/sample_libsvm_data.txt")
I am running Linear
Hello everyone,
I'm trying to use MLlib's K-means algorithm.
I tried it on raw data. Here is an example of a line contained in my input
data set:
82.9817 3281.4495
with those parameters:
*numClusters*=4
*numIterations*=20
results:
*WSSSE = 6.375371241589461E9*
Then I normalized my data:
The byte array turns out to be a serialized ObjectOutputStream that
contains a Tuple2[ParallelCollectionRDD,Function2].
What then should be done differently in the broadcast code (which follows
the structure of an example taken from mllib)?
assert(crows.isInstanceOf[Array[MVector]])
val bcRows
Hi,
Got this error when running Spark 1.1.0 to read HBase 0.98.1 through simple
Python code in an EC2 cluster. The same program runs correctly in local mode,
so this error only happens when running in a real cluster.
Here's what I got,
14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.1 in
Hi Shahab,
Are you running Spark in Local, Standalone, YARN or Mesos mode?
If you're running in Standalone/YARN/Mesos, then the .count() action is
indeed automatically parallelized across multiple Executors.
When you run a .count() on an RDD, it is actually distributing tasks to
different
By the way, in case you haven't done so, do try to .cache() the RDD before
running a .count() on it as that could make a big speed improvement.
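A minimal sketch of that (assuming an existing RDD named rdd):

rdd.cache()          // mark the RDD to be kept in memory after the first action
val n = rdd.count()  // the count itself runs as parallel tasks on the executors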
On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui same...@databricks.com
wrote:
Hi Shahab,
Are you running Spark in Local, Standalone, YARN or
Found this as I am having the same issue. I have exactly the same usage as
shown in Michael's join example. I tried executing a SQL statement against
the join data set with two columns that have the same name and tried to
disambiguate the column name with the table alias, but I would still get an
The worker side has error message as this,
14/10/30 18:29:00 INFO Worker: Asked to launch executor
app-20141030182900-0006/0 for testspark_v1
14/10/30 18:29:01 INFO ExecutorRunner: Launch command: java -cp
I'm trying to implement a Spark Streaming program to calculate the number of
instances of a given key encountered and the minimum and maximum times at
which it was encountered. updateStateByKey seems to be just the thing, but
when I define the state to be a tuple, I get compile errors I'm not
Thanks for all your help.
I think I didn't cache the data. My previous cluster has expired and I
don't have a chance to check the load balancing or the app manager.
Below is my code.
There are 18 features for each record and I am using the Scala API.
import org.apache.spark.SparkConf
import
Algebird 0.8.0 has 2.11 support if you want to run in a 2.11 env.
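For example, a minimal sbt dependency for that (the %% picks up the
algebird-core_2.11 artifact when scalaVersion is set to 2.11.x):

libraryDependencies += "com.twitter" %% "algebird-core" % "0.8.0"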
On Thu, Oct 30, 2014 at 10:08 AM, Buntu Dev buntu...@gmail.com wrote:
Thanks.. I was using Scala 2.11.1 and was able to
use algebird-core_2.10-0.1.11.jar with spark-shell.
On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell
Curious about why you're only seeing 50 records max per batch.
How many receivers are you running? What is the rate at which you're putting
data onto the stream?
Per the default AWS Kinesis configuration, the producer can do 1000 PUTs
per second with max 50k bytes per PUT and max 1mb per second per
No, but the JVM also does not allocate memory for native code on the heap.
I don't think the heap has any bearing on whether your native code can
allocate more memory, except that of course the heap is also taking memory.
On Oct 30, 2014 6:43 PM, Paul Wais pw...@yelp.com wrote:
Dear Spark List,
I
Yes that is exactly it. The values are not comparable since normalization
is also shrinking all distances. Squared error is not an absolute metric.
I haven't thought about this much but maybe you are looking for something
like the silhouette coefficient?
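For reference, the silhouette of a point i is commonly defined as
s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i
to the other points of its own cluster and b(i) is the mean distance from i to
the points of the nearest other cluster; being a ratio of distances, it is
unaffected by a uniform rescaling of all distances, unlike WSSSE.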
On Oct 30, 2014 5:35 PM, mgCl2
Hi, I saw the same issue as this thread,
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-akka-connection-refused-td9864.html
Does anyone have a fix for this bug? Please?!
The log info in my worker node is like,
14/10/30 20:15:18 INFO Worker: Asked to kill executor
Hi,
I've been exploring the metrics exposed by Spark and I'm wondering whether
there's a way to register job-specific metrics that could be exposed
through the existing metrics system.
Would there be an example somewhere?
BTW, documentation about how the metrics work could be improved. I
I have one Spark job that creates some RDDs of type X and persists them
in memory. The type X is an auto generated Thrift java class (not a case
class though). Now in another job, I want to convert the RDD to a SchemaRDD
using sqlContext.applySchema(). Can I derive a schema from the thrift
That should be possible, although I'm not super familiar with thrift.
You'll probably need access to the generated metadata
http://people.apache.org/~thejas/thrift-0.9/javadoc/org/apache/thrift/meta_data/package-frame.html
.
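A minimal sketch of the applySchema route (the field names and getters below
are hypothetical, and deriving them automatically from the thrift metadata is
omitted):

import org.apache.spark.sql._

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))
// map each thrift object to a Row with the same field order as the schema
val rowRDD = thriftRDD.map(x => Row(x.getId, x.getName))
val schemaRDD = sqlContext.applySchema(rowRDD, schema)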
Shameless plug: If you find yourself reading a lot of thrift data you
Thanks Helena, very useful comment,
But is ‘spark.cassandra.input.split.size’ only effective in a cluster, not on a
single node?
best,
/Shahab
On Thu, Oct 30, 2014 at 6:26 PM, Helena Edelson helena.edel...@datastax.com
wrote:
Shahab,
Regardless, WRT cassandra and spark when using the spark
I think I understand how to deal with this, though I don't have all the code
working yet. The point is that the V of (K, V) can itself be a tuple. So
the updateFunc prototype looks something like
val updateDhcpState = (newValues: Seq[Tuple1[(Int, Time, Time)]], state:
Option[Tuple1[(Int,
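A complete sketch of that idea might look like the one below (the state here is
(count, min time, max time) with Long epoch values standing in for Time, and
keyedDStream is a hypothetical DStream of (K, (Int, Long, Long)) pairs - all
illustrative, not the original code):

val updateFunc = (newValues: Seq[(Int, Long, Long)], state: Option[(Int, Long, Long)]) => {
  val (count0, min0, max0) = state.getOrElse((0, Long.MaxValue, Long.MinValue))
  val merged = newValues.foldLeft((count0, min0, max0)) {
    case ((c, mn, mx), (n, tMin, tMax)) =>
      (c + n, math.min(mn, tMin), math.max(mx, tMax))
  }
  Some(merged)
}
val stateDStream = keyedDStream.updateStateByKey(updateFunc)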
Hi,
I'm writing a paper and I need to calculate TF-IDF. With your help I
managed to get the results I needed, but the problem is that I need to be able
to explain how each number was obtained. So I tried to understand how IDF was
calculated, and the numbers I get don't correspond to those I should get.
Hi Andrejs,
The calculations are a bit different from what I've come across in
Mining Massive Datasets (2nd Ed., Ullman et al., Cambridge Press), available
here: http://www.mmds.org/
Their calculation of IDF is as follows:
IDF_i = log2(N / n_i)
where N is the number of documents and n_i is the number of documents in which
term i appears.
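For instance, with N = 10 documents and a term that appears in n_i = 2 of them,
IDF_i = log2(10 / 2) = log2(5) ≈ 2.32.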
Then caching should solve the problem. Otherwise, it is just loading
and parsing data from disk for each iteration. -Xiangrui
On Thu, Oct 30, 2014 at 11:44 AM, peng xia toxiap...@gmail.com wrote:
Thanks for all your help.
I think I didn't cache the data. My previous cluster was expired and I
Hi,
While testing SparkSQL on top of our Hive metastore, I am getting
some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD
table.
Basically, I have a table mtable partitioned by some date field in Hive,
and below is the Scala code I am running in spark-shell:
val sqlContext =
I followed this
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Akka-Error-while-running-Spark-Jobs/td-p/18602
but the problem was not fixed.
Hi All,
When I load an RDD with:
data = sc.textFile("somefile")
I don't see the resulting RDD in the SparkContext GUI on localhost:4040 in
/storage.
Is there something special I need to do to allow me to view this? I tried
both the Scala and Python shells but got the same result.
Thanks
Stuart
Hey Stuart,
The RDD won't show up under the Storage tab in the UI until it's been
cached. Basically Spark doesn't know what the RDD will look like until it's
cached, b/c up until then the RDD is just on disk (external to Spark). If
you launch some transformations + an action on an RDD that is
Sorry, too quick to pull the trigger on my original email. I should have
added that I tried using persist() and cache() but no joy.
I'm doing this:
data = sc.textFile("somedata")
data.cache
data.count()
but I still can't see anything in the storage?
On 31 October 2014 10:42, Sameer
Hi Xiangrui,
Can you give me some code example about caching, as I am new to Spark.
Thanks,
Best,
Peng
On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng men...@gmail.com wrote:
Then caching should solve the problem. Otherwise, it is just loading
and parsing data from disk for each iteration.
How do I build the Scaladoc HTML files from the Spark source distribution?
Alex Bareta
Hi Stuart,
You're close!
Just add a () after the cache, like: data.cache()
...and then run the .count() action on it and you should be good to see it
in the Storage UI!
- Sameer
On Thu, Oct 30, 2014 at 4:50 PM, Stuart Horsman stuart.hors...@gmail.com
wrote:
Sorry too quick to pull the
sampleRDD.cache()
On Oct 30, 2014, at 5:01 PM, peng xia toxiap...@gmail.com wrote:
Hi Xiangrui,
Can you give me some code example about caching, as I am new to Spark.
Thanks,
Best,
Peng
On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng men...@gmail.com wrote:
Hi,
I've been trying to move up from Spark 0.9.2 to 1.1.0.
I'm getting a little confused with the setup for a few different use cases,
grateful for any pointers...
(1) spark-shell with jars that are only required by the driver
(1a)
I added spark.driver.extraClassPath /mypath/to.jar to my
Thanks Jimmy.
I will give it a try.
Thanks very much for your help, guys.
Best,
Peng
On Thu, Oct 30, 2014 at 8:19 PM, Jimmy ji...@sellpoints.com wrote:
sampleRDD.cache()
On Oct 30, 2014, at 5:01 PM, peng xia toxiap...@gmail.com wrote:
Hi Xiangrui,
Can you give me some
Try using --jars instead of the driver-only options; they should work with
spark-shell too but they may be less tested.
Unfortunately, you do have to specify each JAR separately; you can maybe use a
shell script to list a directory and get a big list, or set up a project that
builds all of the
--jars does indeed work but this causes the jars to also get shipped to
the workers -- which I don't want to do for efficiency reasons.
I think you are saying that setting spark.driver.extraClassPath in
spark-defaults.conf ought to have the same behavior as providing
--driver-class-path to
Yeah, I think you should file this as a bug. The problem is that JARs need to
also be added into the Scala compiler and REPL class loader, and we probably
don't do this for the ones in this driver config property.
Matei
On Oct 30, 2014, at 6:07 PM, Shay Seng s...@urbanengines.com wrote:
My other question is: why does Spark not provide foldLeft:
def foldLeft[U](zeroValue: U)(op: (U, T) => U): U
but only aggregate?
The def fold(zeroValue: T)(op: (T, T) => T): T in Spark is not
deterministic either.
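For what it's worth, a minimal sketch of aggregate standing in for foldLeft
(computing a sum and a count in one pass; the example data is made up):

val rdd = sc.parallelize(1 to 100)
val (sum, count) = rdd.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),   // fold an element into a partition's accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))   // merge two partitions' accumulators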
On Thu, Oct 30, 2014 at 3:50 PM, Jason Zaugg jza...@gmail.com wrote:
On Thu, Oct 30,
Hmmm, this looks like a bug. Can you file a JIRA?
On Thu, Oct 30, 2014 at 4:04 PM, Jean-Pascal Billaud j...@tellapart.com
wrote:
Hi,
While testing SparkSQL on top of our Hive metastore, I am getting
some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD
table.
RDD.toLocalIterator returns the partitions one by one, but with all elements in
a partition materialized, which is not lazily calculated. Given the design of
Spark, it is very hard to maintain the state of an iterator across runJob.
def toLocalIterator: Iterator[T] = {
def collectPartition(p: Int): Array[T]
Hi All,
We have recently moved to Spark 1.1 from 0.9 for an application handling a
fair number of very large datasets partitioned across multiple nodes. About
half of each of these large datasets is stored in off heap byte arrays and
about half in the standard Java heap.
While these datasets are
The problem is simple:
I want to stream data 24/7, do some calculations and save the result in a
CSV/JSON file so that I could use it for visualization using dc.js/d3.js.
I opted for Spark Streaming on a YARN cluster with Kafka and tried running it
24/7,
using groupByKey and updateStateByKey to
Hey Sameer,
Wouldn't local[x] run count in parallel in each of the x threads?
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Thu, Oct 30, 2014 at 11:42 PM, Sameer Farooqui same...@databricks.com
wrote:
Hi Shahab,
Are you running Spark
Hi Preshant, Chester, Mohammed,
I switched to Spark's Akka and now it works well. Thanks for the help!
(Need to exclude Akka from Spray dependencies, or specify it as provided)
Jianshi
On Thu, Oct 30, 2014 at 3:17 AM, Mohammed Guller moham...@glassbeam.com
wrote:
I am not sure about that.
Thanks Akhil. I tried changing /root/ephemeral-hdfs/conf/hdfs-site.xml to
have
<property>
  <name>dfs.data.dir</name>
  <value>/vol,/vol0,/vol1,/vol2,/vol3,/vol4,/vol5,/vol6,/vol7,/mnt/ephemeral-hdfs/data,/mnt2/ephemeral-hdfs/data</value>
</property>
and then running
Harold,
just mentioning it in case you run into it: If you are in a separate
thread, there are apparently stricter limits to what you can and cannot
serialize:
val someVal = ...
future {
  // be very careful with defining RDD operations using someVal here
  val myLocalVal = someVal
  // use myLocalVal instead of someVal inside the RDD operations
}
AFAIK, you can read data from a DB with JdbcRDD, but there is no interface
for writing to a DB.
JdbcRDD has some restrictions, such as the SQL having to contain a where clause.
For writing to a DB, you can use mapPartitions or foreachPartition to
implement it.
You can refer to this example:
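For example, a minimal sketch of the foreachPartition write pattern (the JDBC
URL, table and column names below are placeholders, not from the original
thread):

import java.sql.DriverManager

rdd.foreachPartition { rows =>
  // one connection per partition, created on the executor
  val conn = DriverManager.getConnection("jdbc:vertica://host:5433/db", "user", "pass")
  val stmt = conn.prepareStatement("INSERT INTO mytable (k, v) VALUES (?, ?)")
  rows.foreach { case (k, v) =>
    stmt.setString(1, k)
    stmt.setString(2, v)
    stmt.executeUpdate()
  }
  stmt.close()
  conn.close()
}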