You can add this jar
http://central.maven.org/maven2/com/101tec/zkclient/0.3/zkclient-0.3.jar
in the classpath to get rid of this. If you are hitting further exceptions
like ClassNotFoundException for metrics* etc., then make sure you have all these jars
in the classpath:
What about start-all.sh or start-slaves.sh?
Thanks
Best Regards
On Tue, Oct 21, 2014 at 10:25 AM, Soumya Simanta soumya.sima...@gmail.com
wrote:
I'm working on a cluster where I need to start the workers separately and
connect them to a master.
I'm following the instructions here and using
Hi,
What do you mean by pretty small? How big is your file?
Regards,
Olivier.
2014-10-21 6:01 GMT+02:00 Kevin Jung itsjb.j...@samsung.com:
I use Spark 1.1.0 and set these options to spark-defaults.conf
spark.scheduler.mode FAIR
spark.cores.max 48
spark.default.parallelism 72
Thanks,
I don't think this is provided out of the box, but you can use toSeq on
your Iterable and if the Iterable is lazy, it should stay that way for the
Seq.
And then you can use sc.parallelize(myIterable.toSeq) so you'll have your
RDD.
For the Iterable[Iterable[T]] you can flatten it and then create
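A minimal sketch of both cases, assuming a SparkContext named sc (the iterable names are invented):

import org.apache.spark.rdd.RDD

// myIterable and nested stand in for whatever Iterables you actually have.
val myIterable: Iterable[Int] = (1 to 1000).view
val nested: Iterable[Iterable[Int]] = Seq(Seq(1, 2), Seq(3, 4))

val rdd: RDD[Int] = sc.parallelize(myIterable.toSeq)        // Iterable -> RDD
val flat: RDD[Int] = sc.parallelize(nested.flatten.toSeq)   // flatten, then parallelize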
If you already know your keys the best way would be to extract
one RDD per key (it would not bring the content back to the master and you
can take advantage of the caching features) and then execute a
registerTempTable by Key.
But I'm guessing, you don't know the keys in advance, and in this
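A rough sketch of the known-keys case (rdd, sqlContext and the key values are all assumed/illustrative, Spark 1.1 style):

case class Record(key: String, value: Int)

// rdd: RDD[Record] and sqlContext are assumed to already exist.
import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD

val keys = Seq("a", "b", "c")
keys.foreach { k =>
  val perKey = rdd.filter(_.key == k).cache()     // one RDD per key, kept cached
  perKey.registerTempTable("records_" + k)        // one temp table per key
}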
Hi All,
I am trying to run the spark example JavaDecisionTree code using some
external data set.
It works for certain datasets only with specific maxBins and maxDepth
settings. Even for a working dataset, if I add a new data item I get an
ArrayIndexOutOfBoundsException; I get the same exception
Thanks, Guillaume,
Below is when the exception happens, nothing has spilled to disk yet.
And there isn't a join, but a partitionBy and groupBy action.
Actually if numPartitions is small, it succeeds, while if it's large, it
fails.
Partitioning was simply done by
override def getPartition(key:
Hi!
The RANK function is available in hive since version 0.11.
When trying to use it in SparkSQL, I'm getting the following exception (full
stacktrace below):
java.lang.ClassCastException:
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRank$RankBuffer cannot be
cast to
Hi,
Is there a simple way to run Spark SQL queries against SQL Server databases? Or
are we limited to running SQL and doing sc.parallelize()? Being able to query
small amounts of lookup info directly from Spark can save a bunch of annoying
ETL, and I'd expect Spark SQL to have some way of doing
I have s3-compatible service and I'd like to have access to it in spark.
From what I have gathered, I need to add
s3service.s3-endpoint=my_s3_endpoint to a jets3t.properties file on the
classpath. I'm not a Java programmer and I'm not sure where to put it in the
hello-world example.
I managed to make it
Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL
Server. Currently Spark SQL can't run queries against SQL Server. The
foreign data source API planned in Spark 1.2 can make this possible.
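For what it's worth, a rough sketch of the JdbcRDD route (the URL, query and bounds are placeholders, and the SQL Server JDBC driver jar has to be on the classpath):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Placeholder connection string; the SQL must contain two '?' bound parameters.
val url = "jdbc:sqlserver://myhost:1433;databaseName=mydb;user=me;password=secret"

val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(url),
  "SELECT id, name FROM lookup WHERE id >= ? AND id <= ?",
  1, 10000, 4,                                    // lowerBound, upperBound, numPartitions
  (rs: ResultSet) => (rs.getInt("id"), rs.getString("name")))

rows.take(5).foreach(println)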
On 10/21/14 6:26 PM, Ashic Mahtab wrote:
Hi,
Is there a simple way to run spark
Hi,
I am VERY new to spark and mllib and ran into a couple of problems while
trying to reproduce some examples. I am aware that this is a very simple
question but could somebody please give me an example
- how to create a RowMatrix in scala with the following entries:
[1 2
3 4]?
I would like to
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val personPath = "/hdd/spark/person.json"
val person = sqlContext.jsonFile(personPath)
person.printSchema()
person.registerTempTable("person")
val addressPath = "/hdd/spark/address.json"
val address = sqlContext.jsonFile(addressPath)
thank you
it works! The akka timeout may be the bottleneck in my system.
On Oct 20, 2014, at 17:07, Akhil Das ak...@sigmoidanalytics.com wrote:
I used to hit this issue when my data size was too large and the number of
partitions was too large (> 1200). I got rid of it by
- Reducing the number of
thanks
I need to check whether Spark 1.1.0 contains it.
On Oct 21, 2014, at 0:01, DB Tsai dbt...@dbtsai.com wrote:
I ran into the same issue when the dataset is very big.
Marcelo from Cloudera found that it may be caused by SPARK-2711, so their
Spark 1.1 release reverted SPARK-2711, and the issue is gone.
Thanks. Didn't know about JdbcRDD... should do nicely for now. The foreign data
source API looks interesting...
Date: Tue, 21 Oct 2014 20:33:03 +0800
From: lian.cs@gmail.com
To: as...@live.com; user@spark.apache.org
Subject: Re: Getting Spark SQL talking to Sql Server
That's true Guillaume.
I'm currently aggregating documents using a week as the time range.
I will have to make it daily and aggregate the results later.
thanks for your hints anyway
Arian Pasquali
http://about.me/arianpasquali
2014-10-20 13:53 GMT+01:00 Guillaume Pitel
Hi Spark! I found out why my RDDs weren't coming through in my Spark
stream.
It turns out that onStart() needs to return, it seems - i.e. you
need to launch the worker part of your
start process in a thread. For example
def onStartMock():Unit ={
val future = new Thread(new
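For completeness, a minimal sketch of that pattern (a custom receiver whose onStart() returns immediately and feeds data from a background thread; the data source here is faked):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MockReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  // onStart must return quickly: spawn a thread and let it push data into the stream.
  def onStart(): Unit = {
    new Thread("mock-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("hello")      // push one record
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  // The worker thread checks isStopped() and exits on its own.
  def onStop(): Unit = {}
}

Hooked up with something like ssc.receiverStream(new MockReceiver()).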
Collect will store the entire output in a List in memory. This solution is
acceptable for Little Data problems although if the entire problem fits
in the memory of a single machine there is less motivation to use Spark.
Most problems which benefit from Spark are large enough that even the data
Hi Tridib,
I changed SQLContext to HiveContext and it started working. These are steps
I used.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val person = sqlContext.jsonFile("json/person.json")
person.printSchema()
person.registerTempTable("person")
val address =
Hmm... I thought HiveContext will only work if Hive is present. I am curious
to know when to use HiveContext and when to use SQLContext.
Thanks Regards
Tridib
Hi,
I am creating a Cassandra Java RDD and transforming it using the where
clause.
It works fine when I run it outside the mapValues, but when I put the code
in mapValues I get an error while creating the transformation.
Below is my sample code:
CassandraJavaRDD<ReferenceData>
No, analytic and window functions do not work yet.
On Tue, Oct 21, 2014 at 3:00 AM, Pierre B
pierre.borckm...@realimpactanalytics.com wrote:
Hi!
The RANK function is available in hive since version 0.11.
When trying to use it in SparkSQL, I'm getting the following exception
(full
Hi All!
I'm getting my feet wet with pySpark for the fairly boring case of
doing parameter sweeps for Monte Carlo runs. Each of my functions runs for
a very long time (2h+) and returns numpy arrays on the order of ~100 MB.
That is, my spark applications look like
def foo(x):
what could cause this type of 'stage failure'? Thanks!
This is a simple py spark script to list data in hbase.
command line: ./spark-submit --driver-class-path
~/spark-examples-1.1.0-hadoop2.3.0.jar /root/workspace/test/sparkhbase.py
14/10/21 17:53:50 INFO BlockManagerInfo: Added
maybe set up a hbase.jar in the conf?
Just to follow up, the queries worked against master and I got my whole flow
rolling. Thanks for the suggestion! Now if only Spark 1.2 will come out with
the next release of CDH5 :P
-Terry
From: Terry Siu terry@smartfocus.com
Date: Monday, October 20, 2014
Hmm... I thought HiveContext will only work if Hive is present. I am curious
to know when to use HiveContext and when to use SQLContext.
http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started
TLDR; Always use HiveContext if your application does not have a dependency
Hi all,
Can anyone tell me how to set the native library path in Spark.
Right now I am setting it using the SPARK_LIBRARY_PATH environment variable
in spark-env.sh. But still no success.
I am still seeing this in spark-shell.
NativeCodeLoader: Unable to load native-hadoop library for your
Thanks for pointing that out.
Oh - and one other note on this, which appears to be the case.
If, in your stream's foreachRDD implementation, you do something stupid
(like call rdd.count())
tweetStream.foreachRDD((rdd, lent) => {
tweetStream.repartition(1)
numTweetsCollected += 1
//val count = rdd.count()
Hi,
What would be the best way to get percentiles from a Spark RDD? I can see
JavaDoubleRDD or MLlib's MultivariateStatisticalSummary
https://spark.apache.org/docs/latest/mllib-statistics.html provide the
mean() but not percentiles.
Thank you!
Horace
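A rough sketch of one way to do it (an exact percentile via a global sort; assumes an RDD[Double] and that a full sort is acceptable):

import org.apache.spark.SparkContext._   // pair-RDD functions (needed before Spark 1.3)
import org.apache.spark.rdd.RDD

def percentile(data: RDD[Double], p: Double): Double = {
  val indexed = data.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
  val n = indexed.count()
  val rank = math.min(n - 1, math.ceil(p / 100.0 * n).toLong)
  indexed.lookup(rank).head   // value at the requested rank
}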
Hi,
I'd like to run my python script using spark-submit together with a JAR
file containing Java specifications for a Hadoop file system. How can I do
that? It seems I can either provide a JAR file or a Python file to
spark-submit.
So far I have been running my code in ipython with
Any help? or comments?
This is as much of a Scala question as a Spark question
I have an RDD:
val rdd1: RDD[(Long, Array[Long])]
This RDD has duplicate keys that I can collapse as such:
val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b)
If I start with an Array of primitive longs in rdd1, will rdd2
Hi All, I have a question regarding the ordering of indices. The document says
that the indices are one-based and in ascending order. However, do the
indices within a row need to be sorted in ascending order?
Sparse data: It is very common in practice to have sparse training data. MLlib
Ok thanks Michael.
In general, what's the easy way to figure out what's already implemented?
The exception I was getting was not really helpful here.
Also, is there a roadmap document somewhere?
Thanks!
P.
Thanks for the help!
Hadoop version: 2.3.0
Hbase version: 0.98.1
Use Python to read/write data from/to HBase.
The only change over the official Spark 1.1.0 is the pom file under examples.
Compilation:
spark: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean
package
A rather more general question is - assume I have a JavaRDD<K> which is
sorted -
How can I convert this into a JavaPairRDD<Integer, K> where the Integer is
the index - 0...N - 1.
Easy to do on one machine
JavaRDD<K> values = ... // create here
JavaRDD<Integer, K> positions =
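One option is zipWithIndex; a Scala sketch (the same method should also be available on JavaRDD):

import org.apache.spark.rdd.RDD

// values is assumed to be already sorted; zipWithIndex assigns 0-based positions.
val values: RDD[String] = sc.parallelize(Seq("a", "b", "c"))
val positions: RDD[(Long, String)] =
  values.zipWithIndex().map { case (v, i) => (i, v) }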
The Maven cache is laid out differently, but it does work on Linux and BSD/Mac.
Still looks like a hack to me.
On Oct 21, 2014, at 1:28 PM, Pat Ferrel p...@occamsmachete.com wrote:
Doesn’t this seem like a dangerous error prone hack? It will build different
bits on different machines. It doesn’t
I am running a simple rdd filter command. What does it mean?
Here is the full stack trace(and code below it):
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 133
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at
Hi all, I tried to use the function SchemaRDD.where() but got some error:
val people = sqlCtx.sql("select * from people")
people.where('age === 10)
<console>:27: error: value === is not a member of Symbol
where did I go wrong?
Thanks,
Kevin Paul
this is the stack trace I got with yarn logs -applicationId
really no idea where to dig further.
thanks!
yang
14/10/21 14:36:43 INFO ConnectionManager: Accepted connection from [
phxaishdc9dn1262.stratus.phx.ebay.com/10.115.58.21]
14/10/21 14:36:47 ERROR Executor: Exception in task ID 98
You need to import sqlCtx._ to get access to the implicit conversion.
On Tue, Oct 21, 2014 at 2:40 PM, Kevin Paul kevinpaulap...@gmail.com
wrote:
Hi all, I tried to use the function SchemaRDD.where() but got some error:
val people = sqlCtx.sql("select * from people")
people.where('age ===
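A minimal sketch of what that looks like in the shell:

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._   // brings in the implicits that give Symbol the === operator

val people = sqlCtx.sql("select * from people")
people.where('age === 10)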
Just posted below for a similar question.
Have you seen this thread?
http://search-hadoop.com/m/JW1q5ezXPH/KryoException%253A+Buffer+overflowsubj=RE+spark+nbsp+kryo+serilizable+nbsp+exception
On Tue, Oct 21, 2014 at 2:44 PM, Yang tedd...@gmail.com wrote:
this is the stack trace I got
Hi,
I want to ingest Open Street Map. It's 43GB (compressed) XML in BZIP2
format. What's your advice for reading it in to an RDD?
BTW, the Spark Training at UMD is awesome! I'm having a blast learning
Spark. I wish I could go to the MeetUp tonight, but I have kid activities...
Are there any specific issues you are facing?
Thanks,
Yin
On Tue, Oct 21, 2014 at 4:00 PM, tridib tridib.sama...@live.com wrote:
Any help? or comments?
Yes. where the indices are one-based and **in ascending order**. -Xiangrui
On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak ssti...@live.com wrote:
Hi All,
I have a question regarding the ordering of indices. The document says that
the indices are one-based and in ascending order.
Set the Spark port to a different one and the connection seems successful,
but I get a 302 to /proxy on port 8100? Nothing is listening on that port
either.
Please check out the example code:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/TallSkinnySVD.scala
-Xiangrui
On Tue, Oct 21, 2014 at 5:34 AM, viola viola.wiersc...@siemens.com wrote:
Hi,
I am VERY new to spark and mllib and ran into a
Great, I will sort them.
Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone
-------- Original message --------
From: Xiangrui Meng men...@gmail.com
Date: 10/21/2014 3:29 PM (GMT-08:00)
To: Sameer Tilak ssti...@live.com
Cc: user@spark.apache.org
Hi John,
Glad you're enjoying the Spark training at UMD.
Is the 43 GB XML data in a single file or split across multiple BZIP2 files?
Is the file in a HDFS cluster or on a single linux machine?
If you're using BZIP2 with splittable compression (in HDFS), you'll need at
least Hadoop 1.1:
Hi Sadhan,
Which port are you specifically trying to redirect? The driver program has
a web UI, typically on port 4040... or the Spark Standalone Cluster Master
has a UI exposed on port 7077.
Which setting did you update in which file to make this change?
And finally, which version of Spark are
Hello,
Spark 1.1.0, Hadoop 2.4.1
I have written a Spark streaming application. And I am getting
FileAlreadyExistsException for rdd.saveAsTextFile(outputFolderPath).
Here is briefly what I am trying to do.
My application is creating a text file stream using the Java streaming context. The
input file is
Hi,
Can you post what the error looks like?
Sameer F.
Hi, this sounds like a bug which has been fixed in the current master.
What version of Spark are you using? Would it be possible to update to the
current master?
If not, it would be helpful to know some more of the problem dimensions
(num examples, num features, feature types, label type).
Hi Shailesh,
Spark just leverages the Hadoop File Output Format to write out the RDD you
are saving.
This is really a Hadoop OutputFormat limitation which requires the
directory it is writing into to not exist. The idea is that a Hadoop job
should not be able to overwrite the results from a
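One common workaround (just a sketch; stream and outputFolderPath are assumed names) is to give every micro-batch its own sub-directory, so the target directory never already exists:

import org.apache.spark.streaming.Time

stream.foreachRDD { (rdd, time: Time) =>
  // one directory per batch, keyed by the batch time
  rdd.saveAsTextFile(outputFolderPath + "/batch-" + time.milliseconds)
}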
Hi there,
I'm using Spark 1.1.0 and experimenting with trying to use the DataStax
Cassandra Connector (https://github.com/datastax/spark-cassandra-connector)
from within PySpark.
As a baby step, I'm simply trying to validate that I have access to classes
that I'd need via Py4J. Sample python
It seems that ++ does the right thing on arrays of longs, and gives you another
one:
scala> val a = Array[Long](1,2,3)
a: Array[Long] = Array(1, 2, 3)
scala> val b = Array[Long](1,2,3)
b: Array[Long] = Array(1, 2, 3)
scala> a ++ b
res0: Array[Long] = Array(1, 2, 3, 1, 2, 3)
scala> res0.getClass
Thanks Sameer for the quick reply.
I will try to implement it.
Shailesh
It was mainly because Spark was setting the jar classes in a thread-local
context classloader. The quick fix was to make our SerDe use the context
classloader first.
Thanks to folks here for the suggestions. I ended up settling on what seems
to be a simple and scalable approach. I am no longer using
sparkContext.textFiles with wildcards (it is too slow when working with a
large number of files). Instead, I have implemented directory traversal as
a Spark job,
Yes, I am unable to get jsonFile() to detect the date type
automatically from JSON data.
Hi,
I have been trying to find a fairly complex application that makes use of
the Spark Streaming framework. I checked public github repos but the
examples I found were too simple, only comprising simple operations like
counters and sums. On the Spark summit website, I could find very
interesting
Hi, I am using the latest Calliope library from tuplejump.com to create an RDD
for a Cassandra table.
I am on a 3-node Spark 1.1.0 cluster with YARN.
My Cassandra table is defined as below and I have about 2000 rows of data
inserted.
CREATE TABLE top_shows (
program_id varchar,
view_minute timestamp,
Is this because I am calling a transformation function on an rdd from
inside another transformation function?
Is it not allowed?
Thanks
Ankut
On Oct 21, 2014 1:59 PM, Ankur Srivastava ankur.srivast...@gmail.com
wrote:
Hi Gerard,
this is the code that may be helpful.
public class
To add one more thing about question 1: once you get the SchemaRDD from
jsonFile/jsonRDD, you can use CAST(columnName AS DATE) in your query to
cast the column type from StringType to DateType (the string format
should be yyyy-[m]m-[d]d, and you need to use hiveContext). Here is the
code snippet
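Something along these lines (a sketch with invented table/column names, not the original snippet):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val person = hiveContext.jsonFile("/hdd/spark/person.json")
person.registerTempTable("person")

// Cast a string column of the form yyyy-[m]m-[d]d to a DATE
val withDates = hiveContext.sql(
  "SELECT name, CAST(birthday AS DATE) AS birthday FROM person")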
You can resort to SQLContext.jsonFile(path: String, samplingRate:
Double) and set samplingRate to 1.0, so that all the columns can be
inferred.
You can also use SQLContext.applySchema to specify your own schema
(which is a StructType).
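A small sketch of the applySchema route (field names, types and the input file are illustrative):

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// Declare the column types yourself instead of relying on inference.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

val rowRDD = sc.textFile("people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
val people = sqlContext.applySchema(rowRDD, schema)
people.registerTempTable("people")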
On 10/22/14 5:56 AM, Harivardan Jayaraman
Looks like the only way is to implement that feature. There is no way of
hacking it into working
You ran out of Kryo buffer. Are you using Spark 1.1 (which supports buffer
resizing) or Spark 1.0 (which has a fixed-size buffer)?
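If it is the buffer, a sketch of the relevant settings (key names as in the 1.x docs; values are arbitrary and should be tuned to your record sizes):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "64")       // initial buffer
  .set("spark.kryoserializer.buffer.max.mb", "512")  // ceiling for resizing (1.1+)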
On Oct 21, 2014 5:30 PM, nitinkak001 nitinkak...@gmail.com wrote:
I am running a simple rdd filter command. What does it mean?
Here is the full stack trace(and code
Hi Joseph
I am using Spark 1.1.0, the latest version; I will try to update to the
current master and check.
The example I am running is JavaDecisionTree, and the dataset is in libsvm
format, containing
1. 45 training instances.
2. 5 features
3. I am not sure what the feature type is, but
Hi all. Just upgraded our cluster to CDH 5.2 (with Spark 1.1) but now I can
no longer set the number of executors or executor-cores. No matter what
values I pass on the command line to spark they are overwritten by the
defaults. Does anyone have any idea what could have happened here? Running
on
Hi all,
I have a large amount of data in text files (1,000,000 lines). Each line has 128
columns. Here each line is a feature and each column is a dimension.
I have converted the txt files to JSON format and am able to run SQL queries
on the JSON files using Spark.
Now I am trying to build a k-dimension
Thanks for the quick response. However, I still only get error messages. I am
able to load a .txt file with entries in it and use it in Spark, but I am
not able to create a simple matrix, for instance a 2x2 row matrix
[1 2
3 4]
I tried variations such as
val RowMatrix =
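In case it helps, a minimal sketch of one way to build that 2x2 matrix in the shell (assuming the usual sc):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each local Vector is one row of the matrix.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)))

val mat = new RowMatrix(rows)
// mat.numRows() == 2, mat.numCols() == 2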