Hi All,
I am having issues making an external jar available to the Spark shell.
I used the --jars option while starting the Spark shell to make it
available.
When I run Class.forName("org.postgresql.Driver") it does not give
any error,
but when an action is performed on the RDD, I get
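A minimal launch sketch for this case, assuming a locally available PostgreSQL JDBC jar (path and version are placeholders); on some Spark versions the JDBC jar must also be placed on the driver classpath explicitly:

spark-shell --jars /path/to/postgresql-9.4-1201.jdbc4.jar \
  --driver-class-path /path/to/postgresql-9.4-1201.jdbc4.jar   # jar path/version are placeholders

scala> Class.forName("org.postgresql.Driver")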
Hi TD,
Thanks for the elaboration. I have further doubts, based on a further test
I did after your guidance.
Case 1: Standalone Spark --
In standalone mode, as you explained, spark-submit uses master local[*]
implicitly, so it creates as many threads as the number of cores that the VM
has, but the user can
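A quick sketch of overriding that default; local[N] runs N worker threads regardless of the number of physical cores, while local[*] matches the core count (the jar and class names are placeholders):

spark-submit --master local[4] --class com.example.MyApp myapp.jar   # app jar/class are placeholders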
Hi,
I am getting the assertion error while trying to run build/sbt unidoc, the
same as you described in "Building scaladoc using build/sbt unidoc failure".
Could you tell me how you got it working?
Hi All,
I am trying to run a simple join on Hive through the Spark shell on a
pseudo-distributed Cloudera cluster on an Ubuntu machine:
val hc = new HiveContext(sc)
hc.sql("use testdb")
But it is failing with the message:
org.apache.hadoop.hive.ql.parse.SemanticException: Database does not exist:
testdb
Hello everybody,
I'm running a two node spark cluster on ec2, created using the provided
scripts. I then ssh into the master and invoke
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook
--profile=pyspark' spark/bin/pyspark. This launches a spark notebook which
has been instructed
We have a streaming job that makes use of reduceByKeyAndWindow
https://github.com/apache/spark/blob/v1.4.0/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala#L334-L341.
We want this to work with an initial state. The idea is to avoid losing
state if the
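One possible way to seed state, sketched under the assumption that updateStateByKey (which since Spark 1.3 has an overload taking an initialRDD) fits the aggregation; the key/value types, pairStream, and the recovery source are placeholders:

// hypothetical: previous state recovered from external storage
val initialState = ssc.sparkContext.parallelize(Seq(("key1", 10L), ("key2", 3L)))

val updateFunc = (values: Seq[Long], state: Option[Long]) =>
  Some(values.sum + state.getOrElse(0L))

// pairStream: DStream[(String, Long)] (assumed)
val stateful = pairStream.updateStateByKey(
  updateFunc,
  new org.apache.spark.HashPartitioner(ssc.sparkContext.defaultParallelism),
  initialState)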
Thank you for your answer!
The problem is, I cannot ssh to the master directly.
I have to ssh first to a frontend, then from there to another frontend,
and only from this last frontend can I ssh to my master.
Can I do this by ssh-ing with -L to the first two frontends and then to the
master?
And
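A sketch of chaining local forwards through the hops, with hypothetical hostnames and arbitrary intermediate ports; each command is run on the machine reached by the previous one:

ssh -L 8080:localhost:8081 user@frontend1
ssh -L 8081:localhost:8082 user@frontend2   # run on frontend1
ssh -L 8082:localhost:8080 user@master      # run on frontend2

After the three hops, localhost:8080 on your own machine reaches port 8080 on the master.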
Here's an example https://github.com/przemek1990/spark-streaming
Thanks
Best Regards
On Thu, Jul 9, 2015 at 4:35 PM, diplomatic Guru diplomaticg...@gmail.com
wrote:
Hello all,
I'm trying to configure the flume to push data into a sink so that my
stream job could pick up the data. My events
When you connect to the machines you can create an SSH tunnel to access the
UI:
ssh -L 8080:127.0.0.1:8080 MasterMachinesIP
And then you can simply open localhost:8080 in your browser and it should
show the UI.
Thanks
Best Regards
On Thu, Jul 9, 2015 at 7:44 PM, rroxanaioana
It seems to be an issue with Azure; there was a discussion over here:
https://azure.microsoft.com/en-in/documentation/articles/hdinsight-hadoop-spark-install/
Thanks
Best Regards
On Thu, Jul 9, 2015 at 9:42 PM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'm running Spark 1.4 on
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
Thanks
Best Regards
On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar vinodsachin...@gmail.com
wrote:
Hi Guys,
Can anyone please show me how to use the caching feature of Spark via Spark
SQL queries?
-Vinod
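A minimal sketch of both routes, assuming a table already registered as myTable (a placeholder name):

sqlContext.cacheTable("myTable")         // programmatic
sqlContext.sql("CACHE TABLE myTable")    // the same thing via SQL
sqlContext.sql("UNCACHE TABLE myTable")  // release the cache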
Hi,
I've been experimenting with the Spark Word2Vec implementation in the
MLLib package.
It seems to me that only the preparatory steps are actually performed in
a distributed way, i.e. stages 0-2 that prepare the data. In stage 3
(mapPartitionsWithIndex at Word2Vec.scala:312), only one node seems
Hi,
I am a beginner to Spark. I want to save each word and its count to a
Cassandra keyspace, so I wrote the following code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
object SparkWordCount {
def
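A minimal runnable sketch of that pattern, assuming a local Cassandra node, a keyspace test with a table words(word text PRIMARY KEY, count int), and the spark-cassandra-connector on the classpath; all names are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkWordCount")
      .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")   // placeholder input
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // column names must match the Cassandra table definition
    counts.saveToCassandra("test", "words", SomeColumns("word", "count"))
    sc.stop()
  }
}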
Thanks for the help Dean/TD,
I was able to cut the lineage with checkpointing using the following code:
dstream.countByValue().foreachRDD((rdd, time) => {
val joined = rdd.union(current).reduceByKey(_ + _, 2).leftOuterJoin(base)
val toUpdate = joined.filter(myfilter).map(mymap)
val
Hi,
I have the following problem, which is a kind of special case of k
nearest neighbours.
I have an Array of Vectors (v1) and an RDD[(Long, Vector)] of pairs of
vectors with indexes (v2). The array v1 easily fits into a single node's
memory (~100 entries), but v2 is very large (millions of
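A sketch of the usual approach for this shape of problem: broadcast the small array once and scan the large RDD a single time; the distance function and k are placeholders:

import org.apache.spark.mllib.linalg.Vector

val k = 10
val v1B = sc.broadcast(v1)  // Array[Vector], small enough for every executor

// hypothetical distance function (Euclidean)
def distance(a: Vector, b: Vector): Double = math.sqrt(
  a.toArray.zip(b.toArray).map { case (x, y) => (x - y) * (x - y) }.sum)

// for each vector in v2, the k closest entries of v1
val nearest = v2.mapValues { vec =>
  v1B.value.map(c => (c, distance(c, vec))).sortBy(_._2).take(k)
}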
Hi,
I am a bit confused about the steps I need to take to start a Spark application
on a cluster.
So far I had the impression from the documentation that I need to explicitly
submit the application using, for example, spark-submit.
However, from the SparkContext constructor signature I get the
Hi,
I am running single spark-shell but observing this error when I give val sc =
new SparkContext(conf)
15/07/10 15:42:56 WARN AbstractLifeCycle: FAILED
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in
use
java.net.BindException: Address already in use
that's because sc is already initialized. You can do sc.stop() before you
initialize another one.
Thanks
Best Regards
On Fri, Jul 10, 2015 at 3:54 PM, Prateek . prat...@aricent.com wrote:
Hi,
I am running single spark-shell but observing this error when I give val
sc = new
UpdateStateByKey will run the update function on every interval, even if the
incoming batch is empty. Is there a way to prevent that? If the incoming
DStream contains no RDDs (or RDDs of count 0) then I don't want my update
function to run.
Note that this is different from running the update
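As far as I know the invocation itself cannot be suppressed, but a cheap guard inside the update function avoids doing any real work when nothing arrived; a sketch with placeholder types and stream name:

val updateFunc = (newValues: Seq[Int], state: Option[Int]) => {
  if (newValues.isEmpty) state                   // empty batch: keep state untouched
  else Some(state.getOrElse(0) + newValues.sum)  // fold new values into the state
}
stream.updateStateByKey(updateFunc)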
Hi, I have a Spark ML workflow. It uses some persist calls. When I
launch it with a 1 TB dataset, it takes down the whole cluster, because it
fills all the disk space at /yarn/nm/usercache/root/appcache:
http://i.imgur.com/qvRUrOp.png
I found a yarn setting:
Hi Akhil, thank you for your reply. Does that mean that Spark
Streaming only supports Avro out of the box? If that is the case, then why
only Avro? Is there a particular reason?
The project linked is for Scala but I'm using Java. Is there another
project?
On 10 July 2015 at 08:46, Akhil Das
When I run this command:
ashutosh@pas-lab-server7:~/spark-1.4.0$ ./bin/spark-submit \
--class org.apache.spark.graphx.lib.Analytics \
--master spark://172.17.27.12:7077 \
assembly/target/scala-2.10/spark-assembly-1.4.0-hadoop2.2.0.jar \
pagerank soc-LiveJournal1.txt --numEPart=100
Hi everyone,
I have planned to move from MS SQL Server to Spark. I am using around 50,000
to 1 lakh (100,000) records.
Spark's performance is slow when compared to MS SQL Server.
Which is the better place (Spark or SQL) to store and retrieve around
50,000 to 100,000 records?
regards,
Ravi
I would strongly encourage you to read the docs below; they are very useful
in getting up and running:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md
For your use case shown above, you will need to ensure that you include the
appropriate version of the
Hi,
Thanks Todd, the link is really helpful to get started. ☺
-Prateek
From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Friday, July 10, 2015 4:43 PM
To: Prateek .
Cc: user@spark.apache.org
Subject: Re: Saving RDD into cassandra keyspace.
I would strongly encourage you to read the docs at,
Thanks Akhil! I got it . ☺
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Friday, July 10, 2015 4:02 PM
To: Prateek .
Cc: user@spark.apache.org
Subject: Re: SelectChannelConnector@0.0.0.0:4040: java.net.BindException:
Address already in use when running spark-shell
that's because sc
spark-submit does a lot of magic configuration (classpaths, etc.) under
the covers to enable pyspark to find the Spark JARs, etc. I am not sure how
you can start running things directly from the PyCharm IDE; others in the
community may be able to answer. For now the main way to run pyspark stuff
Hey,
Is there any guarantee of fixed ordering among the batches/RDDs?
After searching a lot I found there is no ordering by default (from the
framework itself), not only batch-wise but also within batches. But I wonder
whether anything has changed from old Spark versions to Spark
Hello again.
So I could compute triangle counts when running the code from the Spark
shell without workers (with the --driver-memory 15g option), but with workers
I get errors. So I run the Spark shell:
./bin/spark-shell --master spark://192.168.0.31:7077 --executor-memory
6900m --driver-memory 15g
and workers
Is there a join involved in your sql?
Have a look at spark.sql.shuffle.partitions?
Srikanth
On Wed, Jul 8, 2015 at 1:29 AM, Umesh Kacha umesh.ka...@gmail.com wrote:
Hi Srikanth, thanks for the response. I have the following code:
hiveContext.sql("insert into ...").coalesce(6)
The above code does
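For reference, a sketch of tuning the shuffle partition count up front instead of coalescing afterwards; the value 6 is only illustrative:

hiveContext.setConf("spark.sql.shuffle.partitions", "6")
// joins and aggregations in subsequent SQL now produce 6 partitions
hiveContext.sql("insert into ...")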
Hello all,
In my lab a colleague installed and configured Spark 1.3.0 on a 4-node
cluster in a CDH 5.4 environment. The default port number for our Spark
configuration is 7456. I have been trying to SSH to spark-master using
this port number, but it fails every time with the error "JVM is timed
Thanks Ayan,
I was curious to know how Spark does it. Is there any documentation
where I can get the details about that? Could you please point me to some
detailed links, etc.?
Maybe it does something like transactional topologies in Storm. (
For hadoop 2.x:
jar tvf
~/2-hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/hadoop-mapreduce-client-core-2.8.0-SNAPSHOT.jar
| grep FileInputFormat.class
...
17552 Fri Apr 24 15:57:54 PDT 2015
org/apache/hadoop/mapreduce/lib/input/FileInputFormat.class
Hi Ravi,
First, neither Spark nor Spark SQL is a database; both are compute engines,
which need to be paired with a storage system. Second, they are designed for
processing large distributed datasets. If you have only 100,000 records or
even a million records, you don't need Spark. An RDBMS
Are you talking about reduceByKeyAndWindow with or without inverse reduce?
TD
On Fri, Jul 10, 2015 at 2:07 AM, Imran Alam im...@newscred.com wrote:
We have a streaming job that makes use of reduceByKeyAndWindow
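For context, a sketch of the two variants in question, with placeholder durations; pairs is assumed to be a DStream[(String, Int)] and the usual streaming imports (e.g. Seconds) are in scope:

// without inverse reduce: recomputes over the full window on every slide
val a = pairs.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(30), Seconds(10))

// with inverse reduce: adds arriving batches and subtracts departing ones
val b = pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))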
Hello,
I have a very specific question on how to search between particular
lines of a log file. I did some research to find the answer, and what I
learned is that once a shuffle operation is applied to an RDD, there is no
way to reconstruct the original sequence of lines (except by zipping with an
id). I'm
AFAIK, it is guaranteed that batch t+1 will not start processing until batch
t is done.
Ordering within a batch - what do you mean by that? In essence, the (mini)
batch will get distributed in partitions like a normal RDD, so following
rdd.zipWithIndex should give a way to order them by the time they
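A sketch of that ordering trick, assuming the elements should keep their original order across a shuffle:

// attach a stable index before any shuffle destroys the order
val indexed = rdd.zipWithIndex()  // RDD[(T, Long)]
// ... shuffle-heavy transformations carrying the index along ...
val backInOrder = indexed.sortBy(_._2).map(_._1)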
Michael,
Thanks
- Terry
Michael Armbrust mich...@databricks.com wrote on Saturday, 11 July 2015 at 04:02:
Metastore configuration should be set in hive-site.xml.
On Thu, Jul 9, 2015 at 8:59 PM, Terry Hole hujie.ea...@gmail.com wrote:
Hi,
I am trying to set the hive metadata destination to a mysql database
SSH by default is on port 22. 7456 is the port where the master is
listening, so any Spark app should be able to connect to the master using
that port.
On 11 Jul 2015 13:50, ashishdutt ashish.du...@gmail.com wrote:
Hello all,
In my lab a colleague installed and configured spark 1.3.0 on a 4
Did you try adding the `_` after the method names to partially apply
them? Scala is saying that it is trying to apply those methods immediately
but can't find arguments, whereas you are trying to pass them along as
functions (which, without the `_`, they aren't). Here is a link to a Stack
Overflow answer
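A small illustration of what the `_` does; the method and RDD names are placeholders:

def double(x: Int): Int = x * 2  // placeholder method

rdd.map(double _)  // explicitly turn the method into a function value
rdd.map(double)    // also works where the compiler already expects a function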
Quick and clear answer thank you.
2015-07-09 21:07 GMT+02:00 Nicholas Chammas nicholas.cham...@gmail.com:
No plans to change that at the moment, but agreed it is against accepted
convention. It would be a lot of work to change the tool, change the AMIs,
and test everything. My suggestion is
Thanks, Akhil.
We're trying the conf.setExecutorEnv() approach, since we've already got
environment variables set. For system properties we'd go the
conf.set("spark.*") route.
We were concerned that doing the below type of thing did not work, which
this blog post seems to confirm (
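For reference, a sketch of both routes on SparkConf; the variable names and values are placeholders:

val conf = new SparkConf()
  .setAppName("MyApp")                          // placeholder
  .setExecutorEnv("MY_ENV_VAR", "some-value")   // environment variable on executors
  .set("spark.executor.memory", "4g")           // a spark.* property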
Hey, guys!
I am using Spark for an NGS data application.
In my case I have to broadcast a very big dataset to each task.
However, there are several tasks (say 48) running on CPUs (also 48 cores)
on the same node. These tasks, which run on the same node, could share the
same dataset. But
Hello,
I'm trying to debug a PySpark app with Kafka Streaming in PyCharm. However,
PySpark cannot find the jar dependencies for Kafka Streaming without editing
the program. I can temporarily use SparkConf to set 'spark.jars', but I'm
using Mesos for production and don't want to edit my program
I'm using hadoop 2.5.2 with spark 1.4.0 and I can also see in my logs:
15/07/09 06:39:02 DEBUG HadoopRDD: SplitLocationInfo and other new Hadoop
classes are unavailable. Using the older Hadoop location info code.
java.lang.ClassNotFoundException:
Hi,
I have a very simple setup of Spark SQL connecting to a Postgres DB, and I'm
trying to get a DataFrame from a table, the DataFrame with a number X of
partitions (let's say 2). The code would be the following:
Map<String, String> options = new HashMap<String, String>();
options.put("url", DB_URL);
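A sketch of the partitioned JDBC read, written here in Scala and assuming a numeric column can serve as the partition column; all option values are placeholders:

val df = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:postgresql://host:5432/db",  // placeholder
  "dbtable"         -> "my_table",                        // placeholder
  "partitionColumn" -> "id",                              // must be numeric
  "lowerBound"      -> "1",
  "upperBound"      -> "1000000",
  "numPartitions"   -> "2"                                // X partitions
)).load()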
It looks like there is no problem with Tomcat 8.
On Fri, Jul 10, 2015 at 11:12 AM, Zoran Jeremic zoran.jere...@gmail.com
wrote:
Hi Ted,
I'm running Tomcat 7 with Java:
java version 1.8.0_45
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build
Hi All,
Today, I'm happy to announce SparkHub
(http://sparkhub.databricks.com), a service for the Apache Spark
community to easily find the most relevant Spark resources on the web.
SparkHub is a curated list of Spark news, videos and talks, package
releases, upcoming events around the world,
What version of Java is Tomcat running?
Thanks
On Jul 10, 2015, at 10:09 AM, Zoran Jeremic zoran.jere...@gmail.com wrote:
Hi,
I've developed a Maven application that uses the mongo-hadoop connector to
pull data from MongoDB and process it using Apache Spark. The whole process
runs smoothly
Why does this not work? Is insert into broken in 1.3.1? It does not throw
any errors, fail, or raise exceptions. It simply does not work.
val ssc = new StreamingContext(sc, Minutes(10))
val currentStream = ssc.textFileStream("s3://textFileDirectory/")
val dayBefore =
Hi,
I want to write JUnit test cases in Scala for testing a Spark application.
Is there any guide or link that I can refer to?
Thank you very much.
-Naveen
Hi Ted,
I'm running Tomcat 7 with Java:
java version 1.8.0_45
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
Zoran
On Fri, Jul 10, 2015 at 10:45 AM, Ted Yu yuzhih...@gmail.com wrote:
What version of Java is Tomcat running?
Unless you had something specific in mind, it should be as simple as
creating a SparkContext object using a master of local[2] in your tests
On Fri, Jul 10, 2015 at 1:41 PM, Naveen Madhire vmadh...@umail.iu.edu
wrote:
Hi,
I want to write junit test cases in scala for testing spark
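A minimal sketch of such a test, assuming ScalaTest is on the classpath; the suite name and assertion are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSuite extends FunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterAll(): Unit = sc.stop()

  test("counts words") {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map((_, 1)).reduceByKey(_ + _).collectAsMap()
    assert(counts("a") === 2)
  }
}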
Yes, you can launch (from Java code) pyspark scripts with yarn-cluster mode
without using the spark-submit script.
Check SparkLauncher code in this link
https://github.com/apache/spark/tree/master/launcher/src/main/java/org/apache/spark/launcher
. SparkLauncher is not dependent on Spark core
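A sketch of that approach, with placeholder paths; SparkLauncher ships with Spark 1.4:

import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setSparkHome("/path/to/spark")         // placeholder
  .setAppResource("/path/to/my_job.py")   // the pyspark script (placeholder)
  .setMaster("yarn-cluster")
  .launch()
process.waitFor()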
Hi Jan,
Most SparkContext constructors are there for legacy reasons. The point of
going through spark-submit is to set up all the classpaths, system
properties, and resolve URIs properly *with respect to the deployment mode*.
For instance, jars are distributed differently between YARN cluster
Hi,
I've developed a Maven application that uses the mongo-hadoop connector to
pull data from MongoDB and process it using Apache Spark. The whole process
runs smoothly if I run it on an embedded Jetty server. However, if I deploy
it to Tomcat 7, it's always interrupted at the line of code which
Hi Ashish,
Cool, glad it worked out. I have only used Spark clusters on EC2, which I
spin up using the spark-ec2 scripts (part of the Spark downloads), so I don't
have any experience setting up in-house clusters like you want to do. But I
found some documentation here that may be helpful.
Hi,
I'm trying to create a Spark Streaming actor stream but I'm having several
problems. First of all the guide from
https://spark.apache.org/docs/latest/streaming-custom-receivers.html refers
to the code
Hi Ashutosh, I believe the class is
org.apache.spark.examples.graphx.Analytics?
If you're running PageRank on LiveJournal you could just use
org.apache.spark.examples.graphx.LiveJournalPageRank.
-Andrew
2015-07-10 3:42 GMT-07:00 AshutoshRaghuvanshi
ashutosh.raghuvans...@gmail.com:
when I
I have installed the SparkR package from the Spark distribution into the R
library. I can call the following command and it seems to work properly:
library(SparkR)
However, when I try to get the Spark context using the following code,
sc <- sparkR.init(master="local")
it fails after some time with the
Hi, I have a Hive insert into query which creates new Hive partitions. I
have two Hive partitions named server and date. Now I execute insert into
queries using the following code and try to save it:
DataFrame dframe = hiveContext.sql("insert into summary1
partition(server='a1',date='2015-05-22')
To add to this, conceptually, it makes no sense to launch something in
yarn-cluster mode by creating a SparkContext on the client - the whole
point of yarn-cluster mode is that the SparkContext runs on the cluster,
not on the client.
On Thu, Jul 9, 2015 at 2:35 PM, Marcelo Vanzin
On Fri, Jul 10, 2015 at 1:41 PM, Naveen Madhire vmadh...@umail.iu.edu
wrote:
I want to write junit test cases in scala for testing spark application.
Is there any guide or link which I can refer.
https://spark.apache.org/docs/latest/programming-guide.html#unit-testing
Typically I create test
Metastore configuration should be set in hive-site.xml.
On Thu, Jul 9, 2015 at 8:59 PM, Terry Hole hujie.ea...@gmail.com wrote:
Hi,
I am trying to set the hive metadata destination to a mysql database in
hive context, it works fine in spark 1.3.1, but it seems broken in spark
1.4.1-rc1,
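For reference, a sketch of the relevant hive-site.xml entries for a MySQL-backed metastore; the host, database name, and credentials are placeholders:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>secret</value>
  </property>
</configuration>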
Somewhat biased of course, but you can also use spark-testing-base from
spark-packages.org as a basis for your unit tests.
On Fri, Jul 10, 2015 at 12:03 PM, Daniel Siegmann
daniel.siegm...@teamaol.com wrote:
On Fri, Jul 10, 2015 at 1:41 PM, Naveen Madhire vmadh...@umail.iu.edu
wrote:
I want
When you say tasks, do you mean different applications, or different tasks
in the same application? If it's the same program, they should be able to
share the broadcast value. But given you're asking the question, I imagine
they're separate.
And in that case, AFAIK, the answer is no. You
Hi Ashic,
Thank you very much for your reply!
The tasks I mention are a running Function that I implemented with the Spark
API and passed to each partition of an RDD. Within the Function I broadcast
a big variable to be queried by each partition.
So, when I am running on a 48-core slave node, I
Also, it's worth noting that I'm using the prebuilt version for hadoop 2.4
and higher from the official website.
I can +1 Holden's spark-testing-base package.
Burak
On Fri, Jul 10, 2015 at 12:23 PM, Holden Karau hol...@pigscanfly.ca wrote:
Somewhat biased of course, but you can also use spark-testing-base from
spark-packages.org as a basis for your unittests.
On Fri, Jul 10, 2015 at 12:03 PM, Daniel
Hi,
Initially today, when moving my streaming application to the cluster for the
first time, I ran into the newbie error of using a local file system for
checkpointing, and the RDD partition count differences (see exception below).
Having neither HDFS nor S3 (and the Cassandra connector not yet
Hi,
My Spark job runs without error, but once it completes I get this message
and the app is logged as an incomplete application in my spark-history:
SLF4J: Failed to load class org.slf4j.impl.StaticLoggerBinder
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See
Sorry, this is only indirectly Spark-related.
I've been attempting to create a .NET proxy for spark-core using JNI4NET. At
the moment I'm stuck with the following error when running the proxy
generator:
java.lang.NoClassDefFoundError:
org.apache.hadoop.mapreduce.lib.input.FileInputFormat
I've resolved
No. Works perfectly.
On Fri, Jul 10, 2015 at 3:38 PM, liangdianpeng liangdianp...@vip.163.com
wrote:
if the class inside the spark_XXX.jar was damaged
Sent from NetEase Mail for mobile
On 2015-07-11 06:13, Mulugeta Mammo mulugeta.abe...@gmail.com wrote:
Hi,
My spark job runs without error, but once it