spark-ec2 is the way to go; however, you may need to debug connectivity issues.
For example, do you know whether the servers were correctly set up in AWS, and can
you access each node using ssh? If not, then you need to work out why (it's not a
Spark issue). If yes, then you will need to work out why ssh
The ultimate aim of my program is to be able to wrap an arbitrary Scala
function (mostly statistics / customized rolling-window metrics) in
a UDF and evaluate it on DataFrames using the window functionality.
So my main question is how do I express that a UDF takes a frame of rows
from a
Hi Jeetendra,
I faced this issue. I did not specify the database where this table exists.
Please set the database by using the USE <database> command before executing the
query.
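For example (the database and table names here are hypothetical), in the spark-shell with a Hive-backed sqlContext:
sqlContext.sql("USE my_database")
sqlContext.sql("SELECT * FROM my_table LIMIT 10").show()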
Regards,
Ishwardeep
From: Jeetendra Gangele gangele...@gmail.com
Sent: Monday,
I am new to Spark Streaming. I was running the kafka_wordcount example with
a local Kafka and Spark instance. It was very easy to set this up and get
going :) I tried running both the Scala and Python versions of the word count
example. The Python version seems to be extremely slow. Sometimes it has
Thanks,
but I think this is not a case of multiple Spark contexts (nevertheless, I
tried your suggestion - it didn't work). The problem is joining two datasets using
an array item's value: attribute.value in my case. Does anyone have ideas?
On 24 Aug 2015, at 15:01, satish chandra j
Hi,
If your join logic is correct, it seems to be a similar issue to one I faced
recently.
Can you try setting
new SparkConf().set("spark.driver.allowMultipleContexts", "true")
Regards,
Satish Chandra
On Mon, Aug 24, 2015 at 2:51 PM, Ilya Karpov i.kar...@cleverdata.ru wrote:
Hi, guys
I'm confused
Hi,
I've had some trouble developing a Specs2 matcher that checks that a
predicate holds for all the elements of an RDD, and using it for testing a
simple Spark Streaming program. I've finally been able to get code that
works; you can see it in
So what is the best way to deploy a Spark cluster in an EC2 environment? Any
suggestions?
Garry
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Friday, August 21, 2015 4:27 PM
To: Garry Chen g...@cornell.edu
Cc: user@spark.apache.org
Subject: Re: Spark ec2 launch problem
It may happen that
In your example, a.attributes.name is a list, not a string. Run this
to find out:
a.select($"a.attributes.name").show()
On Mon, Aug 24, 2015 at 2:51 PM, Ilya Karpov i.kar...@cleverdata.ru wrote:
Hi, guys
I'm confused about joining columns in SparkSQL and need your advice.
I want to
The first job is to infer the JSON schema, and the second one is the actual
query you meant.
You can provide the schema while loading the JSON file, like below:
sqlContext.read.schema(xxx).json("…")
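For example, a minimal sketch assuming a people.json with name and age fields (adjust the StructType to your data):
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", LongType)))
val df = sqlContext.read.schema(schema).json("examples/src/main/resources/people.json")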
Hao
From: Jeff Zhang [mailto:zjf...@gmail.com]
Sent: Monday, August 24, 2015 6:20 PM
To:
System properties and environment variables are two different things. One
can use spark.executor.extraJavaOptions to pass system properties and
spark-env.sh to pass environment variables.
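A small sketch of the difference, assuming the hypothetical names myenvvar / MYENVVAR:
// system property, set via --conf "spark.executor.extraJavaOptions=-Dmyenvvar=xxx"
val fromSystemProperty = System.getProperty("myenvvar")
// environment variable, set via "export MYENVVAR=yyy" in conf/spark-env.sh
val fromEnvironment = sys.env.get("MYENVVAR")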
-raghav
On Mon, Aug 24, 2015 at 1:00 PM, Hemant Bhanawat hemant9...@gmail.com
wrote:
That's surprising.
And make sure hive-site.xml is on the classpath or under
$SPARK_HOME/conf
Hao
From: Ishwardeep Singh [mailto:ishwardeep.si...@impetus.co.in]
Sent: Monday, August 24, 2015 8:57 PM
To: user
Subject: Re: Loading already existing tables in spark shell
Hi Jeetendra,
I faced
I am trying to access a mid-size Teradata table (~100 million rows) via
JDBC in standalone mode on a single node (local[*]). When I tried with a BIG
table (5B records), no results were returned upon completion of the query.
I am using Spark 1.4.1, and it is set up on a very powerful machine (2 CPUs, 24
cores,
Additional info... If I use an online md5sum check then it matches... So,
it's either Windows or Python (using 2.7.10)
On Mon, Aug 24, 2015 at 11:54 AM, Justin Pihony justin.pih...@gmail.com
wrote:
When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now
seen this on two
For now, user-defined window functions are not supported. We will add them in
the future.
On Mon, Aug 24, 2015 at 6:26 AM, xander92 alexander.fra...@ompnt.com
wrote:
The ultimate aim of my program is to be able to wrap an arbitrary Scala
function (mostly will be statistics / customized rolling window
I found this solution:
https://stackoverflow.com/questions/3390484/python-hashlib-md5-differs-between-linux-windows
Does anybody see a reason why I shouldn't put in a PR to make this change?
FROM
with open(tgz_file_path) as tar:
TO
with open(tgz_file_path, "rb") as tar:
On Mon, Aug 24, 2015 at
That's not the expected behavior. What version of Spark?
On Mon, Aug 24, 2015 at 1:32 AM, Kevin Jung itsjb.j...@samsung.com wrote:
When I store a DataFrame as a table with the saveAsTable command and then
execute DROP TABLE in Spark SQL, it doesn't actually delete the files in the hive
warehouse.
The table
When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now
seen this on two different machines. I am running on Windows, but I would
imagine that shouldn't affect the md5. Is this a boto problem, a Python
problem, or a Spark problem?
Here is the answer to my question, if somebody needs it:
Running Spark in standalone mode or coarse-grained Mesos mode leads to
better task launch times than the fine-grained Mesos mode.
The resource is
http://spark.apache.org/docs/latest/streaming-programming-guide.html
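If coarse-grained mode is what you want, a minimal sketch of enabling it (in Spark 1.x the Mesos scheduler defaults to fine-grained), e.g. in conf/spark-defaults.conf:
spark.mesos.coarse   true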
On Mon, Aug 24, 2015
textFile is a lazy operation. It doesn't evaluate until you call an action
on it, such as .count(). Therefore, you won't catch the exception there.
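A small sketch of the idea (the path is hypothetical); the failure for a missing input typically surfaces only at the action:
val lines = sc.textFile("hdfs:///path/that/may/not/exist")   // lazy: nothing is read here
try {
  println(lines.count())   // the action is where a missing path actually fails
} catch {
  case e: org.apache.hadoop.mapred.InvalidInputException =>
    println("Input path not found: " + e.getMessage)
}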
Best,
Burak
On Mon, Aug 24, 2015 at 9:09 AM, Roberto Coluccio
roberto.coluc...@gmail.com wrote:
Hello folks,
I'm experiencing an unexpected
Forgot to include the PR I was referencing:
https://github.com/apache/spark/pull/4805/
It doesn't matter if shuffling occurs. Just update ZK from the driver,
inside the foreachRDD, after all your dynamodb updates are done. Since
you're just doing it for monitoring purposes, that should be fine.
On Mon, Aug 24, 2015 at 12:11 PM, suchenzang suchenz...@gmail.com wrote:
Forgot to
Dear Cody,
Thanks for your response. I am trying to do decoration, which means that when a
message comes from Kafka (partitioned by key) into Spark, I want to add
more fields/data to it.
How do people normally do this in Spark? If it were you, how would you decorate
the message without hitting
The laziness is hard to deal with in these situations. I would suggest
trying to handle expected cases (FileNotFound, etc.) using other methods
before even starting a Spark job. If you really want to try/catch a
specific portion of a Spark job, one way is to just follow it with an
action. You can
Can you please remove me from this distribution list?
(Filling up my inbox too fast)
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Monday, August 24, 2015 2:13 PM
To: Philip Weaver philip.wea...@gmail.com
Cc: Jerrick Hoang jerrickho...@gmail.com; Raghavendra Pandey
I have a file in HDFS inside my HortonWorks HDP 2.3_1 VirtualBox VM.
If I go into the guest spark-shell and refer to the file like this, it works fine:
val words = sc.textFile("hdfs:///tmp/people.txt")
words.count
However, if I try to access it from a local Spark app on my Windows host, it
doesn't
@Michael: would listStatus calls read the actual parquet footers within the
folders?
On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott scott.sere...@nielsen.com
wrote:
Can you please remove me from this distribution list?
(Filling up my inbox too fast)
*From:* Michael Armbrust
No, starting with Spark 1.5 we should by default only be reading the
footers on the executor side (that is unless schema merging has been
explicitly turned on).
On Mon, Aug 24, 2015 at 12:20 PM, Jerrick Hoang jerrickho...@gmail.com
wrote:
@Michael: would listStatus calls read the actual parquet
Hi Aurelien,
The first code should create a new RDD in memory at each iteration (check
the web UI).
The second code will unpersist the RDD, but that's not the main problem.
I think you are having trouble due to long lineage, as .cache() keeps track of
lineage for recovery.
You should have a look at
What are the best practices to submit a Spark Streaming application on Mesos?
I would like to know about the scheduler mode.
Is `coarse-grained` mode the right solution?
Thanks
When I store a DataFrame as a table with the saveAsTable command and then execute
DROP TABLE in Spark SQL, it doesn't actually delete the files in the hive warehouse.
The table disappears from the table list, but the data files are still alive.
Because of this, I can't saveAsTable with the same name before dropping
It's weird to me that the simple show function costs 2 Spark jobs.
DataFrame#explain shows it is a very simple operation; I'm not sure why it needs 2
jobs.
== Parsed Logical Plan ==
Relation[age#0L,name#1]
JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json]
==
Hi,
Is there any function to find the determinant of a mllib.linalg.Matrix
(a covariance matrix) using Spark?
Regards,
Naveen
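MLlib itself does not expose a determinant, but one approach is to convert the local matrix to Breeze (which Spark already depends on); a minimal sketch, assuming a square matrix:
import breeze.linalg.{det, DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.Matrix

def determinant(m: Matrix): Double = {
  require(m.numRows == m.numCols, "determinant is only defined for square matrices")
  det(new BDM(m.numRows, m.numCols, m.toArray))   // toArray is column-major, as Breeze expects
}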
Hi,
I have a Spark standalone cluster with 100s of applications per day, and it
changes size (more or fewer workers) at various hours. The driver runs on a
separate machine outside the Spark cluster.
When a job is running and its worker is killed (because at that hour the
number of workers is
That's surprising. Passing the environment variables using
spark.executor.extraJavaOptions=-Dmyenvvar=xxx to the executor and then
fetching them using System.getProperty("myenvvar") has worked for me.
What is the error that you guys got?
On Mon, Aug 24, 2015 at 12:10 AM, Sathish Kumaran Vairavelu
Hi, guys
I'm confused about joining columns in Spark SQL and need your advice.
I want to join 2 datasets of profiles. Each profile has a name and an array of
attributes (age, gender, email, etc).
There can be multiple instances of an attribute with the same name, e.g. a profile
has 2 emails - so 2 attributes
Hi All,
Please find below the fix info and the respective solution for users who are
following the mail chain of this issue:
*reduceByKey: Non working snippet*
import org.apache.spark.Context
import org.apache.spark.Context._
import org.apache.spark.SparkConf
val conf = new SparkConf()
val sc = new
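The rest of the snippet is truncated above; for reference, a working equivalent would look roughly like this (the corrected imports are an assumption about the fix being described):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val conf = new SparkConf().setAppName("reduceByKeyExample")
val sc = new SparkContext(conf)
sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .reduceByKey(_ + _)
  .collect()   // Array((a,4), (b,2))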
Hi,
I am working on a streaming application.
I tried to configure the history server to persist the application's events to
the Hadoop file system (HDFS). However, it is not logging any events.
I am running Apache Spark 1.4.1 (pyspark) under Ubuntu 14.04 with three
nodes.
Here is my configuration:
File -
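For comparison, a minimal sketch of the settings usually needed for event logging plus the history server, e.g. in conf/spark-defaults.conf (the HDFS path is hypothetical):
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8020/spark-events
spark.history.fs.logDirectory    hdfs://namenode:8020/spark-events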
Hi,
I am using Spark 1.4 and I am getting an array out of bounds exception when I am
trying to read from a registered table in Spark.
For example, if I have 3 different text files with the content as below:
Scenario 1:
A1|B1|C1
A2|B2|C2
Scenario 2:
A1| |C1
A2| |C2
Scenario 3:
A1| B1|
A2| B2|
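One common cause of this (an assumption, since the stack trace is not shown) is that String.split drops trailing empty fields, so scenario 3 yields fewer columns than expected; a small sketch of the usual workaround:
val row = "A1|B1|"
row.split("\\|").length       // 2 -- the trailing empty field is dropped
row.split("\\|", -1).length   // 3 -- all fields preserved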
Just to add: you can also look into the SPARK_WORKER_INSTANCES configuration in
the spark-env.sh file.
On Aug 17, 2015 3:44 AM, Daniel Darabos daniel.dara...@lynxanalytics.com
wrote:
Hi Praveen,
On Mon, Aug 17, 2015 at 12:34 PM, praveen S mylogi...@gmail.com wrote:
What does this mean in
Follow the directions here: http://spark.apache.org/community.html
On Mon, Aug 24, 2015 at 11:36 AM, Sereday, Scott scott.sere...@nielsen.com
wrote:
Can you please remove me from this distribution list?
(Filling up my inbox too fast)
*From:* Michael Armbrust
Hi guys and gals,
I have a Spark 1.2.0 instance running that I connect to via the thrift
interface using beeline. On this instance I can send a command like `show
tables like 'tmp*';` and I get a list of all tables that start with `tmp`.
When testing this same command out on a server that is
Hi,
I am using spark 1.4 M1 with the Cassandra Connector and run into a strange
error when using the spark shell.
This works:
sc.cassandraTable("events",
"bid_events").select("bid", "type").take(10).foreach(println)
But as soon as I put a map() in there (or filter):
sc.cassandraTable("events",
I have set up Apache Mesos using Mesosphere on CentOS 6 with Java 8. I have
3 slaves which total 3 cores and 8 GB of RAM. I have set no firewalls. I am
trying to run the following lines of code to test whether the setup is
working:
val data = 1 to 1
val distData = sc.parallelize(data)
I set up IPython Notebook to work with the pyspark shell, and now I'd like
to use %run to basically 'spark-submit' another Python Spark file and leave
the objects accessible within the Notebook.
I tried this, but got a "ValueError: Cannot run multiple SparkContexts at
once" error. I then tried
This top line here indicates that the exception is being thrown from
your code (i.e. code written in the console).
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:40)
Check to make sure that you are properly handling data
The Scala version of the Kafka example is something that we have been working on
for a while, and is likely to be more optimized than the Python one. The
Python one definitely requires passing the data back and forth between the JVM and
the Python VM and decoding the raw bytes to Python strings (probably less
Hello,
I'm trying to run a Spark 1.5 job with:
./spark-shell --driver-java-options "-Xdebug
-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044 -Xms16g
-Xmx48g -Xss128m"
I get lots of error messages like:
15/08/24 20:24:33 INFO ExternalSorter: Thread 172 spilling in-memory map of
Hey Garry,
Have you verified that your particular VPC and subnet are open to the
world? In particular, have you verified the route table attached to your
VPC / subnet contains an internet gateway open to the public?
I've run into this issue myself recently and that was the problem for me.
Continuing this discussion:
http://apache-spark-user-list.1001560.n3.nabble.com/same-log4j-slf4j-error-in-spark-9-1-td5592.html
I am getting this error when I use logback-classic.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
Hi Utkarsh,
Unfortunately that's not going to be easy. Since Spark bundles all
dependent classes into a single fat jar file, to remove that
dependency you'd need to modify Spark's assembly jar (potentially in
all your nodes). Doing that per-job is even trickier, because you'd
probably need some
We're going to be upgrading from Spark 1.0.2 and using hadoop-1.2.1, so we need
to build by hand. (Yes, I know. Use hadoop-2.x, but standard resource
constraints apply.) I want to build against Scala 2.11 and publish to our
artifact repository, but I'm finding build/spark-2.10.4 and tracing down what
I've used the instructions and it worked fine.
Can you post exactly what you're doing, and what it fails with? Or are you
just trying to understand how it works?
2015-08-24 15:48 GMT-04:00 Lanny Ripple la...@spotright.com:
Hello,
The instructions for building spark against scala-2.11
The scala-2.11 property (-Dscala-2.11) triggers the scala-2.11 profile and
additionally disables the scala-2.10 profile, so that's the way to do
it. But yes, you also need to run the script beforehand to set up the
build for Scala 2.11 as well.
On Mon, Aug 24, 2015 at 8:48 PM, Lanny Ripple
Changing the IP to the guest IP address just never connects.
The VM has port tunnelling, and it passes through all the main ports,
8020 included, to the host VM.
You can tell that it was talking to the guest VM before, simply
because it said the file was not found
Error is:
Exception in thread
When you launch your HDP guest VM, most likely it gets launched with NAT
and an address on a private network (192.168.x.x), so on your Windows host
you should use that address (you can find it out using ifconfig on the guest
OS).
I usually add an entry to my /etc/hosts for VMs that I use often... if
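So from the Windows host the read would look roughly like this (the address is a placeholder for whatever ifconfig reports on the guest):
val words = sc.textFile("hdfs://192.168.56.101:8020/tmp/people.txt")
words.count()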
Hello,
The instructions for building spark against scala-2.11 indicate using
-Dspark-2.11. When I look in the pom.xml I find a profile named
'spark-2.11' but nothing that would indicate I should set a property. The
sbt build seems to need the -Dscala-2.11 property set. Finally build/mvn
does a
Move your count operation outside the foreach and use a broadcast to access
it inside the foreach.
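A minimal sketch of that pattern (dataRdd and otherRdd are hypothetical names):
val total = dataRdd.count()            // action runs on the driver, outside the foreach
val totalBc = sc.broadcast(total)
otherRdd.foreach { record =>
  val t = totalBc.value                // read the broadcast value; no nested RDD action in the closure
  // ... use t together with record ...
}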
On Aug 17, 2015 10:34 AM, Priya Ch learnings.chitt...@gmail.com wrote:
Looks like it's because of SPARK-5063:
RDD transformations and actions can only be invoked by the driver, not
inside of other
Hi Marcelo,
When I add this exclusion rule to my pom:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.4.1</version>
  <exclusions>
    <exclusion>
      <groupId>org.slf4j</groupId>
When updating the ZK offset in the driver (within foreachRDD), there is
somehow a serialization exception getting thrown:
15/08/24 15:45:40 ERROR JobScheduler: Error in job generator
java.io.NotSerializableException: org.I0Itec.zkclient.ZkClient
at
Hi Cheng,
I know that sqlContext.read will trigger one Spark job to infer the schema.
What I mean is that DataFrame#show costs 2 Spark jobs, so overall it would cost 3
jobs.
Here's the command I use:
Here's the command I use:
val df =
I get the same error even when I set the SPARK_CLASSPATH: export
SPARK_CLASSPATH=/.m2/repository/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.1.jar:/.m2/repository/ch/qos/logback/logback-core/1.1.2/logback-core-1.1.2.jar
And I run the job like this:
http://hortonworks.com/blog/windows-explorer-experience-hdfs/
This seemed to exist, but now there's no sign of it.
Is there anything similar to tie HDFS into Windows Explorer?
Thanks,
Hi Utkarsh,
A quick look at slf4j's source shows it loads the first
StaticLoggerBinder in your classpath. How are you adding the logback
jar file to spark-submit?
If you use spark.driver.extraClassPath and
spark.executor.extraClassPath to add the jar, it should take
precedence over the log4j
Much appreciated! I am not comparing with select count(*) for
performance, but it was one simple thing I tried to check the performance
:). I think it now makes sense since Spark tries to extract all records
before doing the count. I thought having an aggregated function query
submitted over
That didn't work since the extraClassPath flag was still appending the jars
at the end, so it's still picking up the slf4j jar provided by Spark.
Although I found this flag: --conf spark.executor.userClassPathFirst=true
(http://spark.apache.org/docs/latest/configuration.html) and tried this:
➜ simspark
Can you show the complete stack trace?
Which Spark / Kafka release are you using?
Thanks
On Mon, Aug 24, 2015 at 4:58 PM, Cassa L lcas...@gmail.com wrote:
Hi,
I am storing messages in Kafka using protobuf and reading them into
Spark. I upgraded protobuf version from 2.4.1 to 2.5.0. I got
On Mon, Aug 24, 2015 at 3:58 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:
That didn't work since extraClassPath flag was still appending the jars at
the end, so its still picking the slf4j jar provided by spark.
Out of curiosity, how did you verify this? The extraClassPath
options are
Hi,
I am storing messages in Kafka using protobuf and reading them into Spark.
I upgraded protobuf version from 2.4.1 to 2.5.0. I got
java.lang.UnsupportedOperationException for older messages. However, even
for new messages I get the same error. Spark does convert it though. I see
my messages.
I assumed that's the case because of the error I got and the documentation,
which says: "Extra classpath entries to append to the classpath of the
driver."
This is where I stand now:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
There are many case classes and concepts like
Attribute/AttributeReference/Expression in Spark SQL.
I would like to ask what Attribute/AttributeReference/Expression mean. Given a SQL
query like select a, b from c, are a and b two Attributes? Is a + b an
expression?
It looks like I have misunderstood it
Any help would be appreciated
On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
My question was how to do this in Hadoop? Could somebody point me to some
examples?
On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY umesh9...@gmail.com
wrote:
Of course, Java or
CCing the mailing list again.
It's currently not on the radar. Do you have a use case for it? I can bring
it up during 1.6 roadmap planning tomorrow.
On Mon, Aug 24, 2015 at 8:28 PM, alexis GILLAIN ila...@hotmail.com wrote:
Hi,
I just realized the article I mentioned is cited in the jira and
Hao,
I can reproduce it using the master branch. I'm curious why you cannot
reproduce it. Did you check whether the input HadoopRDD did have two partitions?
My test code is
val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()
Best Regards,
Shixiong Zhu
2015-08-25 13:01
Hi Jeff, which version are you using? I couldn't reproduce the 2 Spark jobs in
the `df.show()` with the latest code; we did refactor the code for the json data
source recently, so I'm not sure whether you're running an earlier version of it.
And a known issue is that Spark SQL will try to re-list the files every time when
I'm trying to build Spark 1.4 with Java 7 and despite having that as my
JAVA_HOME, I get
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
spark-launcher_2.10 ---
[INFO] Using zinc server for incremental compilation
[info] Compiling 8 Java sources to
Because defaultMinPartitions is 2 (See
https://github.com/apache/spark/blob/642c43c81c835139e3f35dfd6a215d668a474203/core/src/main/scala/org/apache/spark/SparkContext.scala#L2057
), your input people.json will be split to 2 partitions.
At first, `take` will start a job for the first partition.
Thanks, Michael.
I discovered it myself. In the end, it was not a bug in Spark.
I have two HDFS clusters, and Hive uses hive.metastore.warehouse.dir +
fs.defaultFS (HDFS1) for saving internal tables, but also references a default
database URI (HDFS2) in the DBS table of the metastore.
It may not be a
I'd start off by trying to simplify that closure - you don't need the
transform step, or currOffsetRanges to be scoped outside of it. Just do
everything in foreachRDD. Likewise, it looks like zkClient is also scoped
outside of the closure passed to foreachRDD
i.e. you have
zkClient = new
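A rough sketch of that shape (stream and zkQuorum are hypothetical names; the ZK write itself is elided):
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}
import org.I0Itec.zkclient.ZkClient

stream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... do the DynamoDB updates here ...
  val zkClient = new ZkClient(zkQuorum)   // created inside the closure, so nothing needs serializing
  try {
    offsetRanges.foreach { or =>
      // persist or.topic / or.partition / or.untilOffset to ZK for monitoring
    }
  } finally {
    zkClient.close()
  }
}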
Yes, check the source code under:
https://github.com/apache/spark/tree/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst
From: Todd [mailto:bit1...@163.com]
Sent: Tuesday, August 25, 2015 1:01 PM
To: user@spark.apache.org
Subject: Test case for the spark sql catalyst
Hi, Are
I think you could try sorting the endPointsCount and then doing a take.
This should be a distributed process and only the result would get returned
to the driver.
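For example, assuming endPointsCount is an RDD of (endpoint, count) pairs:
val top10 = endPointsCount.sortBy(_._2, ascending = false).take(10)
// or, avoiding a full sort of the whole RDD:
val top10b = endPointsCount.top(10)(Ordering.by(_._2))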
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
Check out Reifier at Spark Summit 2015
Okay, but how? That's what I am trying to figure out. Any command you would
suggest?
Sent from my iPhone, please excuse any typos :)
On Aug 21, 2015, at 11:45 PM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
You can get the list of all the persisted RDDs using the Spark context...
On
I was running a Spark job to crunch a 9GB Apache log file when I saw the
following error:
15/08/25 04:25:16 WARN scheduler.TaskSetManager: Lost task 99.0 in stage 37.0
(TID 4115, ip-10-150-137-100.ap-southeast-1.compute.internal):
ExecutorLostFailure (executor 29 lost)
15/08/25 04:25:16 INFO
Hi, are there test cases for the Spark SQL catalyst, such as tests for the rules
that transform an unresolved query plan?
Thanks!
I am using the steps from this article
https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 to
get spark up and running on EMR through yarn. Once up and running I ssh in
and cd to the spark bin and run spark-shell --master yarn. Once this spins
up I can see that the UI is started