I'm using Python to set up a dataframe, but for some reason it is not being made
available to SQL. Code (from Zeppelin) below. I don't get any error when
loading/prepping the data or dataframe. Any tips?
(Originally I was not hardcoding the Row() structure, as my other tutorial
added it by
Try SparkConf.set("spark.akka.extensions", whatever); under the hood, I think
Spark won't ship properties which don't start with spark.* to the executors.
Thanks
Best Regards
On Mon, May 11, 2015 at 8:33 AM, Terry Hole hujie.ea...@gmail.com wrote:
Hi all,
I'd like to monitor Akka using Kamon,
Did you try repartitioning? You might end up with a lot of time spent on
GC though.
Thanks
Best Regards
On Fri, May 8, 2015 at 11:59 PM, Vijay Pawnarkar vijaypawnar...@gmail.com
wrote:
I am using the Spark Cassandra connector to work with a table with 3
million records. Using .where() API
Try this:
res = ssc.sql(your SQL without limit)
print res.first()
Note: your SQL looks wrong, as count will need a GROUP BY clause.
Best
Ayan
On 11 May 2015 16:22, Tyler Mitchell tyler.mitch...@actian.com wrote:
I'm using Python to setup a dataframe, but for some reason it is not
being made
Have a look over here https://storm.apache.org/community.html
Thanks
Best Regards
On Sun, May 10, 2015 at 3:21 PM, anshu shukla anshushuk...@gmail.com
wrote:
http://stackoverflow.com/questions/30149868/generate-events-tuples-using-csv-file-with-timestamps
--
Thanks Regards,
Anshu Shukla
Hi All,
Thanks for suggestions. What I tried is -
hiveContext.sql("add jar ...") and that helps to complete the CREATE
TEMPORARY FUNCTION, but while using this function I get ClassNotFound for
the class handling this function. The same class is present in the jar
added.
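For reference, a sketch of the pattern being described (the jar path, function
name and class below are placeholders, not the original code):
  hiveContext.sql("ADD JAR /path/to/my-udfs.jar")
  hiveContext.sql("CREATE TEMPORARY FUNCTION my_func AS 'com.example.udf.MyFunc'")
  // The ClassNotFound appears at this point, when the function is actually invoked:
  hiveContext.sql("SELECT my_func(col1) FROM some_table")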
Please note that the same
I try to get the result schema of aggregate functions using DataFrame
API.
However, I find the result fields of the groupBy columns are always nullable
even when the source field is not nullable.
I want to know if this is by design, thank you! Below is the simple code
to show the issue.
==
import
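The original snippet is truncated above; below is a minimal sketch (with
illustrative column names, not the original code) of the kind of program that
shows the behaviour being described:
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("key", StringType, nullable = false),
    StructField("value", LongType, nullable = false)))
  val df = sqlContext.createDataFrame(
    sc.parallelize(Seq(Row("a", 1L), Row("b", 2L))), schema)
  val result = df.groupBy("key").sum("value")
  result.printSchema()  // "key" is reported as nullable = true even though the input field was not nullable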
Hi, Akhil,
I tried this. It did not work. I also tried
SparkConf.set("akka.extensions", "[\"kamon.system.SystemMetrics\", \"kamon.statsd.StatsD\"]"),
and it also did not work.
Thanks
On Mon, May 11, 2015 at 2:56 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Try
This is the stack trace of the worker thread:
org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
I have one hdfs dir, which contains many files:
/user/root/1.txt
/user/root/2.txt
/user/root/3.txt
/user/root/4.txt
and there is a daemon process which adds one file per minute to this dir
(e.g., 5.txt, 6.txt, 7.txt...).
I want to start a Spark Streaming job which loads 3.txt, 4.txt and then
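For reference, a minimal sketch of the usual directory-monitoring entry point
(directory, batch interval and processing are illustrative); note that by
default it only picks up files that appear after the job starts, so the
already-existing files like 3.txt and 4.txt would need separate handling:
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(60))
  // New files dropped into this directory become one RDD per batch.
  val lines = ssc.textFileStream("hdfs:///user/root/")
  lines.foreachRDD { rdd => println(rdd.count()) }   // placeholder processing
  ssc.start()
  ssc.awaitTermination()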
hi,
Is it possible to use a custom distance measure and another data type as
vector?
I want to cluster temporal geo data.
best regards
paul
Hi,
Are you creating the table from hive? Which version of hive are you using?
Thanks,
Daoyuan
-Original Message-
From: Nick Travers [mailto:n.e.trav...@gmail.com]
Sent: Sunday, May 10, 2015 10:34 AM
To: user@spark.apache.org
Subject: Spark SQL and java.lang.RuntimeException
I'm
Hi Paul,
I would say that it should be possible, but you'll need a different
distance measure which conforms to your coordinate system.
2015-05-11 14:59 GMT+02:00 Pa Rö paul.roewer1...@googlemail.com:
hi,
Is it possible to use a custom distance measure and another data type as
vector?
I
Hi ,
I am trying to read Nested Avro data in Spark 1.3 using DataFrames.
I need help to retrieve the Inner element data in the Structure below.
Below is the schema when I enter df.printSchema :
|-- THROTTLING_PERCENTAGE: double (nullable = false)
|-- IMPRESSION_TYPE: string (nullable = false)
I am running Spark jobs on YARN cluster. It took ~30 seconds to create a
spark context, while it takes only 1-2 seconds running Spark in local mode.
The master is set as yarn-client, and both the machine that submits the
Spark job and the YARN cluster are in the same domain.
Originally I
Thanks for the suggestion Ayan, it has not solved my problem, but I did get
sqlContext to execute the SQL and return a dataframe object. SQL is running fine
in the pyspark interpreter but not passing to the SQL note (though it works fine
for a different dataset) - guess I'll take this question to the
Could you upload the spark assembly to HDFS and then set spark.yarn.jar to the
path where you uploaded it? That can help minimize start-up time. How long does
it take if you start just a spark shell?
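A sketch of that setting (the HDFS path is a placeholder); it can equally go
in spark-defaults.conf or on the spark-submit command line:
  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // Pre-uploaded assembly; avoids re-uploading it to YARN for every job.
    .set("spark.yarn.jar", "hdfs:///user/spark/share/lib/spark-assembly.jar")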
On 5/11/15, 11:15 AM, stanley wangshua...@yahoo.com wrote:
I am running Spark jobs on YARN cluster. It
I am building an analytics app with Spark. I plan to use long-lived
SparkContexts to minimize the overhead for creating Spark contexts, which in
turn reduces the analytics query response time.
The number of queries that are run in the system is relatively small each
day. Would long lived contexts
I want to start a child-thread in foreachRDD.
My situation is:
the job is reading from a hdfs dir continuously, and every 100 batches, I
want to launch a model training task (I will make a snapshot of the RDDs at
that time and start the training task. The training task takes a very long
time (2
Typically you would use '.' notation to access it, the same way you would access a
map.
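A small sketch of that; the outer struct column name below is a placeholder,
since the full schema isn't shown:
  // If the inner fields live under a struct column called "event":
  df.select("event.THROTTLING_PERCENTAGE", "event.IMPRESSION_TYPE").show()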
On 12 May 2015 00:06, Ashish Kumar Singh ashish23...@gmail.com wrote:
Hi ,
I am trying to read Nested Avro data in Spark 1.3 using DataFrames.
I need help to retrieve the Inner element data in the Structure below.
In this example, everything works except saving to a parquet file.
On Mon, May 11, 2015 at 4:39 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
MyDenseVectorUDT does exist in the assembly jar, and in this example all the
code is in a single file to make sure everything is included.
On Tue, Apr 21,
MyDenseVectorUDT does exist in the assembly jar, and in this example all the
code is in a single file to make sure everything is included.
On Tue, Apr 21, 2015 at 1:17 AM, Xiangrui Meng men...@gmail.com wrote:
You should check where MyDenseVectorUDT is defined and whether it was
on the classpath
It depends on how you want to run your application. You can always save every 100
batches as a data file and run another app to read those files. In that case
you have separate contexts and you will find both applications running
simultaneously in the cluster but on different JVMs. But if you do not want
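A rough sketch of that first option (counter, path and the name dstream are
illustrative, assuming dstream is the input DStream); foreachRDD runs on the
driver, so a simple counter is enough to trigger the snapshot every 100 batches:
  var batchCount = 0L
  dstream.foreachRDD { rdd =>
    batchCount += 1
    if (batchCount % 100 == 0) {
      // Snapshot the data so a separate training application can read it.
      rdd.saveAsTextFile(s"hdfs:///snapshots/batch-$batchCount")
    }
  }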
I've had good success with Splunk's eventgen.
https://github.com/coccyx/eventgen/blob/master/README.md
On May 11, 2015, at 00:05, Akhil Das
ak...@sigmoidanalytics.com wrote:
Have a look over here https://storm.apache.org/community.html
Thanks
Best Regards
On
Hi,
I'm in Spark 1.3.0 and my data is in DataFrames.
I need operations like sampleByKey(), sampleByKeyExact().
I saw the JIRA "Add approximate stratified sampling to DataFrame" (
https://issues.apache.org/jira/browse/SPARK-7157).
That's targeted for Spark 1.5; till that comes through, what's the
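Until that lands, one common workaround is to drop down to the pair-RDD API
and rebuild a DataFrame afterwards; a rough sketch, assuming a string "key"
column and a long "value" column (names and fractions are illustrative):
  import org.apache.spark.sql.Row

  val pairs = df.rdd.map { case Row(k: String, v: Long) => (k, v) }
  val fractions = Map("a" -> 0.1, "b" -> 0.5)   // per-stratum sampling fractions
  val sampled = pairs.sampleByKey(withReplacement = false, fractions)
  val sampledDf = sqlContext.createDataFrame(sampled.map { case (k, v) => Row(k, v) }, df.schema)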
Hi, devs,
I met a problem when using Spark to read two parquet files with two
different versions of schemas. For example, the first file has one field
with int type, while the same field in the second file is a long. I
thought Spark would automatically generate a merged schema (long), and use
that
As we all know, a partition in Spark is actually an Iterator[T]. For some
purpose, I want to treat each partition not as an Iterator but as one whole
object. For example, turn an Iterator[Int] into a
breeze.linalg.DenseVector[Int]. Thus I use the 'mapPartitions' API to achieve
this, however, during the
Got it to work on the cluster by changing the master to yarn-cluster
instead of local! I do have a couple of follow-up questions...
This is the example I was trying to
Note that O'Reilly Media has test prep materials in development.
The exam does include questions in Scala, Python, Java, and SQL -- and
frankly a number of the questions are about comparing or identifying
equivalent Spark techniques between two of those different languages. The
questions do not
You can use '--jars ' option of spark-submit to ship metrics-core jar.
Cheers
On Mon, May 11, 2015 at 2:04 PM, Lee McFadden splee...@gmail.com wrote:
Thanks Ted,
The issue is that I'm using packages (see spark-submit definition) and I
do not know how to add com.yammer.metrics:metrics-core
Awesome!!
Thank you Mr. Nathan,
Great to have a guide like you helping us all,
Regards,
Kartik
On May 11, 2015 5:07 PM, Paco Nathan cet...@gmail.com wrote:
Note that O'Reilly Media has test prep materials in development.
The exam does include questions in Scala, Python, Java, and SQL
Sean,
How does this model actually work? Let's say we want to run one job as N
threads executing one particular task, e.g. streaming data out of Kafka
into a search engine. How do we configure our Spark job execution?
Right now, I'm seeing this job running as a single thread. And it's quite a
BTW I think my comment was wrong as marcelo demonstrated. In
standalone mode you'd have one worker, and you do have one executor,
but his explanation is right. But, you certainly have execution slots
for each core.
Are you talking about your own user code? You can make threads, but
that's nothing
Relaying an answer from AMP director Mike Franklin:
One year into the lab we got a 5 yr Expeditions in Computing Award as part
of the White House Big Data initiative in 2012, so we extended the lab for a
year. We intend to start winding it down at the end of 2016, while
supporting existing
Thanks, Sean. This was not yet digested data for me :)
The number of partitions in a streaming RDD is determined by the
block interval and the batch interval. I have seen the bit on
spark.streaming.blockInterval
in the doc but I didn't connect it with the batch interval and the number
of
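For reference, the usual relationship for a receiver-based stream is roughly:
partitions per batch per receiver = batch interval / spark.streaming.blockInterval;
e.g., a 2 s batch with the default 200 ms block interval yields about 10
partitions per receiver.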
Since there is an array here you are probably looking for HiveQL's LATERAL
VIEW explode
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView
.
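A small sketch of that, using a HiveContext (table and column names are
placeholders):
  val exploded = hiveContext.sql(
    """SELECT id, item
      |FROM events
      |LATERAL VIEW explode(items) itemsTable AS item""".stripMargin)
  exploded.show()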
On Mon, May 11, 2015 at 7:12 AM, ayan guha guha.a...@gmail.com wrote:
Typically you would use . notation to access, same way you
BTW, I use spark 1.3.1, and already set
spark.sql.parquet.useDataSourceApi to false.
Schema merging is only supported when this flag is set to true (setting it
to false uses old code that will be removed once the new code is proven).
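So, as a sketch, switching the flag back should be enough for the merge to
kick in (assuming the rest of the setup stays the same):
  // Schema merging only applies on the data source API path.
  sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "true")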
Temporary tables are not displayed by SHOW TABLES until Spark 1.3.
On Mon, May 11, 2015 at 12:54 PM, Judy Nash judyn...@exchange.microsoft.com
wrote:
Hi,
How can I get a list of temporary tables via Thrift?
Have used thrift’s startWithContext and registered a temp table, but not
I believe the issue in b and c is that you call iter.size which actually is
going to flush the iterator so the subsequent attempt to put it into a
vector will yield 0 items. You could use an ArrayBuilder for example and
not need to rely on knowing the size of the iterator.
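A minimal sketch of that suggestion, assuming an RDD[Int] called rdd:
  import scala.collection.mutable.ArrayBuilder
  import breeze.linalg.DenseVector

  val vectors = rdd.mapPartitions { iter =>
    val builder = ArrayBuilder.make[Int]
    iter.foreach(builder += _)   // single pass over the iterator; no call to iter.size
    Iterator.single(DenseVector(builder.result()))
  }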
On Mon, May 11, 2015 at
That is mostly the YARN overhead. You're starting up a container for the AM
and executors, at least. That still sounds pretty slow, but the defaults
aren't tuned for fast startup.
On May 11, 2015 7:00 PM, Su She suhsheka...@gmail.com wrote:
Got it to work on the cluster by changing the master to
Looks like it is spending a lot of time doing hash probing. It could be a
number of the following:
1. hash probing itself is inherently expensive compared with rest of your
workload
2. murmur3 doesn't work well with this key distribution
3. quadratic probing (triangular sequence) with a
Hi Haopu,
actually here `key` is nullable because this is your input's schema :
scala result.printSchema
root
|-- key: string (nullable = true)
|-- SUM(value): long (nullable = true)
scala df.printSchema
root
|-- key: string (nullable = true)
|-- value: long (nullable = false)
I tried it with a
com.yammer.metrics.core.Gauge is in metrics-core jar
e.g., in master branch:
[INFO] | \- org.apache.kafka:kafka_2.10:jar:0.8.1.1:compile
[INFO] | +- com.yammer.metrics:metrics-core:jar:2.2.0:compile
Please make sure metrics-core jar is on the classpath.
On Mon, May 11, 2015 at 1:32 PM, Lee
Are there existing or under development versions/modules for streaming
messages out of RabbitMQ with SparkStreaming, or perhaps a RabbitMQ RDD?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-RabbitMQ-tp22852.html
Sent from the Apache Spark User
You have one worker with one executor with 32 execution slots.
On Mon, May 11, 2015 at 9:52 PM, dgoldenberg dgoldenberg...@gmail.com wrote:
Hi,
Is there anything special one must do, running locally and submitting a job
like so:
spark-submit \
--class com.myco.Driver \
Thanks Ted,
The issue is that I'm using packages (see spark-submit definition) and I do
not know how to add com.yammer.metrics:metrics-core to my classpath so
Spark can see it.
Should metrics-core not be part of
the org.apache.spark:spark-streaming-kafka_2.10:1.3.1 package so it can
work
Hi,
How can I get a list of temporary tables via Thrift?
Have used thrift's startWithContext and registered a temp table, but not seeing
the temp table/rdd when running show tables.
Thanks,
Judy
Thanks for catching this. I didn't read carefully enough.
It'd make sense to have the udaf result be non-nullable, if the exprs are
indeed non-nullable.
On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot ssab...@gmail.com wrote:
Hi Haopu,
actually here `key` is nullable because this is your
Not by design. Would you be interested in submitting a pull request?
On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote:
I try to get the result schema of aggregate functions using DataFrame
API.
However, I find the result field of groupBy columns are always nullable
even
Hey there,
I have installed a Python interpreter in a certain location, say
/opt/local/anaconda.
Is there any way that I can specify the Python interpreter while
developing in an IPython notebook? Maybe a property to set while creating the
SparkContext?
I know that I can put #!/opt/local/anaconda
Hi,
We've been having some issues getting spark streaming running correctly
using a Kafka stream, and we've been going around in circles trying to
resolve this dependency.
Details of our environment and the error below, if anyone can help resolve
this it would be much appreciated.
Submit
Are you actually running anything that requires all those slots? e.g.,
locally, I get this with local[16], but only after I run something that
actually uses those 16 slots:
Executor task launch worker-15 daemon prio=10 tid=0x7f4c80029800
nid=0x8ce waiting on condition [0x7f4c62493000]
Understood. We'll use the multi-threaded code we already have..
How are these execution slots filled up? I assume each slot is dedicated to
one submitted task. If that's the case, how is each task distributed then,
i.e. how is that task run in a multi-node fashion? Say 1000 batches/RDD's
are
Ah yes, the Kafka + streaming code isn't in the assembly, is it? You'd
have to provide it and all its dependencies with your app. You could
also build this into your own app jar. Tools like Maven will add in
the transitive dependencies.
On Mon, May 11, 2015 at 10:04 PM, Lee McFadden
Creating dataframes and unioning them looks reasonable.
thanks,
Wei
On Mon, May 11, 2015 at 6:39 PM, Michael Armbrust mich...@databricks.com
wrote:
Ah, yeah sorry. I should have read closer and realized that what you are
asking for is not supported. It might be possible to add simple
I doubt that will make it as we are pretty slammed with other things and
the author needs to address the comments / merge conflict still.
I'll add that in general I recommend users use the HiveContext, even if
they aren't using Hive at all. It's a strict superset of the functionality
provided by
Seems to be running OK with 4 threads, 16 threads... While running with 32
threads I started getting the below.
15/05/11 19:48:46 WARN executor.Executor: Issue communicating with driver
in heartbeater
org.apache.spark.SparkException: Error sending message [message =
Note that `object` is equivalent to a class full of static fields / methods
(in Java), so the data it holds will not be serialized, ever.
What you want is a config class instead, so you can instantiate it, and
that instance can be serialized. Then you can easily do (1) or (3).
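A minimal sketch of that (class, fields and values are placeholders, assuming
an RDD[Double] called numbers):
  case class JobConfig(inputPath: String, threshold: Double)   // instances are serializable, unlike a bare `object`

  val config = JobConfig("hdfs:///data/in", 0.5)                // placeholder values
  val filtered = numbers.filter(x => x > config.threshold)      // `config` is captured and shipped with the closure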
On Mon, May 11,
Ted, many thanks. I'm not used to Java dependencies so this was a real
head-scratcher for me.
Downloading the two metrics packages from the maven repository
(metrics-core, metrics-annotation) and supplying it on the spark-submit
command line worked.
My final spark-submit for a python project
Hi,
Can the standalone cluster manager provide I/O information on worker nodes?
If not, could you point out the proper file to modify to
achieve that functionality?
Besides, does Mesos support that?
Regards.
Can someone explain to me the difference between DStream union and
StreamingContext union?
When do you use one vs the other?
Thanks,
Vadim
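For reference, a minimal sketch of the two, assuming three DStreams of the
same type and a StreamingContext ssc: DStream.union merges exactly two
streams, while StreamingContext.union takes a whole collection, which is
handy when the number of streams is dynamic (e.g. several receivers):
  val merged2 = stream1.union(stream2)                      // DStream.union: exactly two streams
  val mergedN = ssc.union(Seq(stream1, stream2, stream3))   // StreamingContext.union: a Seq of streams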
Michael – Thanks for the response – that's right, I hadn't noticed that Spark
Shell instantiates sqlContext as a HiveContext, not an actual Spark SQL Context…
I've seen the PR to add STDDEV to data frames. Can I expect this to be added
to Spark SQL in Spark 1.4, or is it still uncertain? It would
Had the same question on stackoverflow recently
http://stackoverflow.com/questions/30008127/how-to-read-a-nested-collection-in-spark
Lomig Mégard had a detailed answer on how to do this without using LATERAL
VIEW.
On Mon, May 11, 2015 at 8:05 AM, Ashish Kumar Singh ashish23...@gmail.com
wrote:
Hi,
We have a 3-node master setup with ZooKeeper HA.
Driver can find the master with spark://xxx:xxx,xxx:xxx,xxx:xxx
But how can I find out the valid Master UI without looping through all 3
nodes?
Thanks
Ah, yeah sorry. I should have read closer and realized that what you are
asking for is not supported. It might be possible to add simple coercions
such as this one, but today, compatible schemas must only add/remove
columns and cannot change types.
You could try creating different dataframes
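A rough sketch of that approach, assuming a field called "id" that is an int
in the first file and a long in the second (names and paths are placeholders):
  import org.apache.spark.sql.functions.col

  val first = sqlContext.parquetFile("/data/v1.parquet")
  // Cast the int field up to long so the two schemas line up.
  val firstWidened = first.select(first.columns.map {
    case "id" => col("id").cast("long").as("id")
    case c    => col(c)
  }: _*)
  val second = sqlContext.parquetFile("/data/v2.parquet")
  val merged = firstWidened.unionAll(second)   // schemas now match, so the union is allowed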
I opened a ticket on this (without posting here first - bad etiquette,
apologies) which was closed as 'fixed'.
https://issues.apache.org/jira/browse/SPARK-7538
I don't believe that having my script running means this is fixed;
I think it is still an issue.
I downloaded the spark source,
After upgrading to spark 1.3, these statements on hivecontext are working
fine. Thanks
On Mon, May 11, 2015, 12:15 Ravindra ravindra.baj...@gmail.com wrote:
Hi All,
Thanks for suggestions. What I tried is -
hiveContext.sql (add jar ) and that helps to complete the create
temporary
Also check out the spark.cleaner.ttl property. Otherwise, you will accumulate
shuffle metadata in the memory of the driver.
-Original Message-
From: Silvio Fiorito
[silvio.fior...@granturing.com]
Sent: Monday, May 11,
Hi,
I'm trying to compile and use Spark 1.3.1 with Hadoop 2.2.0. I compiled
from source with the Maven command:
mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0
-Phive-thriftserver -DskipTests clean package
When I run this with a local master (bin/spark-shell --master local[2]) I
You want to look at dynamic resource allocation, here:
http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
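A sketch of the relevant settings (values are illustrative; an external
shuffle service also has to be enabled on the YARN node managers):
  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "1")
    .set("spark.dynamicAllocation.maxExecutors", "20")
    .set("spark.shuffle.service.enabled", "true")   // required for dynamic allocation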
On 5/11/15, 11:23 AM, stanley wangshua...@yahoo.com wrote:
I am building an analytics app with Spark. I plan to use long-lived
SparkContexts to minimize
Hi,
I'm running TwitterPopularTags.scala on a single node.
Everything works fine for a while (about 30min),
but after a while I see a long processing delay for tasks, and it keeps
increasing.
Has anyone experienced the same issue?
Here are my configurations:
spark.driver.memory