Hi Daniel,
As Peter pointed out you need the hadoop-azure JAR as well as the Azure storage
SDK for Java (com.microsoft.azure:azure-storage). Even though the WASB driver
is built for 2.7, I was still able to use the hadoop-azure JAR with Spark built
for older Hadoop versions, back to 2.4 (see /apache/spark/SparkContext.scala#L1053).
And the HDFS FileInputFormat implementation, that seems like a good option to
try.
You should be able to call conf.setLong(FileInputFormat.SPLIT_MAXSIZE, max).
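A minimal sketch of that call, assuming the data is read with the new-API text input format (the one that honours SPLIT_MAXSIZE) and that the 64 MB cap and WASB path are placeholder values:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

// Cap each input split at 64 MB so the read produces more partitions, and therefore more tasks.
sc.hadoopConfiguration.setLong(FileInputFormat.SPLIT_MAXSIZE, 64L * 1024 * 1024)
val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
  "wasb://container@account.blob.core.windows.net/data/").map(_._2.toString)
println(lines.partitions.length)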
I hope that helps!
From: ÐΞ€ρ@Ҝ (๏̯͡๏)
Date: Thursday, June 25, 2015 at 5:49 PM
To: Silvio Fiorito
Cc: user
Subject: Re: how to increase parallelism ?
What that did was run a repartition with 174 tasks
AND
the actual .filter.map stage with 500 tasks.
It actually doubled the stages.
On Wed, Jun 24, 2015 at 12:01 PM, Silvio Fiorito
silvio.fior
Hi Deepak,
Parallelism is controlled by the number of partitions. In this case, how many
partitions are there for the details RDD (likely 170)?
You can check by running “details.partitions.length”. If you want to increase
parallelism you can do so by repartitioning, increasing the number of
partitions. Have you considered mapPartitions or one of the other combineByKey APIs?
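A minimal sketch of the check and the repartition, using the details RDD named above (500 is just an illustrative target):

println(details.partitions.length)        // current partition count (likely 170 here)
val details500 = details.repartition(500) // shuffles into 500 partitions, so downstream stages run 500 tasks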
From: Jianguo Li
Date: Tuesday, June 23, 2015 at 9:46 AM
To: Silvio Fiorito
Cc: user@spark.apache.org
Subject: Re: workaround for groupByKey
Thanks. Yes, unfortunately, they all need to be grouped. I guess I can
Yes, just put the Cassandra connector on the Spark classpath and set the
connector config properties in the interpreter settings.
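For reference, a minimal sketch of the kind of connector property meant here (the host is a placeholder; in a notebook you would set this in the interpreter settings rather than in code):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "10.0.0.5")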
From: Mohammed Guller
Date: Monday, June 22, 2015 at 11:56 AM
To: Matthew Johnson, shahid ashraf
Cc: user@spark.apache.org
Subject: RE:
perhaps?
From: Jianguo Li
Date: Monday, June 22, 2015 at 6:21 PM
To: Silvio Fiorito
Cc: user@spark.apache.org
Subject: Re: workaround for groupByKey
Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey.
I read in the Learning Spark
We can
Hi Gerard,
Yes, you have to implement your own custom Metrics Source using the Coda Hale
library. See here for some examples:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/source/JvmSource.scala
Sorry, replied to Gerard’s question vs yours.
See here:
Yes, you have to implement your own custom Metrics Source using the Coda Hale
library. See here for some examples:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/source/JvmSource.scala
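A minimal sketch of such a Source, patterned loosely on the JvmSource linked above (the source name and gauge are hypothetical, and since the Source trait is private[spark] in some versions this class may need to live in an org.apache.spark package):

import java.util.concurrent.atomic.AtomicLong
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.spark.metrics.source.Source

object RecordCounter { val value = new AtomicLong(0) }

class MyAppSource extends Source {
  override val sourceName: String = "myApp"
  override val metricRegistry: MetricRegistry = new MetricRegistry()

  // Expose the counter as a gauge, in the same style as JvmSource's registrations.
  metricRegistry.register(MetricRegistry.name("recordsProcessed"), new Gauge[Long] {
    override def getValue: Long = RecordCounter.value.get()
  })
}

Registering the instance with Spark's metrics system (via the metrics configuration) is a separate step not shown here.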
If you look at your streaming app UI you should see how many tasks are executed
each batch and on how many executors. This is dependent on the batch duration
and block interval, which defaults to 200ms. So every block interval a
partition will be generated. You can control the parallelism by adjusting the
block interval (spark.streaming.blockInterval).
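A minimal sketch of that knob, with assumed batch duration and interval values (with a 10 s batch and a 100 ms block interval, each receiver produces roughly 100 partitions per batch):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-parallelism")
  .set("spark.streaming.blockInterval", "100ms") // smaller interval => more blocks (partitions) per batch
val ssc = new StreamingContext(conf, Seconds(10))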
Are you sure you were using all 100 executors even with the receiver model?
Because in receiver mode, the number of partitions is dependent on the batch
duration and block interval. It may not necessarily map directly to the number
of executors in your app unless you've adjusted the block interval.
Hi, just answered in your other thread as well...
Depending on your requirements, you can look at the updateStateByKey API
From: Nipun Arora
Date: Wednesday, June 17, 2015 at 10:51 PM
To: user@spark.apache.org
Subject: Iterative Programming by keeping data across
Depending on your requirements, you can look at the updateStateByKey API
From: Nipun Arora
Date: Wednesday, June 17, 2015 at 10:48 PM
To: user@spark.apache.org
Subject: no subject
Hi,
Is there any way in Spark Streaming to keep data across multiple micro-batches?
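A minimal sketch of the updateStateByKey suggestion above, assuming events is a DStream[(String, Int)] and ssc is the StreamingContext (the checkpoint path is a placeholder):

ssc.checkpoint("/tmp/checkpoints")  // stateful operations require a checkpoint directory

// Carry a running per-key count across micro-batches.
val runningCounts = events.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newValues.sum)
}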
That’s due to the config setting spark.sql.shuffle.partitions, which defaults to 200.
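A minimal sketch of overriding it, assuming a SQLContext named sqlContext (12 is just an illustrative value for small inputs):

// Joins and aggregations on DataFrames will then shuffle into 12 partitions instead of 200.
sqlContext.setConf("spark.sql.shuffle.partitions", "12")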
From: Masf
Date: Thursday, May 28, 2015 at 10:02 AM
To: user@spark.apache.orgmailto:user@spark.apache.org
Subject: Dataframe Partitioning
Hi.
I have 2 DataFrames with 1 and 12 partitions respectively. When I do
This was added in 1.4.0:
https://github.com/apache/spark/pull/5762
On 5/22/15, 8:48 AM, Karlson ksonsp...@siberie.de wrote:
Hi,
wouldn't df.rdd.partitionBy() return a new RDD that I would then need to
make into a DataFrame again? Maybe like this:
Could you upload the spark assembly to HDFS and then set spark.yarn.jar to the
path where you uploaded it? That can help minimize start-up time. How long if
you start just a spark shell?
On 5/11/15, 11:15 AM, stanley wangshua...@yahoo.com wrote:
I am running Spark jobs on YARN cluster. It
You want to look at dynamic resource allocation, here:
http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
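A minimal sketch of the relevant settings from that page (the executor bounds are assumed values; on YARN the external shuffle service must also be set up on the NodeManagers):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN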
On 5/11/15, 11:23 AM, stanley wangshua...@yahoo.com wrote:
I am building an analytics app with Spark. I plan to use long-lived
SparkContexts to minimize
If you’re using multiple masters with ZooKeeper then you should set your master
URL to be
spark://host01:7077,host02:7077
And the property spark.deploy.recoveryMode=ZOOKEEPER
See here for more info:
http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper
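A minimal sketch of the application side (spark.deploy.recoveryMode and the ZooKeeper URL are configured on the master processes themselves, not in the app):

import org.apache.spark.{SparkConf, SparkContext}

// Listing both masters lets the driver fail over to whichever one is currently active.
val conf = new SparkConf()
  .setMaster("spark://host01:7077,host02:7077")
  .setAppName("ha-app")
val sc = new SparkContext(conf)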
Hi James,
If you’re on Spark 1.3 you can use the kill command in spark-submit to shut it
down. You’ll need the driver id from the Spark UI or from when you submitted
the app.
spark-submit --master spark://master:7077 --kill driver-id
Thanks,
Silvio
From: James King
Date: Wednesday, May 6,
I think you'd probably want to look at combineByKey. I'm on my phone so can't
give you an example, but that's one solution I would try.
You would then take the resulting RDD and go back to a DF if needed.
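For reference, a minimal sketch of the combineByKey route mentioned above, assuming an RDD[(String, Double)] named pairs and a per-key average as the goal:

val sumsAndCounts = pairs.combineByKey(
  (v: Double) => (v, 1L),                                               // createCombiner
  (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),        // mergeValue
  (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2)) // mergeCombiners
val avgByKey = sumsAndCounts.mapValues { case (sum, count) => sum / count }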
From: bipin bipin@gmail.com
Sent: 4/29/2015
on the streaming side it really
helps simplify running tests on batch outputs.
If you’re having serialization issues you may need to look at using transient
lazy initializers, to see if that helps?
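A minimal sketch of the transient lazy initializer pattern mentioned above (the JDBC connection is just an assumed example of a non-serializable resource):

class RecordWriter(url: String) extends Serializable {
  // Not serialized with the closure; each executor initializes it lazily on first use.
  @transient private lazy val conn = java.sql.DriverManager.getConnection(url)

  def write(record: String): Unit = {
    val stmt = conn.prepareStatement("INSERT INTO records VALUES (?)")
    stmt.setString(1, record)
    stmt.executeUpdate()
    stmt.close()
  }
}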
From: Michal Michalski
Date: Tuesday, April 28, 2015 at 11:42 AM
To: Silvio Fiorito
Cc: user
Subject: Re: Best
From: Marius Danciu
Date: Tuesday, April 28, 2015 at 9:53 AM
To: Silvio Fiorito, user
Subject: Re: Spark partitioning question
Thank you Silvio,
I am aware of groupBy limitations and this is subject to replacement.
I did try repartitionAndSortWithinPartitions but then I end up with maybe too
If you need to keep the keys, you can use aggregateByKey to calculate an avg of
the values:
val step1 = data.aggregateByKey((0.0, 0))(
  (a, b) => (a._1 + b, a._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2))
val avgByKey = step1.mapValues(i => i._1 / i._2)
Essentially, what this is doing is passing an
Hi Michal,
Please try spark-testing-base by Holden. I’ve used it and it works well for
unit testing batch and streaming jobs
https://github.com/holdenk/spark-testing-base
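A minimal sketch of a batch test with spark-testing-base (assumed artifact com.holdenkarau:spark-testing-base; the SharedSparkContext trait provides the sc used below):

import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class WordCountSpec extends FunSuite with SharedSparkContext {
  test("counts words across an RDD") {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") === 2)
  }
}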
Thanks,
Silvio
From: Michal Michalski
Date: Tuesday, April 28, 2015 at 11:32 AM
To: user
Subject: Best practices on
to the first statement in
Silvio's answer (below). But my
reasoning to deconstruct and reconstruct is missing something.
Thanks again!
On 04/28/2015 11:26 AM, Silvio Fiorito wrote:
If you need to keep the keys, you can use aggregateByKey to calculate an avg of
the values:
val step1
Have you tried Apache Commons Daemon?
http://commons.apache.org/proper/commons-daemon/procrun.html
From: Wang, Ningjun (LNG-NPV)
Date: Tuesday, March 10, 2015 at 11:47 PM
To: user@spark.apache.org
Subject: Is it possible to use windows service to start and stop spark
Great, glad it worked out!
From: Todd Nist
Date: Thursday, February 19, 2015 at 9:19 AM
To: Silvio Fiorito
Cc: user@spark.apache.org
Subject: Re: SparkSQL + Tableau Connector
Hi Silvio,
I got this working today using your suggestion with the Initial SQL and a
Custom
on yarn you need to first go to the resource manager UI, find your job, and
click the link for the UI there.
From: Puneet Kumar Ojha puneet.ku...@pubmatic.com
Sent: Saturday, February 14, 2015 5:25 AM
To: user@spark.apache.org
Hi,
I am running 3
One method I’ve used is to publish each batch to a message bus or queue with a
custom UI listening on the other end, displaying the results in d3.js or some
other app. As far as I’m aware there isn’t a tool that will directly take a
DStream.
Spark Notebook seems to have some support for
Hey Todd,
I don’t have an app to test against the thrift server, are you able to define
custom SQL without using Tableau’s schema query? I guess it’s not possible to
just use SparkSQL temp tables, you may have to use permanent Hive tables that
are actually in the metastore so Tableau can
Hi Todd,
What you could do is run some SparkSQL commands immediately after the Thrift
server starts up. Or does Tableau have some init SQL commands you could run?
You can actually load data using SQL, such as:
create temporary table people using org.apache.spark.sql.json options (path /*');
Time taken: 0.34 seconds
spark-sql> select * from people;
NULL  Michael
30    Andy
19    Justin
Time taken: 0.576 seconds
From: Todd Nist
Date: Tuesday, February 10, 2015 at 6:49 PM
To: Silvio Fiorito
Cc: user@spark.apache.org
Subject
Nick,
Have you tried https://github.com/kaitoy/pcap4j
I’ve used this in a Spark app already and didn’t have any issues. My use case
was slightly different than yours, but you should give it a try.
From: Nick Allen n...@nickallen.org
Date: Friday, January 16, 2015 at
Hi Manjul,
Each StreamingContext will have its own batch size. If that doesn’t work
for the different sources you have then you would have to create different
streaming apps. You can only create a new StreamingContext in the same
Spark app, once you’ve stopped the previous one.
Spark
Is this through a streaming app? I've done this before by publishing results
out to a queue or message bus, with a web app listening on the other end. If
it's just batch or infrequent you could save the results out to a file.
From:
Batches will wait for the previous batch to finish. The monitoring console will
show you the backlog of waiting batches.
From: Asim Jalis asimja...@gmail.com
Date: Friday, December 19, 2014 at 1:16 PM
To: user user@spark.apache.org
Subject:
Hi Pierce,
You shouldn’t have to use groupByKey because updateStateByKey will get a
Seq of all the values for that key already.
I used that for realtime sessionization as well. What I did was key my
incoming events, then send them to updateStateByKey. The updateStateByKey
function then
It’s on Maven Central already http://search.maven.org/#browse%7C717101892
On 12/18/14, 2:09 PM, Al M alasdair.mcbr...@gmail.com wrote:
Is there a planned release date for Spark 1.2? I saw on the Spark Wiki
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage that we are
the expected output, thank you!
On Thu, Dec 18, 2014 at 12:11 PM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
Ok, I have a better idea of what you’re trying to do now.
I think the problem might be the map. The first time the function runs,
currentValue will be None. Using map on None
If you no longer need to maintain state for a key, just return None for that
value and it gets removed.
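A minimal sketch combining both points, assuming a DStream[(String, Long)] named events (the removal condition is only illustrative):

def updateFunc(newValues: Seq[Long], state: Option[Long]): Option[Long] = {
  val total = state.getOrElse(0L) + newValues.sum // getOrElse handles the first batch, when state is None
  if (newValues.isEmpty && total == 0L) None      // returning None removes the key from the state
  else Some(total)
}

val totals = events.updateStateByKey(updateFunc _)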
From: Sunil Yarram yvsu...@gmail.com
Date: Friday, December 12, 2014 at 9:44 PM
To: user@spark.apache.org
Geovani,
You can use HiveContext to do inserts into a Hive table in a Streaming app just
as you would a batch app. A DStream is really a collection of RDDs so you can
run the insert from within the foreachRDD. You just have to be careful that
you’re not creating large amounts of small files.
You can use saveAsTable or do an INSERT SparkSQL statement as well in case
you need other Hive query features, like partitioning.
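A minimal sketch of that pattern, assuming a DStream[Event] named stream and an existing Hive table named events (the case class, coalesce value, and table names are placeholders):

import org.apache.spark.sql.hive.HiveContext

case class Event(id: Long, payload: String)

stream.foreachRDD { rdd =>
  // In practice reuse a single HiveContext rather than building one per batch.
  val hiveContext = new HiveContext(rdd.sparkContext)
  import hiveContext.implicits._

  // Coalesce before writing so each batch doesn't leave behind many tiny files.
  rdd.coalesce(4).toDF().registerTempTable("batch_events")
  hiveContext.sql("INSERT INTO TABLE events SELECT * FROM batch_events")
}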
On 9/2/14, 6:54 AM, centerqi hu cente...@gmail.com wrote:
I got it
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
val hiveContext =
What Spark and Hadoop versions are you on? I have it working in my Spark
app with the parquet-hive-bundle-1.5.0.jar bundled into my app fat-jar.
I'm running Spark 1.0.2 and CDH5. Could you try
bin/spark-shell --master local[*] --driver-class-path ~/parquet-hive-bundle-1.5.0.jar
to see if that works?
On
This is a long running Spark Streaming job running in YARN, Spark v1.0.2 on
CDH5. The jobs will run for about 34-37 hours then die due to this
FileNotFoundException. There’s very little CPU or RAM usage, I’m running 2 x
cores, 2 x executors, 4g memory, YARN cluster mode.
Here’s the stack
Thanks, I’ll go ahead and disable that setting for now.
From: Aaron Davidson ilike...@gmail.com
Date: Wednesday, August 20, 2014 at 3:20 PM
To: Silvio Fiorito
silvio.fior...@granturing.com
Cc: user@spark.apache.org
You need to create a custom receiver that submits the HTTP requests then
deserializes the data and pushes it into the Streaming context.
See here for an example:
http://spark.apache.org/docs/latest/streaming-custom-receivers.html
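A minimal sketch in the spirit of that guide, assuming a simple HTTP endpoint polled at a fixed interval (the URL, parsing, and interval are all placeholders):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import scala.io.Source

class HttpPollingReceiver(url: String, intervalMs: Long)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Poll on a separate thread so onStart returns immediately.
    new Thread("http-poller") {
      override def run(): Unit = {
        while (!isStopped()) {
          val body = Source.fromURL(url).mkString // deserialize/parse the response here as needed
          store(body)                             // push the record into the streaming context
          Thread.sleep(intervalMs)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {} // the polling loop exits once isStopped() is true
}

You would then hook it up with ssc.receiverStream(new HttpPollingReceiver("http://host/api", 1000)).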
On 8/18/14, 6:20 PM, bumble123 tc1...@att.com wrote:
I want to
First the JAR needs to be deployed using the --jars argument. Then in your
HQL code you need to use the DeprecatedParquetInputFormat and
DeprecatedParquetOutputFormat as described here:
https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12
This is because SparkSQL is based
There's really nothing special besides including that jar on your classpath.
You just do selects, inserts, etc as you normally would.
The same instructions here apply
https://cwiki.apache.org/confluence/display/Hive/Parquet
From:
If you're using HiveContext then all metadata is in the Hive metastore as
defined in hive-site.xml.
Concurrent writes should be fine as long as you're using a concurrent metastore
db.
From: Flavio Pompermaier pomperma...@okkam.it
Sent: 8/16/2014 1:26 PM
Yes, you can write to Parquet tables. On Spark 1.0.2 all I had to do was
include the parquet-hive-bundle-1.5.0.jar on my classpath.
From: lyc yanchen@huawei.com
Sent: Friday, August 15, 2014 7:30 PM
To: u...@spark.incubator.apache.org
You also need to ensure you're using checkpointing and support recreating the
context on driver failure as described in the docs here:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node
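A minimal sketch of the checkpoint-and-recreate pattern from those docs (the checkpoint path and batch duration are assumed):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-streaming-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("recoverable-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // define the input DStreams and transformations here
  ssc
}

// On driver restart the context (and DStream graph) is rebuilt from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()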
From: Matt Narrell
Using the SchemaRDD insertInto method, is there any way to support partitions
on a field in the RDD? If not, what's the alternative, register a table and do
an insert into via SQL statement? Any plans to support partitioning via
insertInto? What other options are there for inserting into a
Yin, Michael,
Thanks, I'll try the SQL route.
From: Yin Huai huaiyin@gmail.com
Date: Wednesday, August 13, 2014 at 5:04 PM
To: Michael Armbrust mich...@databricks.com
Cc: Silvio Fiorito
silvio.fior...@granturing.com
Is there a repo somewhere with the code for the Hive dependencies (hive-exec,
hive-serde, hive-metastore) used in SparkSQL? Are they forked with
Spark-specific customizations, like Shark, or simply relabeled with a new
package name (org.spark-project.hive)? I couldn't find any repos on Github
- Patrick
On Fri, Jun 6, 2014 at 12:08 PM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
Is there a repo somewhere with the code for the Hive dependencies
(hive-exec, hive-serde, hive-metastore) used in SparkSQL? Are they forked
with Spark-specific customizations, like Shark, or simply