Hi all,
I'm trying to run the standalone application SimpleApp.scala following the
instructions at
http://spark.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
I was able to create a .jar file by running sbt package. However, when I tried
to run
$
Hello,
I tested some custom UDFs on Spark SQL's Thrift Server via Beeline (Spark 1.3.1).
Some UDFs work fine (accepting array parameters and returning int or string
types), but my UDF that returns a map type throws an error:
Error: scala.MatchError: interface java.util.Map (of class java.lang.Class)
I see these in my mapper-only task:
- Input Size / Records: 68.0 GB / 577195178
- Shuffle write: 95.1 GB / 282559291
- Shuffle spill (memory): 2.8 TB
- Shuffle spill (disk): 90.3 GB
I understand the first one; can someone give one- or two-liners explaining the next
three? Also
Which version of the Hive jar are you using? Hive 0.13.1 or Hive 0.12.0?
-Original Message-
From: ogoh [mailto:oke...@gmail.com]
Sent: Friday, June 5, 2015 10:10 AM
To: user@spark.apache.org
Subject: SparkSQL : using Hive UDF returning Map throws error:
scala.MatchError: interface
Hi,
org.apache.spark.mllib.linalg.Vector =
(1048576,[35587,884670],[3.458767233,3.458767233])
It is the sparse vector representation of the terms:
the first number (1048576) is the length of the vector,
[35587,884670] are the indices of the words, and
[3.458767233,3.458767233] are the tf-idf values of those terms.
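For illustration, the same vector could be constructed explicitly like this (a minimal sketch; the numbers are just the values quoted above):

import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.sparse(1048576, Array(35587, 884670), Array(3.458767233, 3.458767233))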
Thanks
I see the following. Is this a problem with my code or with the cluster? Is there
any way to fix it?
FetchFailed(BlockManagerId(2, phxdpehdc9dn2441.stratus.phx.ebay.com,
59574), shuffleId=1, mapId=80, reduceId=20, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to
Hello,
I have a question about the Spark 1.4 ml library. In the copy function, it is
stated that the default implementation from Params doesn't work for models.
(
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Model.scala#L49
)
As a result, some
Hi, I have an RDD with MANY columns (e.g., hundreds), and most of my operations
are on columns, e.g., I need to create many intermediate variables from
different columns. What is the most efficient way to do this?
For example, if my dataRDD[Array[String]] is like below:
123, 523, 534, ..., 893
Hi Lee,
You should be able to create a PairRDD using the Nonce as the key, and the
AnalyticsEvent as the value. I'm very new to Spark, but here is some
uncompilable pseudo code that may or may not help:
events.map(event => (event.getNonce, event)).reduceByKey((a, b) => a).map(_._2)
The above code
My current example doesn't use a Hive UDAF, but you would do something
pretty similar (it calls a new user defined UDAF, and there are wrappers to
make Spark SQL UDAFs from Hive UDAFs but they are private). So this is
doable, but since it pokes at internals it will likely break between
versions
But compile scope is supposed to be added to the assembly.
https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope
On Thu, Jun 4, 2015 at 1:24 PM, algermissen1971 algermissen1...@icloud.com
wrote:
Hi Iulian,
On 26 May 2015, at 13:04, Iulian
You can use it as a broadcast variable, but if it's too large (more than
1 GB, I guess), you may need to share it by joining it to the other RDDs using
some kind of key.
But this is the kind of thing broadcast variables were designed for.
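For example, a minimal sketch of sharing a read-only map via a broadcast variable (the dictionary contents and names here are only illustrative):

val dictionary: Map[String, Int] = Map("spark" -> 1, "kafka" -> 2)
val dictBc = sc.broadcast(dictionary)

val enriched = sc.parallelize(Seq("spark", "flink", "kafka"))
  .map(word => (word, dictBc.value.getOrElse(word, 0)))  // every executor reuses the broadcast copy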
Regards,
Olivier.
On Thu, Jun 4, 2015 at 23:50, dgoldenberg
Hi there!
I'm trying to read a large .csv file (14GB) into a dataframe from S3 via the
spark-csv package. I want to load this data in parallel, utilizing all 20
executors that I have; however, by default only 3 executors are being used
(which downloaded 5 GB, 5 GB, and 4 GB respectively).
Here is my script (I'm using
Is the dictionary read-only?
Did you look at
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables ?
-Original Message-
From: dgoldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Thursday, June 04, 2015 4:50 PM
To: user@spark.apache.org
Subject: How to share
For the first round, you will have 16 reducers working, since you have
32 partitions: each pair of the 32 partitions shares the same key and therefore
goes to the same reducer via reduceByKey.
After this step is done, you will have 16 partitions, so the next
round will have 8 reducers.
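For reference, a small sketch of a tree reduction with an explicit depth (the data, partition count, and depth are only illustrative):

val rdd = sc.parallelize(1 to 1000000, numSlices = 32)
val sum = rdd.treeReduce((a, b) => a + b, depth = 2)  // combine partition results through intermediate rounds instead of all at once on the driver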
Sincerely,
DB
Hi there,
I would recommend checking out
https://github.com/spark-jobserver/spark-jobserver which I think gives the
functionality you are looking for.
I haven't tested it though.
BR
On 5 June 2015 at 01:35, Olivier Girardot ssab...@gmail.com wrote:
You can use it as a broadcast variable, but
Thanks so much, Yiannis, Olivier, Huang!
On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas johngou...@gmail.com
wrote:
Hi there,
I would recommend checking out
https://github.com/spark-jobserver/spark-jobserver which I think gives
the functionality you are looking for.
I haven't tested it
We have some pipelines defined where sometimes we need to load potentially
large resources such as dictionaries.
What would be the best strategy for sharing such resources among the
transformations/actions within a consumer? Can they be shared somehow
across the RDDs?
I'm looking for a way to
I think if you create a bidirectional mapping from AnalyticsEvent to
another type that would wrap it and use the nonce as its equality, you
could then do something like reduceByKey to group by nonce and map back to
AnalyticsEvent after.
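A hypothetical sketch of that idea, assuming `events` is an RDD[AnalyticsEvent] (class and method names are assumptions from this thread; equality and hashCode depend only on the nonce):

case class EventByNonce(event: AnalyticsEvent) {
  override def equals(other: Any): Boolean = other match {
    case that: EventByNonce => this.event.getNonce == that.event.getNonce
    case _ => false
  }
  override def hashCode: Int = event.getNonce.hashCode
}

val deduped = events
  .map(e => (EventByNonce(e), e))     // key by the wrapper
  .reduceByKey((first, _) => first)   // keep one event per nonce
  .map(_._2)                          // map back to AnalyticsEvent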
On Thu, Jun 4, 2015 at 1:10 PM, lbierman
Yeah - We don't have support for running UDFs on DataFrames yet. There is
an open issue to track this https://issues.apache.org/jira/browse/SPARK-6817
Thanks
Shivaram
On Thu, Jun 4, 2015 at 3:10 AM, Daniel Emaasit daniel.emaa...@gmail.com
wrote:
Hello Shivaram,
Was the includePackage()
By default, the depth of the tree is 2. Each partition will be one node.
Sincerely,
DB Tsai
---
Blog: https://www.dbtsai.com
On Thu, Jun 4, 2015 at 10:46 AM, Raghav Shankar raghav0110...@gmail.com wrote:
Hey Reza,
Thanks for your response!
Deenar,
Thanks for the suggestion.
That is one of the ideas that I have, but I didn't get a chance to try it out yet.
One of the things that could potentially cause problems is that we use wide
rows. In addition, the schema is dynamic, with new columns getting added on a
regular basis. That is why
Hey Reza,
Thanks for your response!
Your response clarifies some of my initial thoughts. However, what I don't
understand is how the depth of the tree is used to identify how many
intermediate reducers there will be, and how many partitions are sent to
the intermediate reducers. Could you
vm.swappiness=0? Some vendors recommend setting this to 0 (zero), although I've
seen this cause even the kernel to fail to allocate memory.
It may cause a node reboot. If that's the case, set vm.swappiness to 5-10 and
decrease spark.*.memory. Your spark.driver.memory +
spark.executor.memory + OS + etc
The issue was that the SSH key generated on the Spark master was not transferred to
this new slave. The spark-ec2 script's `start` command omits this step. The
solution is to use the `launch` command with the `--resume` option. Then the SSH
key is transferred to the new slave and everything goes smoothly.
--
Thanks! It is working fine now with spark-submit. Just out of curiosity,
how would you use org.apache.spark.deploy.yarn.Client? Adding that
spark_yarn jar to the configuration inside the application?
On Thu, Jun 4, 2015 at 6:37 PM, Vova Shelgunov vvs...@gmail.com wrote:
You should run it with
That might work, but there might also be other steps that are required.
-Sandy
On Thu, Jun 4, 2015 at 11:13 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
Thanks! It is working fine now with spark-submit. Just out of curiosity,
how would you use org.apache.spark.deploy.yarn.Client? Adding that
Additionally, I think this document (
https://spark.apache.org/docs/latest/building-spark.html ) should mention
that the protobuf.version might need to be changed to match the one used in
the chosen hadoop version. For instance, with hadoop 2.7.0 I had to change
protobuf.version to 1.5.0 to be
Hi - I'm having a similar problem with switching from ephemeral to persistent
HDFS - it always looks for port 9000 regardless of the options I set for port
9010 for persistent HDFS. Have you figured out a solution? Thanks
I'm still a bit new to Spark and am struggling to figure out the best way to
dedupe my events.
I load my Avro files from HDFS and then I want to dedupe events that have
the same nonce.
For example my code so far:
JavaRDD<AnalyticsEvent> events = ((JavaRDD<AvroKey<AnalyticsEvent>>)
Hi DB,
Yes I am running count() operations on the previous steps and it appears that
something is slow prior to the scaler. I thought that running take(5) and printing
the results would execute the command at each step and materialize the RDD, but
is that not the case? That’s how I was testing
I would try setting PYSPARK_DRIVER_PYTHON environment variable to the
location of your python binary, especially if you are using a virtual
environment.
-Don
On Wed, Jun 3, 2015 at 8:24 PM, AlexG swift...@gmail.com wrote:
I have libskylark installed on both machines in my two node cluster in
take(5) will only evaluate enough partitions to provide 5 elements
(sometimes a few more but you get the idea), so it won't trigger a full
evaluation of all partitions unlike count().
On Thursday, June 4, 2015, Piero Cinquegrana pcinquegr...@marketshare.com
wrote:
Hi DB,
Yes I am running
It is possible to start multiple concurrent drivers; Spark dynamically
allocates ports per Spark application on the driver, master, and workers from
a port range. When you collect results back to the driver, they do not go
through the master. The master is mostly there as a coordinator between the
I talked to Don outside the list and he says that he's seeing this issue
with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a
real issue here.
On Wed, Jun 3, 2015 at 1:39 PM, Don Drake dondr...@gmail.com wrote:
As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I
Hi all!
I have a .txt file where each row is a collection of terms of a
document separated by spaces. For example:
1 "Hola spark"
2 ..
I followed this example from the Spark site
https://spark.apache.org/docs/latest/mllib-feature-extraction.html and I get
something like this:
tfidf.first()
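For reference, the flow from that guide that produces this looks roughly like the following (a sketch; the file path is a placeholder):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val documents: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)
val tf: RDD[Vector] = new HashingTF().transform(documents)  // default 2^20 = 1048576 features
tf.cache()
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)
tfidf.first()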
I have a spark app that reads avro sequence file data and performs join,
reduceByKey
Results:
Command for all runs:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
Thanks for the confirmation! We're quite new to Spark, so a little
reassurance is a good thing to have sometimes :-)
The thing that's concerning me at the moment is that my job doesn't seem to
run any faster with more compute resources added to the cluster, and this
is proving a little tricky to
So a few updates. When I run locally as stated before, it works fine. When I
run on YARN (via Apache Myriad on Mesos) it also runs fine. The only issue
is specifically with Mesos. I wonder if there is some sort of classpath
goodness I need to fix or something along those lines. Any tips would be
Hi
2015-06-04 15:29 GMT+02:00 James Aley james.a...@swiftkey.com:
Hi,
We have a load of Avro data coming into our data systems in the form of
relatively small files, which we're merging into larger Parquet files with
Spark. I've been following the docs and the approach I'm taking seemed
Hi,
I encountered a performance issue when joining 3 tables in Spark SQL.
Here is the query:
SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM t_category c, t_zipcode z, click_meter_site_grouped g
WHERE c.refCategoryID = g.category AND z.regionCode = g.region
I need to pay a
Hello,
I am relatively new to spark and I am currently trying to understand how to
scale large numbers of jobs with spark.
I understand that the Spark architecture is split into Driver, Master and
Workers. The Master has a standby node in case of failure and the workers can scale
out.
All the examples I have
Mohammed
Have you tried registering your Cassandra tables in Hive/Spark SQL using
the DataFrames API? These should then be available to query via the Spark
SQL/Thrift JDBC Server.
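For example, a rough sketch using the spark-cassandra-connector's DataFrame source with the Spark 1.4-style reader (the format name and option keys may differ across connector versions, and the keyspace/table names are placeholders):

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

df.registerTempTable("my_table")  // then queryable through the Thrift JDBC server's SQLContext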
Deenar
On 1 June 2015 at 19:33, Mohammed Guller moham...@glassbeam.com wrote:
Nobody using Spark SQL
Hi Yin,
I'm very surprised to hear that it's not supported in 1.3, because I've been
using it since 1.3.0.
It worked great up until SPARK-6908 was merged into master.
What is the supported way to get a DF for a table that is not in the default
database?
IMHO, if you are not going to support
Hi all,
I'm running Spark on AWS EMR and I'm having some issues getting the correct
permissions on the output files using
rdd.saveAsTextFile('file_dir_name'). In Hive, I would add a line at the
beginning of the script with
set fs.s3.canned.acl=BucketOwnerFullControl
and that would set the
Are you using RC4?
On Wed, Jun 3, 2015 at 10:58 PM, Night Wolf nightwolf...@gmail.com wrote:
Thanks Yin, that seems to work with the Shell. But on a compiled
application with Spark-submit it still fails with the same exception.
On Thu, Jun 4, 2015 at 2:46 PM, Yin Huai yh...@databricks.com
Hi all,
I am new to Spark. I am trying to deploy HDFS (hadoop-2.6.0) and Spark-1.3.1
with four nodes, and each node has 8 cores and 8GB of memory.
One is configured as the head node running the masters, and the 3 others are workers.
But when I try to run the PageRank workload from HiBench, it always causes a node to
Hi
Here's a working example: https://gist.github.com/akhld/b10dc491aad1a2007183
Thanks
Best Regards
On Wed, Jun 3, 2015 at 10:09 PM, dgoldenberg dgoldenberg...@gmail.com
wrote:
Hi,
I've got a Spark Streaming driver job implemented and in it, I register a
streaming
Hi,
I set spark.shuffle.io.preferDirectBufs to false in SparkConf and this
setting can be seen in the web UI's Environment tab. But it still eats memory,
i.e. -Xmx is set to 512M but RES grows to 1.5G in half a day.
On Wed, Jun 3, 2015 at 12:02 PM, Shixiong Zhu zsxw...@gmail.com wrote:
Could you
You should not call `jssc.stop(true);` in a StreamingListener. It will
cause a deadlock: `jssc.stop` won't return until `listenerBus` exits. But
since `jssc.stop` blocks the `StreamingListener`, `listenerBus` cannot exit.
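A hypothetical sketch of the alternative (assuming `ssc` is your StreamingContext): have the listener only signal completion, and call stop from the main thread:

import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

val finished = Promise[Unit]()

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    finished.trySuccess(())  // signal only; do NOT call ssc.stop() here
  }
})

ssc.start()
Await.result(finished.future, Duration.Inf)  // main thread waits for the signal
ssc.stop(true)                               // safe: not on the listener bus thread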
Best Regards,
Shixiong Zhu
2015-06-04 0:39 GMT+08:00 dgoldenberg
Hi,
I am using Hive 0.14 and Spark 0.13. I got a
java.lang.NullPointerException when inserting into Hive. Any suggestions,
please.
hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE +
",z=" + zz + ",year=" + YEAR + ",month=" + MONTH + ") " +
"select date, hh, x, y, height, u, v,
Thanks Yin, that seems to work with the Shell. But on a compiled
application with Spark-submit it still fails with the same exception.
On Thu, Jun 4, 2015 at 2:46 PM, Yin Huai yh...@databricks.com wrote:
Can you put the following setting in spark-defaults.conf and try again?
Hi,
I've worked out how to use explode on my input avro dataset with the
following structure
root
 |-- pageViewId: string (nullable = false)
 |-- components: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string (nullable = false)
 |    |    |--
Hi Brandon, they are available, but private to the ml package. They are now
public in 1.4. For 1.3.1 you can define your transformer in the
org.apache.spark.ml package - then you can use these traits.
Thanks,
Peter Rudenko
On 2015-06-04 20:28, Brandon Plaster wrote:
Is HasInputCol and HasOutputCol
Hi Doug,
sqlContext.table does not officially support a database name. It only
supports a table name as the parameter. We will add a method to support
database names in the future.
Thanks,
Yin
On Thu, Jun 4, 2015 at 8:10 AM, Doug Balog doug.sparku...@dugos.com wrote:
Hi Yin,
I’m very surprised to
Got it. Ignore my similar question on Github comments.
On Thu, Jun 4, 2015 at 11:48 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
Yeah - We don't have support for running UDFs on DataFrames yet. There is
an open issue to track this
Hey DB,
Thanks for the reply!
I still don't think this answers my question. For example, if I have a
top() action being executed and I have 32 workers (32 partitions), and I
choose a depth of 4, what does the overlay of intermediate reducers look
like? How many reducers are there excluding the
Replace this line:
img_data = sc.parallelize( list(im.getdata()) )
With:
img_data = sc.parallelize( list(im.getdata()), 3 * num_cores )  # num_cores = total number of cores you have
Thanks
Best Regards
On Thu, Jun 4, 2015 at 1:57 AM, Justin Spargur jmspar...@gmail.com wrote:
Hi all,
I'm playing around with
Hi,
I have used the Weka machine learning library to generate a model for my
training set, using the PART algorithm (decision lists) from Weka.
Now, I would like to use Spark ML to run the PART algorithm on my training set and
could not seem to find an equivalent. Could anyone point out the
That's because you need to add the master's public key (~/.ssh/id_rsa.pub)
to the newly added slave's ~/.ssh/authorized_keys.
I add slaves this way:
- Launch a new instance by clicking on the slave instance and choosing "launch
more like this".
- Once it's launched, ssh into it and add the master
Hi Matei,
thank you for answering.
According to what you said, am I mistaken when I say that tuples with the
same key might eventually be spread across more than one node in case an
overloaded worker can no longer accept tuples?
In other words, suppose a worker (processing key K) cannot accept
Hello,
*(Before everything : I use IntellijIdea 14.0.1, SBT and Scala 2.11.6)*
This morning, I was trying to resolve the "Failed to locate the winutils
binary in the hadoop binary path" error.
I noticed that I can solve it by configuring my build.sbt to
...
libraryDependencies +=
Hi,
I am running my GraphX application with Spark 1.3.1 on a small cluster. It
failed with this exception:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 1
But actually I found it is caused by “ExecutorLostFailure” indeed, and someone
told
It worked.
On Thu, Jun 4, 2015 at 5:14 PM, MEETHU MATHEW meethu2...@yahoo.co.in
wrote:
Try using coalesce
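For instance, a minimal sketch (assuming `rdd` is the RDD being saved; the target partition count and path are illustrative), so far fewer output files are written:

rdd.coalesce(200).saveAsTextFile("hdfs:///output/path")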
Thanks Regards,
Meethu M
On Wednesday, 3 June 2015 11:26 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
wrote:
I am running a series of spark functions with 9000 executors and its
Try using coalesce.
Thanks Regards,
Meethu M
On Wednesday, 3 June 2015 11:26 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
wrote:
I am running a series of Spark functions with 9000 executors and it's resulting
in 9000+ files, which is exceeding the namespace file count quota.
How can Spark
Shixiong,
Thanks, interesting point. So if we want to only process one batch then
terminate the consumer, what's the best way to achieve that? Presumably the
listener could set a flag on the driver notifying it that it can terminate.
But the driver is not in a loop, it's basically blocked in
Hi Iulian,
On 26 May 2015, at 13:04, Iulian Dragoș iulian.dra...@typesafe.com wrote:
On Tue, May 26, 2015 at 10:09 AM, algermissen1971
algermissen1...@icloud.com wrote:
Hi,
I am setting up a project that requires Kafka support and I wonder what the
roadmap is for Scala 2.11 Support
Hi Tariq
You need to handle the transaction semantics yourself. You could for
example save from the dataframe to a staging table and then write to the
final table using a single atomic INSERT INTO finalTable from
stagingTable call. Remember to clear the staging table first to recover
from
set the storage policy for the DStream RDDs to MEMORY_AND_DISK - it
appears the storage level can be specified in the createStream methods but
not createDirectStream...
On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov evo.efti...@isecc.com wrote:
You can also try Dynamic Resource Allocation
or you could
1) convert the dataframe to an RDD
2) use mapPartitions and zipWithIndex within each partition
3) convert the RDD back to a dataframe; you will need to make sure you preserve
the partitioning
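A rough sketch of those steps, assuming the goal is to attach a row index to each row (the data and column names are illustrative; RDD.zipWithIndex is used here since it handles the per-partition offsets):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (3, "c"))).toDF("id", "value")

val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)  // append the index to each row
}
val newSchema = StructType(df.schema.fields :+ StructField("rowIndex", LongType, nullable = false))
val indexedDf = sqlContext.createDataFrame(indexedRdd, newSchema)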
Deenar
On 1 June 2015 at 02:23, ayan guha guha.a...@gmail.com wrote:
If you are on spark 1.3, use
Hi Holden, Olivier
So for column you need to pass in a Java function, I have some sample
code which does this but it does terrible things to access Spark internals.
I also need to call a Hive UDAF in a dataframe agg function. Are there any
examples of what Column expects?
Deenar
On 2 June 2015
Hi,
We have a load of Avro data coming into our data systems in the form of
relatively small files, which we're merging into larger Parquet files with
Spark. I've been following the docs and the approach I'm taking seemed
fairly obvious, and pleasingly simple, but I'm wondering if perhaps it's
The direct stream isn't a receiver; it isn't required to cache data anywhere
unless you want it to.
If you want it, just call cache.
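For example (a minimal sketch, assuming `directStream` is the DStream returned by KafkaUtils.createDirectStream):

import org.apache.spark.storage.StorageLevel

directStream.persist(StorageLevel.MEMORY_AND_DISK)  // cache the stream's RDDs explicitly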
On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
set the storage policy for the DStream RDDs to MEMORY_AND_DISK - it
appears the
For your first question, please take a look at HADOOP-9922.
The fix is in the hadoop-common module.
Cheers
On Thu, Jun 4, 2015 at 2:53 AM, Jean-Charles RISCH
risch.jeanchar...@gmail.com wrote:
Hello,
*(Before everything : I use IntellijIdea 14.0.1, SBT and Scala 2.11.6)*
This morning, I was