Hi ,
You can use the FileUtil.copyMerge API and specify the path to the folder where
saveAsTextFile saves the part text files.
Suppose your directory is /a/b/c/
use FileUtil.copyMerge(FileSystem of source, /a/b/c, FileSystem of
destination, path to the merged file, say /a/b/c.txt, true (to delete
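A minimal sketch of the call, assuming Hadoop 2.x and the illustrative paths above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
// Merge the part-* files under /a/b/c into a single /a/b/c.txt,
// deleting the source directory when done.
FileUtil.copyMerge(fs, new Path("/a/b/c"), fs, new Path("/a/b/c.txt"), true, hadoopConf, null)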
1) Make sure your beeline client is connected to the HiveServer2 of Spark SQL.
You can find the execution logs of HiveServer2 in the environment
where start-thriftserver.sh was started.
2) What is the scale of your data? If you cache a small dataset, it will take
more time to schedule the workload across different executors.
Look
Hi
Thank you ☺ Akhil, it worked like a charm…
I used the file writer outside rdd.foreach, which might be the reason for the
non-serializable exception…
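A minimal sketch of the pattern (the output path is illustrative): creating the writer inside foreachPartition keeps it out of the serialized closure, so it never has to travel from the driver to the executors.

rdd.foreachPartition { records =>
  // one file per partition; the path and naming are illustrative only
  val file = new java.io.File(s"/tmp/out-${java.util.UUID.randomUUID}.txt")
  val writer = new java.io.PrintWriter(file)
  try records.foreach(r => writer.println(r))
  finally writer.close()
}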
Thanks Regards
Jishnu Menath Prathap
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Friday, November 21, 2014 1:15 PM
To: Jishnu Menath
Hi Simon,
no, I don't need to run the tasks on multiple machines for now.
I will therefore stick to Makefile + shell or Java programs as Spark appears
not to be the right tool for the tasks I am trying to accomplish.
Thank you for your input.
Philippe
- Original Message -
From: Simon
You are probably casually sending UIMA objects from the driver to
executors in a closure. You'll have to design your program so that you
do not need to ship these objects to or from the remote task workers.
On Fri, Nov 21, 2014 at 8:39 AM, jatinpreet jatinpr...@gmail.com wrote:
Hi,
I am
Is there any way to get the yarn application_id inside the program?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-applicationId-for-yarn-mode-both-yarn-client-and-yarn-cluster-mode-tp19462.html
Sent from the Apache Spark User List mailing
Hi all.
Here are two code snippets.
And they will produce the same result.
1.
rdd.map( function )
2.
rdd.map( function1 ).map( function2 ).map( function3 )
What are the pros and cons of these two methods?
Regards
Kevin
--
View this message in context:
Hi,
I'm going to debug some Spark applications on our testing platform, and
it would be helpful if we could see the eventLog on the *worker* node.
I've tried turning on *spark.eventLog.enabled* and setting
*spark.eventLog.dir* on the worker node. However, it doesn't
work.
I
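A guess rather than a confirmed answer: the event log is written by the driver application, so these properties are usually set where the job is submitted (spark-defaults.conf or the SparkConf), not on the workers. A minimal sketch with an illustrative log directory:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///tmp/spark-events")  // illustrative directory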
You can get parameters such as spark.executor.memory, but you cannot get the
executor or core counts,
because executors and cores are parameters of the Spark deploy environment, not of
the Spark context.
val conf = new SparkConf().set("spark.executor.memory", "2G")
val sc = new SparkContext(conf)
Hi,
Every time I start the spark-shell I encounter this message:
14/11/18 00:27:43 WARN hdfs.BlockReaderLocal: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
Any idea how to overcome it?
The short-circuit feature is a big performance boost I don't want to
I suppose that here function(x) = function3(function2(function1(x)))
In that case, the difference will be modularity and readability of your
program.
If function{1,2,3} are logically different steps and potentially reusable
somewhere else, I'd keep them separate.
A sequence of map
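A minimal sketch of the equivalence (assuming function(x) = function3(function2(function1(x))); consecutive map calls are pipelined within one stage, so neither form adds extra passes or shuffles):

val composed = rdd.map(x => function3(function2(function1(x))))
val chained  = rdd.map(function1).map(function2).map(function3)
// Both produce the same result; the chained form is just more modular and readable.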
Finally, I've found two ways:
1. search the output for something like Submitted application
application_1416319392519_0115
2. use a specific AppName. We could then query YARN for the ApplicationId
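A hedged sketch of option 2 using the YARN client API (the application name is illustrative):

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

val yarn = YarnClient.createYarnClient()
yarn.init(new YarnConfiguration())
yarn.start()
// Look up the application by the name passed to SparkConf.setAppName
val appId = yarn.getApplications().asScala
  .find(_.getName == "my-streaming-app")  // illustrative app name
  .map(_.getApplicationId)
yarn.stop()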
--
View this message in context:
This sounds more like a use case for reduce? or fold? it sounds like
you're kind of cobbling together the same function on accumulators,
when reduce/fold are simpler and have the behavior you suggest.
On Fri, Nov 21, 2014 at 5:46 AM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
I think I
Hi,
I got an error during rdd.registerTempTable(...) saying scala.MatchError:
scala.BigInt
Looks like BigInt cannot be used in SchemaRDD, is that correct?
So what would you recommend to deal with it?
Thanks,
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github Blog:
Looks like metrics are not a hot topic to discuss - yet so important to
sleep well when jobs are running in production.
I've created Spark-4537 https://issues.apache.org/jira/browse/SPARK-4537
to track this issue.
-kr, Gerard.
On Thu, Nov 20, 2014 at 9:25 PM, Gerard Maas gerard.m...@gmail.com
It makes sense.
Thx. =)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-ALS-class-serializable-tp19262p19472.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Dear all,
We encountered problems with failed jobs on huge amounts of data.
A simple local test was prepared for this question at
https://gist.github.com/copy-of-rezo/6a137e13a1e4f841e7eb
It generates 2 sets of key-value pairs, joins them, selects the distinct values,
and finally counts the data.
object
Hi,
You can also set the cores in the Spark application itself.
http://spark.apache.org/docs/1.0.1/spark-standalone.html
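A minimal sketch for a standalone cluster (the core count is illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.cores.max", "4")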
On Wed, Nov 19, 2014 at 6:11 AM, Pat Ferrel-2 [via Apache Spark User List]
ml-node+s1001560n19238...@n3.nabble.com wrote:
OK hacking the start-slave.sh did it
On Nov
Hi,
Spark in local mode runs more slowly than on a cluster. Cluster machines
usually have higher specifications, and the tasks are distributed across the
workers in order to get a faster result. So you will always find a
difference in speed between running locally and running on a cluster. Try
running
Hi,
Parallel processing of XML files can be an issue because of the tags in the XML
file. The XML file has to be intact, since the parser matches start
and end entities; if the file is distributed in parts to the workers, a worker may
or may not find matching start and end tags within its part, which will
Thanks for the pointer Michael.
I've downloaded spark 1.2.0 from
https://people.apache.org/~pwendell/spark-1.2.0-snapshot1/ and cloned and
built the spark-avro repo you linked to.
When I run it against the example avro file linked to in the documentation
it works. However, when I try to load my
I've been able to load a different avro file based on GenericRecord with:
val person = sqlContext.avroFile("/tmp/person.avro")
When I try to call `first()` on it, I get NotSerializableException
exceptions again:
person.first()
...
14/11/21 12:59:17 ERROR Executor: Exception in task 0.0 in stage
Hi!
Sure, I'll post the info I grabbed once the cassandra tables values appear
in Tableau.
Best,
Jerome
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282p19480.html
Sent from the Apache Spark User List mailing list
Hi,
Is there any way I can set up a remote HDFS for Spark (more specifically,
for Spark Streaming checkpoints)? The reason I'm asking is that our Spark
and HDFS do not run on the same machines. I've looked around but still have
no clue so far.
Thanks,
EH
--
View this message in context:
Hi naveen,
I don't think this is possible. If you are setting the master with your
cluster details, you cannot execute any job from your local machine. You
have to execute the jobs inside your YARN machine so that SparkConf is able
to connect with all the provided details.
If this is not the case
Unfortunately whether it is possible to have both Spark and HDFS running on
the same machine is not under our control. :( Right now we have Spark and
HDFS running in different machines. In this case, is it still possible to
hook up a remote HDFS with Spark so that we can use Spark Streaming
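In principle the checkpoint directory can be any HDFS URI reachable from the driver and executors; a minimal sketch (the namenode address and path are illustrative, and conf is an existing SparkConf):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs://remote-namenode.example.com:8020/spark/checkpoints")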
I have seen the same behaviour while testing the latest spark 1.2.0
snapshot.
I'm trying the ReliableKafkaReceiver and it works quite well but the
checkpoints folder is always increasing in size. The receivedMetaData
folder remains almost constant in size but the receivedData folder is
always
Hello Jianshi,
The reason for that error is that we do not have a Spark SQL data type for
Scala BigInt. You can use Decimal for your case.
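A minimal sketch of the conversion (the case class and field names are illustrative; rdd and sqlContext are assumed to be in scope): map scala.BigInt values to BigDecimal before building the SchemaRDD, since Spark SQL maps BigDecimal to its Decimal type.

case class Record(id: BigDecimal, name: String)
val converted = rdd.map { case (id: BigInt, name: String) => Record(BigDecimal(id), name) }
val schemaRdd = sqlContext.createSchemaRDD(converted)
schemaRdd.registerTempTable("records")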
Thanks,
Yin
On Fri, Nov 21, 2014 at 5:11 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I got an error during rdd.registerTempTable(...) saying
I have a job that searches for input recursively and creates a string of
pathnames to treat as one input.
The files are part-x files and they are fairly small. The job seems to take
a long time to complete considering the size of the total data (150m) and only
runs on the master machine.
I have also been struggling with reading avro. Very glad to hear that there
is a new avro library coming in Spark 1.2 (which by the way, seems to have
a lot of other very useful improvements).
In the meanwhile, I have been able to piece together several snippets/tips
that I found from various
Great - what can we do to make this happen? So should I file a JIRA to
track?
Thanks,
Arun
On Tue, Nov 18, 2014 at 11:46 AM, Andrew Ash and...@andrewash.com wrote:
I can see this being valuable for users wanting to live on the cutting
edge without building CI infrastructure themselves,
We've done this with reduce - that definitely works.
I've reworked the logic to use accumulators because, when it works, it's
5-10x faster
On Fri, Nov 21, 2014 at 4:44 AM, Sean Owen so...@cloudera.com wrote:
This sounds more like a use case for reduce? or fold? it sounds like
you're kind of
I'm running a Python script with spark-submit on top of YARN on an EMR
cluster with 30 nodes. The script reads in approximately 3.9 TB of data
from S3, and then does some transformations and filtering, followed by some
aggregate counts. During Stage 2 of the job, everything looks to complete
Great - posted here https://issues.apache.org/jira/browse/SPARK-4542
On Fri, Nov 21, 2014 at 1:03 PM, Andrew Ash and...@andrewash.com wrote:
Yes you should file a Jira and echo it out here so others can follow and
comment on it. Thanks Arun!
On Fri, Nov 21, 2014 at 12:02 PM, Arun Ahuja
Hi all,
I put some log files into sql tables through Spark and my schema looks like
this:
|-- timestamp: timestamp (nullable = true)
|-- c_ip: string (nullable = true)
|-- cs_username: string (nullable = true)
|-- s_ip: string (nullable = true)
|-- s_port: string (nullable = true)
|--
Hi Brett,
Are you noticing executors dying? Are you able to check the YARN
NodeManager logs and see whether YARN is killing them for exceeding memory
limits?
-Sandy
On Fri, Nov 21, 2014 at 9:47 AM, Brett Meyer brett.me...@crowdstrike.com
wrote:
I’m running a Python script with spark-submit
Unfortunately, unless you impose restrictions on the XML file (e.g., where
namespaces are declared, whether entity replacement is used, etc.), you
really can't parse only a piece of it even if you have start/end elements
grouped together. If you want to deal effectively (and scalably) with
large
Hi Tobias,
One way to find out the number of executors is through
SparkContext#getExecutorMemoryStatus. You can find out the number of cores per
executor by asking the SparkConf for the spark.executor.cores property, which, if not
set, means 1 for YARN.
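A minimal sketch of both lookups (assuming a live SparkContext named sc):

// getExecutorMemoryStatus has one entry per executor plus one for the driver.
val numExecutors = sc.getExecutorMemoryStatus.size - 1
// Cores per executor come from the conf; YARN defaults to 1 when unset.
val coresPerExecutor = sc.getConf.getInt("spark.executor.cores", 1)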
-Sandy
On Fri, Nov 21, 2014 at 1:30 AM, Yanbo Liang
Actually, it's a real
On Tue Nov 18 2014 at 2:52:00 AM Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
see https://www.mail-archive.com/dev@spark.apache.org/msg03520.html for
one solution.
One issue with those XML files is that they cannot be processed line by
line in parallel; plus you
I am sorry the last line in the code is
file1Rdd.join(file2RddGrp.mapValues(names =>
names.toSet)).collect().foreach(println)
so
My Code
===
val file1Rdd = sc.textFile("/Users/sansub01/mycode/data/songs/names.txt").map(x => (x.split(",")(0), x.split(",")(1)))
val file2Rdd =
(sorry about the previous spam... Google Inbox didn't allow me to cancel
the miserable send action :-/)
So what I was about to say: it's a real PAIN in the ass to parse the
Wikipedia articles in the dump due to these multiline articles...
However, there is a way to manage that quite easily,
Quick update:
It is a filter job that creates the error above, not the reduceByKey.
Why would a filter cause an out-of-memory error?
Here is my code
val inputgsup = "hdfs://" + sparkmasterip + "/user/sense/datasets/gsup/binary/30/2014/11/0[1-9]/part*";
val gsupfile =
According to the web UI I don't see any executors dying during Stage 2. I
looked over the YARN logs and didn't see anything suspicious, but I may not
have been looking closely enough. Stage 2 seems to complete just fine, it's
just when it enters Stage 3 that the results from the previous stage
Thanks, Jerome.
BTW, have you tried the CalliopeServer2 from tuplejump? I was able to quickly
connect from beeline/Squirrel to my Cassandra cluster using CalliopeServer2,
which extends the Spark SQL Thrift Server. It was very straightforward.
Next step is to connect from Tableau, but I can't find
Hi Sameer,
You can try increasing the number of executor-cores.
-Jayant
On Fri, Nov 21, 2014 at 11:18 AM, Sameer Tilak ssti...@live.com wrote:
Hi All,
I have been using MLLib's linear regression and I have some question
regarding the performance. We have a cluster of 10 nodes -- each
Hi Sameer,
You can also use repartition to create a higher number of tasks.
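A minimal sketch (the RDD name and partition count are illustrative, not a recommendation):

// repartition reshuffles the data into the requested number of partitions,
// which becomes the number of tasks for the stages that follow.
val moreTasks = trainingData.repartition(200)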
-Jayant
On Fri, Nov 21, 2014 at 12:02 PM, Jayant Shekhar jay...@cloudera.com
wrote:
Hi Sameer,
You can try increasing the number of executor-cores.
-Jayant
On Fri, Nov 21, 2014 at 11:18 AM, Sameer Tilak
I want to run queries on Apache Phoenix which has a JDBC driver. The query
that I want to run is:
select ts,ename from random_data_date limit 10
But I'm having issues with the JdbcRDD upper and lowerBound parameters
(that I don't actually understand).
Here's what I have so far:
import
Hi Alaa Ali,
In order for Spark to split the JDBC query in parallel, it expects an upper
and lower bound for your input data, as well as a number of partitions so
that it can split the query across multiple tasks.
For example, depending on your data distribution, you could set an upper
and lower
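A hedged sketch of how those parameters are usually wired up (the JDBC URL, split column, and bounds are illustrative, and sc is an existing SparkContext). The query must contain two '?' placeholders, which each partition fills with its own lower and upper bound:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:phoenix:localhost"),   // illustrative URL
  "SELECT ts, ename FROM random_data_date WHERE id >= ? AND id <= ?",
  lowerBound = 1L,
  upperBound = 1000000L,
  numPartitions = 10,
  mapRow = (rs: ResultSet) => (rs.getTimestamp("ts"), rs.getString("ename"))
)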
I tried using RDD#mapPartitions but my job completes prematurely and
without error as if nothing gets done. What I have is fairly simple
sc
.textFile(inputFile)
.map(parser.parse)
.mapPartitions(bulkLoad)
But the Iterator[T] of mapPartitions
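One possible culprit (a guess, not confirmed here): mapPartitions is lazy, so if bulkLoad only performs side effects and nothing consumes the returned iterator, the work never runs. A sketch of the side-effecting variant, reusing the poster's names:

sc.textFile(inputFile)
  .map(parser.parse)
  .foreachPartition { records =>
    // foreachPartition is an action, so this runs eagerly;
    // assumes bulkLoad writes each record as it iterates.
    bulkLoad(records)
  }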
I have a Spark Java application that I run in local mode. As such it runs
without any issues.
Now, I would like to run it as a webservice from Tomcat. The first issue I
had with this was that the spark-assembly jar contains javax.servlet, which
Tomcat does not allow. Therefore I removed
Awesome, thanks Josh, I missed that previous post of yours! But your code
snippet shows a select statement, so what I can do is just run a simple
select with a where clause if I want to, and then run my data processing on
the RDD to mimic the aggregation I want to do with SQL, right? Also,
another
Ali,
just create a BIGINT column with numeric values in Phoenix and use sequences
(http://phoenix.apache.org/sequences.html) to populate it automatically.
I included the setup below in case someone starts from scratch.
Prerequisites:
- export JAVA_HOME, SCALA_HOME and install sbt
- install hbase in
Hello, I am trying to read a Kafka stream into a text file by running Spark from
my IDE (IntelliJ IDEA). The code is similar to a previous thread on
persisting a stream to a text file.
I am new to Spark and Scala. I believe Spark is in local mode, as the
console shows
14/11/21 14:17:11 INFO
use the right email list.
-- Forwarded message --
From: Joanne Contact joannenetw...@gmail.com
Date: Fri, Nov 21, 2014 at 2:32 PM
Subject: Persist kafka streams to text file
To: u...@spark.incubator.apache.org
Hello I am trying to read kafka stream to a text file by running spark
I would expect a SQL query on c to fail because c would not be known in
the schema of the older Parquet file.
What I'd be very interested in is how to add a new column as an incremental
new Parquet file, and be able to somehow join the existing and new files in
an efficient way. I.e., somehow
Hi Nathan,
It sounds like what you're asking for has already been filed as
https://issues.apache.org/jira/browse/SPARK-664 Does that ticket match
what you're proposing?
Andrew
On Fri, Nov 21, 2014 at 12:29 PM, Nathan Kronenfeld
nkronenf...@oculusinfo.com wrote:
We've done this with reduce -
Thanks for the info Andy. A big help.
One thing - I think you can figure out which document is responsible for which
vector without checking in more code.
Start with a PairRDD of [doc_id, doc_string] for each document and split that
into one RDD for each column.
The values in the doc_string RDD
Yeah, I initially used zip but I was wondering how reliable it is. I mean,
is the order guaranteed? What if some node fails, and the data is pulled
from different nodes?
And even if it can work, I find this implicit semantics quite
uncomfortable, don't you?
My 0.2c
On Fri, Nov 21, 2014 at 15:26,
Is there a book on SparkR for the absolutely terrified beginner?
I use R for my daily analysis and I am interested in a detailed guide to
using SparkR for data analytics, like a book or online tutorials. If there's
any, please direct me to it.
Thanks,
Daniel
--
View this message in
Ah yes. I found it too in the manual. Thanks for the help anyway!
Since BigDecimal is just a wrapper around BigInt, let's also convert
BigInt to Decimal.
I created a ticket. https://issues.apache.org/jira/browse/SPARK-4549
Jianshi
On Fri, Nov 21, 2014 at 11:30 PM, Yin Huai
When I submit a Spark Streaming job, I see these INFO logs printing
frequently:
14/11/21 18:53:17 INFO DAGScheduler: waiting: Set(Stage 216)
14/11/21 18:53:17 INFO DAGScheduler: failed: Set()
14/11/21 18:53:17 INFO DAGScheduler: Missing parents for Stage 216: List()
14/11/21 18:53:17 INFO
Hi Daniel,
Thanks for your email! We don't have a book (yet?) specifically on SparkR,
but here's a list of helpful tutorials / links you can check out (I am
listing them in roughly basic - advanced order):
- AMPCamp5 SparkR exercises
http://ampcamp.berkeley.edu/5/exercises/sparkr.html. This
Does bulkLoad hold the connection to MongoDB?
On Fri, Nov 21, 2014 at 4:34 PM, Benny Thompson ben.d.tho...@gmail.com
wrote:
I tried using RDD#mapPartitions but my job completes prematurely and
without error as if nothing gets done. What I have is fairly simple
sc
Hello Experts,
I have 5 worker machines with different amounts of RAM. Is there a way to
configure them with different executor memory?
Currently I see that every worker spins up 1 executor with the same amount of
memory.
Thanks Regards
Tridib
--
View this message in context:
I'm not sure if it's an exact match, or just very close :-)
I don't think our problem is the workload on the driver, I think it's just
memory - so while the solution proposed there would work, it would also be
sufficient for our purposes, I believe, simply to clear each block as soon
as it's added
After taking today's build from the master branch I started getting this error
when running spark-sql:
Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
I used the following command for building:
./make-distribution.sh --tgz -Pyarn -Dyarn.version=2.4.0 -Phadoop-2.4
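A guess, not confirmed in the thread: the datanucleus jars are only bundled into the distribution when Spark is built with the Hive profiles, so a build command along these lines may help:

./make-distribution.sh --tgz -Pyarn -Dyarn.version=2.4.0 -Phadoop-2.4 -Phive -Phive-thriftserver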
Hi,
The Thrift server is failing to start for me on the latest Spark 1.2 branch.
I got the error below when I start the Thrift server.
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/base/Preconditions
at org.apache.hadoop.conf.Configuration$DeprecationDelta.<init>(Configur
I have seen similar posts on this issue but could not find a solution.
Apologies if this has been discussed here before.
I am running a Spark streaming job with YARN on a 5-node cluster. I am
using the following command to submit my streaming job.
spark-submit --class class_name --master yarn-cluster
Hi Judy, could you please provide the commit SHA1 of the version you're
using? Thanks!
On 11/22/14 11:05 AM, Judy Nash wrote:
Hi,
Thrift server is failing to start for me on latest spark 1.2 branch.
I got the error below when I start thrift server.
Exception in thread main