When I run a "cache table ... as" statement in Beeline, which communicates with the
Thrift server, I got the following error:
14/12/19 15:57:05 ERROR ql.Driver: FAILED: ParseException line 1:0 cannot
recognize input near 'cache' 'table' 'jeanlyntest'
org.apache.hadoop.hive.ql.parse.ParseException: line 1:0
Hi guys,
I recently ran Spark on YARN and found that Spark didn't set any log4j properties
file in configuration or code, and the log4j output was being written to the stderr
file under ${yarn.nodemanager.log-dirs}/application_${appid}.
I want to know which side (Spark or Hadoop) controls the appender. Have
It seems that the Thrift server you connected to is the original
HiveServer2 rather than Spark SQL HiveThriftServer2.
On 12/19/14 4:08 PM, jeanlyn92 wrote:
when I run "cache table ... as" in Beeline, which communicates
with the Thrift server, I got the following error:
14/12/19 15:57:05 ERROR
Wow. Nice to hear that :) Keep learning.
On 2014-12-19 15:10, Matei Zaharia wrote:
Yup, as he posted before, an Apache infrastructure issue prevented me
from pushing this last night. The issue was resolved today and I should be able
to push the final release artifacts tonight. On Dec 18, 2014,
I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is
the third release on the API-compatible 1.X line. It is Spark's
largest release ever, with contributions from 172 developers and more
than 1,000 commits!
This release brings operational and performance improvements in Spark
That's not true in yarn-cluster mode, where the driver runs in a
container that YARN creates, which may not be on the machine that runs
spark-submit.
As far as I know, however, you can't control where YARN allocates
that, and shouldn't need to.
You can probably query YARN to find where it did
coalesce actually changes the number of partitions. Unless the
original RDD had just 1 partition, coalesce(1) will make an RDD with 1
partition that is larger than the original partitions, of course.
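A minimal sketch of that behavior (the data and partition counts are illustrative, not from the original thread):

  val rdd = sc.parallelize(1 to 1000, 8)   // 8 partitions
  rdd.partitions.length                     // 8
  val single = rdd.coalesce(1)              // one larger partition holding all the data
  single.partitions.length                  // 1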
I don't think the question is about ordering of things within an
element of the RDD?
If the
Congrats!
A little question about this release: Which commit is this release based
on? v1.2.0 and v1.2.0-rc2 are pointed to different commits in
https://github.com/apache/spark/releases
Best Regards,
Shixiong Zhu
2014-12-19 16:52 GMT+08:00 Patrick Wendell pwend...@gmail.com:
I'm happy to
Hi,
Say we have an operation that writes something to an external resource and
gets some output. For example:

  def doSomething(entry: SomeEntry, session: Session): SomeOutput = {
    val result = session.SomeOp(entry)
    SomeOutput(entry.Key, result.SomeProp)
  }
I could use a transformation for
Tag 1.2.0 is older than 1.2.0-rc2. I wonder if it just didn't get
updated. I assume it's going to be 1.2.0-rc2 plus a few commits
related to the release process.
On Fri, Dec 19, 2014 at 9:50 AM, Shixiong Zhu zsxw...@gmail.com wrote:
Congrats!
A little question about this release: Which commit
You've got Kerberos enabled, and it's complaining that YARN doesn't
like the Kerberos config. Have you verified this should be otherwise
working, sans Spark?
On Fri, Dec 19, 2014 at 3:50 AM, maven niranja...@gmail.com wrote:
All,
I just built Spark-1.2 on my enterprise server (which has
Hi,
I've just seen that Spark Streaming supports Python as of version 1.2.
Question: does Spark Streaming (the Python version) support Kafka integration?
Thanks
Oleg.
The following ticket:
https://issues.apache.org/jira/browse/SPARK-1812
for supporting Scala 2.11 has been marked as fixed in 1.2,
but the docs on the Spark site still say that 2.10 is required.
Thanks,
Jon
Check out the 'compiling for Scala 2.11' instructions:
http://spark.apache.org/docs/1.2.0/building-spark.html#building-for-scala-211
-kr, Gerard.
On Fri, Dec 19, 2014 at 12:00 PM, Jonathan Chayat jonatha...@supersonic.com
wrote:
The following ticket:
You might interpret that as 2.10+. Although 2.10 is still the main
version in use, I think, you can see 2.11 artifacts have been
published:
http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-core_2.11%7C1.2.0%7Cjar
On Fri, Dec 19, 2014 at 11:00 AM, Jonathan Chayat
To really be correct, I think you may have to use the foreach action
to persist your data, since this isn't idempotent, and then read it
again in a new RDD. You might get away with map as long as you can
ensure that your write process is idempotent.
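A minimal sketch of that distinction, reusing the hypothetical SomeEntry / Session / SomeOutput names from the question (entries is a placeholder RDD[SomeEntry], and Session.open() is assumed, not a real API):

  // Non-idempotent write: use an action such as foreach, then re-read the
  // written results as a new RDD if you need them downstream.
  entries.foreach { entry =>
    val session = Session.open()
    session.SomeOp(entry)
    session.close()
  }

  // Idempotent write: a map transformation is acceptable, even though failed
  // tasks may be retried and repeat the write.
  val outputs = entries.map { entry =>
    val session = Session.open()
    val result = session.SomeOp(entry)
    session.close()
    SomeOutput(entry.Key, result.SomeProp)
  }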
On Fri, Dec 19, 2014 at 10:57 AM, ashic
Hi experts!
What is an efficient way to read all files from a directory and its
sub-directories using Spark? Currently I move all files from the directory and its
sub-directories into another temporary directory and then read them all
using the sc.textFile method. But I want a method so that moving to
Thanks Sean. That's kind of what I figured. Luckily, for my use case writes are
idempotent, so map works.
From: so...@cloudera.com
Date: Fri, 19 Dec 2014 11:06:51 +
Subject: Re: How to run an action and get output?
To: as...@live.com
CC: user@spark.apache.org
To really be correct, I
How about using the HDFS API to create a list of all the directories
to read from, and passing them as a comma-joined string to
sc.textFile?
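A minimal sketch of that suggestion using the Hadoop FileSystem API (the root path and the helper are illustrative):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(sc.hadoopConfiguration)

  // Recursively collect every directory that directly contains files.
  def dirsWithFiles(dir: Path): Seq[Path] = {
    val status = fs.listStatus(dir)
    val here = if (status.exists(_.isFile)) Seq(dir) else Seq.empty[Path]
    here ++ status.filter(_.isDirectory).map(_.getPath).flatMap(dirsWithFiles)
  }

  val data = sc.textFile(dirsWithFiles(new Path("/data/root")).mkString(","))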
On Fri, Dec 19, 2014 at 11:13 AM, Hafiz Mujadid
hafizmujadi...@gmail.com wrote:
Hi experts!
what is efficient way to read all files using spark from
Hi Guys,
Are Scala lazy values instantiated once per executor, or once per partition?
For example, if I have:

  object Something {
    lazy val context = create()
    def foo(item) = context.doSomething(item)
  }
and I do
someRdd.foreach(Something.foo)
then will context get instantiated once per
Hi,
You can use Hadoop's FileInputFormat API and Spark's newAPIHadoopFile to
get recursion. For more on the topic you can refer to
http://stackoverflow.com/questions/8114579/using-fileinputformat-addinputpaths-to-recursively-add-hdfs-path
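A minimal sketch of that approach (the recursive-input flag and the root path are assumptions based on the Hadoop FileInputFormat API, so verify them against your Hadoop version):

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  // Ask FileInputFormat to descend into sub-directories.
  sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

  val lines = sc.newAPIHadoopFile(
    "/data/root",                 // root directory, placeholder
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text]
  ).map(_._2.toString)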
On Fri, Dec 19, 2014 at 4:50 PM, Sean Owen
I recently had the same problem. I'm not an expert but will suggest that you
concatenate your files into a smaller number of larger files, e.g. in Linux:
cat files > a_larger_file. This helped greatly.
Likely others better qualified will weigh in on this later but that's
something to get you
On hdfs I created:
/one/one.txt # contains text one
/one/two/two.txt # contains text two
Then:
val data = sc.textFile("/one/*")
data.collect
This returned:
Array(one, two)
So the above path designation appears to automatically recurse for you.
Thanks bethesda!
But if we have a structure like this:
a/b/a.txt
a/c/c.txt
a/d/e/e.txt
then how can we handle this case?
A val in an object should be instantiated once per JVM (really,
ClassLoader, but probably won't make a difference here). Therefore I
expect it is going to live effectively as long as the executor, across
partitions but also across jobs.
On Fri, Dec 19, 2014 at 11:21 AM, Ashic Mahtab
It will be instantiated once per VM, which translates to once per executor.
-kr, Gerard.
On Fri, Dec 19, 2014 at 12:21 PM, Ashic Mahtab as...@live.com wrote:
Hi Guys,
Are scala lazy values instantiated once per executor, or once per
partition? For example, if I have:
object Something =
Hi,
I just saw that an Edge RDD is "300% Fraction Cached" in the Storage WebUI. What does
that mean? I could understand if the value were under 100%…
Thanks.
Best,
Yifan LI
Hi all,
I know the topic has been discussed before, but I couldn't find an answer
that suits me.
How do you retrieve the current batch timestamp in Spark Streaming? Maybe
via BatchInfo, but it does not seem to be linked to the streaming context or
anything else... I currently have a 1-minute micro-batch
Most of the methods of DStream will let you supply a function that
receives a timestamp as an argument of type Time. For example, we have
def foreachRDD(foreachFunc: RDD[T] => Unit)
but also
def foreachRDD(foreachFunc: (RDD[T], Time) => Unit)
If you supply the latter, you will get the timestamp
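A minimal sketch of the second form (dstream is a placeholder DStream[String]; the body is illustrative):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.Time

  dstream.foreachRDD { (rdd: RDD[String], time: Time) =>
    // time.milliseconds is the timestamp of this micro-batch
    println(s"batch at ${time.milliseconds} contains ${rdd.count()} records")
  }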
I have a job that runs fine on relatively small input datasets but then
reaches a threshold where I begin to consistently get Fetch failure for
the Failure Reason, late in the job, during a saveAsText() operation.
The first error we are seeing on the Details for Stage page is
ExecutorLostFailure
Hi,
Thanks for the answer.
regarding 2,3: it's indeed the solution, but as I mentioned in my question, I
can just as well do input checks (using .map) before applying any other RDD
operations. I still think that it's overhead.
Regarding 1, this will make all the other RDD operations more complex, as I
There is a Jira on this. According to the comment there, it means that a block
is cached in more than one location. I don't know why this would happen (I used
1x replication when I saw this). Curious if someone has a more in-depth
explanation.
Sent on the new Sprint Network from my Samsung Galaxy
Hi,
According to the Spark documentation, data sharing between two different
Spark contexts is not possible.
So I just wonder if it is possible to first run a job that loads some data
from DB into Schema RDDs, then cache it and next register it as a temp
table (let's say Table_1), now I would
I'm getting the same error (ExecutorLostFailure) - input RDD is 100k
small files (~2MB each). I do a simple map, then keyBy(), and then
rdd.saveAsHadoopDataset(...). Depending on the memory settings given to
spark-submit, the time before the first ExecutorLostFailure varies (more
memory ==
I’m using Spark 1.1.0 built for HDFS 2.4.
My application enables check-pointing (to HDFS 2.5.1) and it builds. But when I
run it, I get the error below:
Exception in thread main org.apache.hadoop.ipc.RemoteException: Server IPC
version 9 cannot communicate with client version 4
at
Just to confirm, once per VM means that it'll be the same instance across all
applications in a particular JVM instance (i.e. executor). So even if the spark
application is terminated, the instance will live on, correct? I think that's
what Sean said, and it seems logical.
From:
Hi Jon,
The fix for this is to increase spark.yarn.executor.memoryOverhead to something
greater than its default of 384.
This will increase the gap between the executor's heap size and what it requests
from YARN. It's required because JVMs take up some memory beyond their heap
size.
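A minimal sketch of setting this programmatically (the app name and values are illustrative; the same property can also be passed to spark-submit with --conf, and it must be set before the SparkContext is created):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("my-app")                                  // placeholder name
    .set("spark.executor.memory", "8g")                    // executor heap
    .set("spark.yarn.executor.memoryOverhead", "1024")     // MB requested from YARN beyond the heap
  val sc = new SparkContext(conf)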
-Sandy
Yes, but your error indicates that your application is actually using
Hadoop 1.x of some kind. Check your dependencies, especially
hadoop-client.
On Fri, Dec 19, 2014 at 2:11 PM, Haopu Wang hw...@qilinsoft.com wrote:
I’m using Spark 1.1.0 built for HDFS 2.4.
My application enables check-point
It seems there is hadoop 1 somewhere in the path.
On Fri, Dec 19, 2014, 21:24 Sean Owen so...@cloudera.com wrote:
Yes, but your error indicates that your application is actually using
Hadoop 1.x of some kind. Check your dependencies, especially
hadoop-client.
On Fri, Dec 19, 2014 at 2:11
I'm actually already running 1.1.1.
I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
luck. Still getting ExecutorLostFailure (executor lost).
On Fri, Dec 19, 2014 at 10:43 AM, Rafal Kwasny rafal.kwa...@gmail.com
wrote:
Hi,
Just upgrade to 1.1.1 - it was fixed some
Sean,
Thanks for your response. My MapReduce and Spark 1.0 (prepackaged in CDH5)
jobs are running fine. It's only Spark 1.2 jobs that I'm unable to run
NR
On Dec 19, 2014 5:03 AM, Sean Owen so...@cloudera.com wrote:
You've got Kerberos enabled, and it's complaining that YARN doesn't
like
Do you hit the same errors? Is it now saying your containers are exceeding
~10 GB?
On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase jon.ch...@gmail.com wrote:
I'm actually already running 1.1.1.
I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
luck. Still getting
Hmmm, I see this a lot (multiple times per second) in the stdout logs of my
application:
2014-12-19T16:12:35.748+: [GC (Allocation Failure) [ParNew:
286663K->12530K(306688K), 0.0074579 secs] 1470813K->1198034K(2063104K),
0.0075189 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
And finally
Thanks Aniket, clears a lot of confusion.
On Dec 14, 2014 7:11 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com
wrote:
The reason is because of the following code:
val numStreams = numShards
val kinesisStreams = (0 until numStreams).map { i =>
KinesisUtils.createStream(ssc, streamName,
Hi,
Sorry for repeating the same question, just wanted to clarify the issue :
Is it possible to expose a RDD (or SchemaRDD) to external components
(outside spark) so it can be queried over JDBC (my goal is not to place
the RDD back in a database but use this cached RDD to serve JDBC queries)
Yes, same problem.
On Fri, Dec 19, 2014 at 11:29 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:
Do you hit the same errors? Is it now saying your containers are exceeding
~10 GB?
On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase jon.ch...@gmail.com wrote:
I'm actually already running 1.1.1.
I
Yes you can, using HiveContext, a metastore and the thriftserver. The
metastore persists information about your SchemaRDD, and the HiveContext,
initialised with information on the metastore, can interact with the
metastore. The thriftserver provides JDBC connections using the metastore.
Using
Looking at:
http://search.maven.org/#browse%7C717101892
The dates of the jars were still from Dec 10th.
Was I looking at the wrong place?
Cheers
On Thu, Dec 18, 2014 at 11:10 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Yup, as he posted before, An Apache infrastructure issue prevented me
The dates of the jars were still from Dec 10th.
I figured that was because the jars were staged in Nexus on that date
(before the vote).
On Fri, Dec 19, 2014 at 12:16 PM, Ted Yu yuzhih...@gmail.com wrote:
Looking at:
http://search.maven.org/#browse%7C717101892
The dates of the jars were
Q: In Spark Streaming if your DStream transformation and output action take
longer than the batch duration will the system process the next batch in
another thread? Or will it just wait until the first batch’s RDD is
processed? In other words does it build up a queue of buffered RDDs
awaiting
I notice new methods such as JavaSparkContext makeRDD (with few useful
examples). It takes a Seq, and while there are ways to turn a list into a
Seq, I see nothing that uses an Iterable.
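One hedged way to bridge the gap (an assumption about the intent, going through the Scala SparkContext underneath a JavaSparkContext jsc; the data is illustrative):

  import scala.collection.JavaConverters._

  val javaIterable: java.lang.Iterable[Int] = java.util.Arrays.asList(1, 2, 3)
  // Convert the Java Iterable to a Scala Seq, then hand it to makeRDD.
  val rdd = jsc.sc.makeRDD(javaIterable.asScala.toSeq)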
Batches will wait for the previous batch to finish. The monitoring console will
show you the backlog of waiting batches.
From: Asim Jalis asimja...@gmail.com
Date: Friday, December 19, 2014 at 1:16 PM
To: user user@spark.apache.org
Subject:
registered with
20141219-110602-16777343-5050-658-0028
15:49:32.550 [Timer-0] WARN o.a.s.scheduler.TaskSchedulerImpl - Initial
job has not accepted any resources; check your cluster UI to ensure that
workers are registered and have sufficient memory
15:49:47.547 [Timer-0] WARN
So, at any point does a stream stop producing RDDs? If not, is there a
possibility, if the batching isn't working or is broken, that your disk /
RAM will fill up to the brim with unprocessed RDD backlog?
On Fri, Dec 19, 2014 at 1:29 PM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
Hello,
I'm experiencing an issue where yarn is scheduling two executors (the default)
regardless of what I enter as num-executors when submitting an application.
Background: I'm running Spark with Yarn on Amazon EMR. My cluster has two core
nodes and three task nodes. All five nodes are
Found a problem in the spark-shell, but can't confirm that it's related to
open issues on Spark's JIRA page. I was wondering if anyone could help
identify if this is an issue or if it's already being addressed.
Test: (in spark-shell)
case class Person(name: String, age: Int)
val peopleList =
This is experimental, but you can start the JDBC server from within your
own programs
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L45
by
passing it the HiveContext.
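A minimal sketch of that approach (assuming Spark 1.2 with Hive support on the classpath; the table name and data source are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext
  import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

  val sc = new SparkContext(new SparkConf().setAppName("jdbc-over-rdd"))
  val hiveContext = new HiveContext(sc)

  // Register and cache a SchemaRDD as a temp table so JDBC clients can query it.
  val people = hiveContext.jsonFile("hdfs:///data/people.json")   // placeholder source
  people.registerTempTable("people")
  hiveContext.cacheTable("people")

  // Start the Thrift/JDBC server inside this application, sharing the HiveContext.
  HiveThriftServer2.startWithContext(hiveContext)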
On Fri, Dec 19, 2014 at 6:04
AFAIK it's a known issue of some sort in the Scala REPL, which is what
the Spark REPL is. The PR that was closed was just adding tests to
show it's a bug. I don't know if there is any workaround now.
On Fri, Dec 19, 2014 at 7:21 PM, Jay Hutfles jayhutf...@gmail.com wrote:
Found a problem in the
Hi all,
I'm developing a spark application where I need to iteratively update an
RDD over a large number of iterations (1000+). From reading online,
I've found that I should use .checkpoint() to keep the graph from
growing too large. Even when doing this, I keep getting
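A minimal sketch of the checkpoint pattern being described (the checkpoint directory, update logic, and interval are illustrative):

  sc.setCheckpointDir("/tmp/checkpoints")            // placeholder directory
  var rdd = sc.parallelize(1 to 1000000).map(_.toDouble)
  for (i <- 1 to 1000) {
    rdd = rdd.map(_ * 0.99).cache()                  // placeholder iterative update
    if (i % 50 == 0) {
      rdd.checkpoint()                               // truncate the lineage periodically
      rdd.count()                                    // force evaluation so the checkpoint is written
    }
  }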
Hi All,
I notice that if we create a SparkContext in the driver, we need to call the stop
method to clear it.
SparkConf sparkConf = new
SparkConf().setAppName("FinancialEngineExecutor");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
.
String
Can Spark be built with Hadoop 2.6? All I see instructions up to are for 2.4
and there does not seem to be a hadoop2.6 profile. If it works with Hadoop
2.6, can anyone recommend how to build?
Yesterday, I changed the domain name in the mailing list archive settings
to remove .incubator so maybe it'll work now.
However, I also sent two emails about this through the nabble interface (in
this same thread) yesterday and they don't appear to have made it through
so not sure if it actually
You can use hadoop-2.4 profile and pass -Dhadoop.version=2.6.0
Cheers
On Fri, Dec 19, 2014 at 12:51 PM, sa asuka.s...@gmail.com wrote:
Can Spark be built with Hadoop 2.6? All I see instructions up to are for
2.4
and there does not seem to be a hadoop2.6 profile. If it works with Hadoop
2.6,
Running on Amazon EMR with YARN and Spark 1.1.1, I have trouble getting YARN
to use the number of executors that I specify in spark-submit:
--num-executors 2
In a cluster with two core nodes this typically results in only one executor
running at a time. I can play with the memory settings and
Andy:
I saw two emails from you from yesterday.
See this thread: http://search-hadoop.com/m/JW1q5opRsY1
Cheers
On Fri, Dec 19, 2014 at 12:51 PM, Andy Konwinski andykonwin...@gmail.com
wrote:
Yesterday, I changed the domain name in the mailing list archive settings
to remove .incubator so
Hey Michael,
Thank you for clarifying that. Is tachyon the right way to get compressed
data in memory or should we explore the option of adding compression to
cached data. This is because our uncompressed data set is too big to fit in
memory right now. I see the benefit of tachyon not just with
Hello,
I was successfully using my own customized Hadoop InputFormat class with
JavaSparkContext.newAPIHadoopFile(...)
Is there any way I can reuse my class in Spark Streaming?
soroka21
Yeah, Tachyon does sound like a good option here. Especially if you have
nested data, it's likely that Parquet in Tachyon will always be better
supported.
On Fri, Dec 19, 2014 at 2:17 PM, Sadhan Sood sadhan.s...@gmail.com wrote:
Hey Michael,
Thank you for clarifying that. Is tachyon the right
Hi, I am facing an issue with MySQL jars and spark-submit.
I am not running in yarn mode.
spark-submit --jars $(echo mysql-connector-java-5.1.34-bin.jar | tr ' ' ',')
--class com.abc.bcd.GetDBSomething myjar.jar abc bcd
Any help is really appreciated.
Thanks,
-D
14/12/19 23:42:10 INFO
Soroka,
You should be able to use the filestream() method of the
JavaStreamingContext. In case you need something more custom, the code
below is something I developed to provide the max functionality of the
Scala method, but implemented in Java.
//Set these to reflect your app and input format
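(As a separate, minimal Scala sketch of the fileStream approach — MyInputFormat is a hypothetical new-API input format class, and the directory and batch interval are placeholders:)

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(30))
  // Monitor a directory, parsing new files with the custom input format.
  val stream = ssc.fileStream[LongWritable, Text, MyInputFormat]("/input/dir")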
My application doesn't depend on hadoop-client directly.
It only depends on spark-core_2.10, which depends on hadoop-client 1.0.4. This
can be checked in the Maven repository at
http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.1.0
That's strange. How can I work around the
On Fri, Dec 19, 2014 at 4:05 PM, Haopu Wang hw...@qilinsoft.com wrote:
My application doesn't depend on hadoop-client directly.
It only depends on spark-core_2.10, which depends on hadoop-client 1.0.4.
This can be checked by Maven repository at
To clarify, there isn't a Hadoop 2.6 profile per se but you can build using
-Dhadoop.version=2.4 which works with Hadoop 2.6.
On Fri, Dec 19, 2014 at 12:55 Ted Yu yuzhih...@gmail.com wrote:
You can use hadoop-2.4 profile and pass -Dhadoop.version=2.6.0
Cheers
On Fri, Dec 19, 2014 at 12:51
Hi Sean,
I changed Spark to a provided dependency and declared hadoop-client 2.5.1 as a
compile dependency.
Now I see this error when running "mvn package". Do you know what could be the
reason?
[INFO] --- scala-maven-plugin:3.1.3:compile (default) @ testspark ---
[WARNING] Expected all
Here is the command I used:
mvn package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0
-Phive -DskipTests
FYI
On Fri, Dec 19, 2014 at 4:35 PM, Denny Lee denny.g@gmail.com wrote:
To clarify, there isn't a Hadoop 2.6 profile per se but you can build
using
Sorry Ted! I saw profile (-P) but missed the -D. My bad!
On Fri, Dec 19, 2014 at 16:46 Ted Yu yuzhih...@gmail.com wrote:
Here is the command I used:
mvn package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
-Dhadoop.version=2.6.0 -Phive -DskipTests
FYI
On Fri, Dec 19, 2014 at 4:35 PM, Denny
Thanks Michael, that makes sense.
On Fri, Dec 19, 2014 at 3:13 PM, Michael Armbrust mich...@databricks.com
wrote:
Yeah, tachyon does sound like a good option here. Especially if you have
nested data, its likely that parquet in tachyon will always be better
supported.
On Fri, Dec 19, 2014
How many cores / memory do you have available per NodeManager, and how
many cores / memory are you requesting for your job?
Remember that in YARN mode, Spark launches num-executors + 1
containers. The extra container, by default, reserves 1 core and about
1 GB of memory (more if running in cluster
Hi All,
Is there any API that can be used directly to write a SchemaRDD to HBase?
If not, what is the best way to write a SchemaRDD to HBase?
Thanks
Subacini