Hi all,
what would happen if I save an RDD via saveAsParquetFile to the same path
that the RDD was originally read from? Is that a safe thing to do in PySpark?
Thanks!
Did you find any other way around this issue?
I just found out that I have a 22-column data set, and now I am searching
for the best solution.
Has anyone else experienced this problem?
Best
Bojan
I've tried with the latest code and it seems to work. Which version are you using, Shahab?
From: yana [mailto:yana.kadiy...@gmail.com]
Sent: Wednesday, March 4, 2015 8:47 PM
To: shahab; user@spark.apache.org
Subject: RE: Does SparkSQL support "having count(fieldname)" in SQL
statement?
I think the
This doesn't involve Spark at all; I think this is entirely an issue with
how Scala deals with primitives and boxing. Often it can hide the details
for you, but IMO it just leads to far more confusing errors when things
don't work out. The issue here is that your map has value type Any, which
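For illustration, a hedged sketch (not from this thread) of the boxing behavior being described:
val m: Map[String, Any] = Map("rate" -> 1.5)
val v = m("rate")                       // static type Any, runtime class java.lang.Double
println(v.getClass)                     // prints: class java.lang.Double
val d: Double = v.asInstanceOf[Double]  // the explicit cast unboxes it back to scala.Double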
You can set the number of partitions dynamically -- it's just a parameter to
a method, so you can compute it however you want; it doesn't need to be
some static constant:
val dataSizeEstimate = yourMagicFunctionToEstimateDataSize()
val numberOfPartitions =
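For illustration, a hedged sketch of the idea above; yourMagicFunctionToEstimateDataSize is the hypothetical helper from the snippet, and the 128 MB-per-partition target and pairRdd are assumptions:
val dataSizeEstimate: Long = yourMagicFunctionToEstimateDataSize()  // hypothetical helper
val targetBytesPerPartition = 128L * 1024 * 1024                    // assumed target size per partition
val numberOfPartitions = math.max(1, (dataSizeEstimate / targetBytesPerPartition).toInt)
// pairRdd is an assumed key-value RDD; any shuffle operation that accepts numPartitions works here
val counts = pairRdd.reduceByKey(_ + _, numberOfPartitions)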
Hello,
We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge workers).
We use spark-submit to start an application.
We got the following error which leads to a failed stage:
Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4
times, most recent failure: Lost task
+user
On Wed, Mar 4, 2015, 8:21 AM Xiangrui Meng men...@gmail.com wrote:
You can use the save/load implementation in naive Bayes as reference:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
Ping me on the JIRA page to
Not yet.
Please let me know if you find a solution.
Regards
Sachin
On 4 Mar 2015 21:45, mael2210 [via Apache Spark User List]
ml-node+s1001560n21909...@n3.nabble.com wrote:
Hello,
I am facing the exact same issue. Could you solve the problem ?
Kind regards
--
Thanks Cheng, my problem was a misspelling which I just fixed;
unfortunately the exception message sometimes does not pinpoint the exact
reason. Sorry, my bad.
On Wed, Mar 4, 2015 at 5:02 PM, Cheng, Hao hao.ch...@intel.com wrote:
I've tried with the latest code and it seems to work. Which
Actually your Pregel code works for me:
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertexlist = Array((1L,"One"), (2L,"Two"), (3L,"Three"),
(4L,"Four"), (5L,"Five"), (6L,"Six"))
val edgelist = Array(Edge(6,5,"6 to 5"), Edge(5,4,"5 to 4"), Edge(4,3,"4 to 3"),
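A hedged sketch of how the truncated snippet above would typically continue (the remaining edges from the original thread are not reproduced here):
val vertices: RDD[(VertexId, String)] = sc.parallelize(vertexlist)
val edges: RDD[Edge[String]] = sc.parallelize(edgelist)
val graph = Graph(vertices, edges)
// print each edge triplet to verify the graph was built as expected
graph.triplets.collect().foreach(t => println(s"${t.srcAttr} -> ${t.dstAttr}: ${t.attr}"))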
I tried. I still get the same error.
15/03/04 09:01:50 INFO parse.ParseDriver: Parsing command: select * from
TableName where value like '%Restaurant%'
15/03/04 09:01:50 INFO parse.ParseDriver: Parse Completed.
15/03/04 09:01:50 INFO metastore.HiveMetaStore: 0: get_table : db=default
Hi All,
I am trying to create a class that wraps functionality that I need; some
of these functions require access to the SparkContext, which I would like to
pass in. I know that the SparkContext is not serializable, and I am not
planning on passing it to worker nodes or anything, I just want
Hi Todd and Marcelo,
Thanks for helping me. I was able to launch the history server on Windows
without any issues. One problem I am running into right now: I always get
the message 'no completed applications found' in the history server UI. But I was
able to browse through these applications from
Hi Sparkers,
How do I integrate HBase with Spark?
Replies appreciated!
Regards,
Sandeep.v
Follow up:
We retried again, this time after *decreasing* spark.parallelism. It was set
to 16000 before (5 times the number of cores in our cluster). It is now
down to 6400 (2 times the number of cores).
And it got past the point where it failed before.
Does the MapOutputTracker have a limit on
Hello,
I was wondering where all the log files are located on a standalone
cluster:
1. the executor logs are in the work directory on each slave machine
(stdout/stderr)
- I've noticed that GC information is in stdout, and stage information
in stderr
- *Could we get more
Hi,
I am tuning a Hive dataset on Spark SQL deployed via the Thrift server.
How can I change the number of partitions after caching the table on the Thrift
server?
I have tried the following but am still getting the same number of partitions
after caching:
spark.default.parallelism
Yes. I do see files, actually I missed copying the other settings:
spark.master spark://
skarri-lt05.redmond.corp.microsoft.com:7077
spark.eventLog.enabled true
spark.rdd.compress true
spark.storage.memoryFraction 1
spark.core.connection.ack.wait.timeout 6000
I meant spark.default.parallelism of course.
On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber thomas.ger...@radius.com
wrote:
Follow up:
We retried again, this time after *decreasing* spark.parallelism. It was set
to 16000 before (5 times the number of cores in our cluster). It is now
down to
On Wed, Mar 4, 2015 at 10:08 AM, Srini Karri skarri@gmail.com wrote:
spark.executor.extraClassPath
D:\\Apache\\spark-1.2.1-bin-hadoop2\\spark-1.2.1-bin-hadoop2.4\\bin\\classes
spark.eventLog.dir
D:/Apache/spark-1.2.1-bin-hadoop2/spark-1.2.1-bin-hadoop2.4/bin/tmp/spark-events
Hello,
Is there a built-in function for getting the median and standard deviation in
Spark SQL? Currently I am converting the SchemaRDD to a DoubleRDD and calling
doubleRDD.stats(), but it still does not have the median.
What is the most efficient way to get the median?
Thanks Regards
Tridib
Hi Marcelo,
I found the problem from
http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3cCAL+LEBfzzjugOoB2iFFdz_=9TQsH=DaiKY=cvydfydg3ac5...@mail.gmail.com%3e
this link. The problem is that the application I am running is not generating
the APPLICATION_COMPLETE file. If I add this file
No, this is not safe to do.
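A hedged sketch (paths and the sqlContext name are assumptions) of the usual workaround for the question quoted below: write the result to a different path instead of overwriting the path that is still being read.
val input = sqlContext.parquetFile("/data/events")
// ... transformations on `input` ...
input.saveAsParquetFile("/data/events_v2")  // a new path, never the one being read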
On Wed, Mar 4, 2015 at 7:14 AM, Karlson ksonsp...@siberie.de wrote:
Hi all,
what would happen if I save an RDD via saveAsParquetFile to the same path
that the RDD was originally read from? Is that a safe thing to do in PySpark?
Thanks!
Thanks!
On Wed, Mar 4, 2015 at 3:58 PM, Michael Armbrust mich...@databricks.com
wrote:
It is somewhat out of date, but here is what we have so far:
https://github.com/marmbrus/sql-typed
On Wed, Mar 4, 2015 at 12:53 PM, Justin Pihony justin.pih...@gmail.com
wrote:
I am pretty sure that I
Hello,
sometimes, in the *middle* of a job, the job stops (status is then seen as
FINISHED in the master).
There isn't anything wrong in the shell/submit output.
When looking at the executor logs, I see logs like this:
15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker
Please take a look at DoubleRDDFunctions.scala :
/** Compute the mean of this RDD's elements. */
def mean(): Double = stats().mean
/** Compute the variance of this RDD's elements. */
def variance(): Double = stats().variance
/** Compute the standard deviation of this RDD's elements. */
def stdev(): Double = stats().stdev
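As a hedged aside (not from the thread): stats() exposes mean and stdev but no median; one exact, if heavyweight, approach is a full sort plus lookups by index, using the doubleRDD from the question:
val values = doubleRDD.cache()
val n = values.count()
// sort the values and key each one by its position
val indexed = values.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
val median =
  if (n % 2 == 1) indexed.lookup(n / 2).head
  else (indexed.lookup(n / 2 - 1).head + indexed.lookup(n / 2).head) / 2.0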
Hi,
I am converting a jsonRDD to Parquet by saving it as a Parquet file
(saveAsParquetFile):
cacheContext.jsonFile("file:///u1/sample.json").saveAsParquetFile("sample.parquet")
I am reading parquet file and registering it as a table :
val parquet = cacheContext.parquetFile("sample_trades.parquet")
I am pretty sure that I saw a presentation where SparkSQL could be executed
with static analysis; however, I cannot find the presentation now, nor can I
find any documentation or research papers on the topic. So, I am curious whether
there is indeed any work going on for this topic. The two things I
It is somewhat out of date, but here is what we have so far:
https://github.com/marmbrus/sql-typed
On Wed, Mar 4, 2015 at 12:53 PM, Justin Pihony justin.pih...@gmail.com
wrote:
I am pretty sure that I saw a presentation where SparkSQL could be executed
with static analysis, however I cannot
Hi,
There are some examples in spark/example
https://github.com/apache/spark/tree/master/examples and there are also
some examples in spark package http://spark-packages.org/.
And I find this blog
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
is quite good.
Hope it
Look at the logs:
yarn logs --applicationId <applicationId>
That should give the error.
On Wed, Mar 4, 2015 at 9:21 AM, sachin Singh sachin.sha...@gmail.com
wrote:
Not yet.
Please let me know if you find a solution.
Regards
Sachin
On 4 Mar 2015 21:45, mael2210 [via Apache Spark User List]
Hi All,
I am currently having problems with the Maven dependencies for version 1.2.0
of spark-core and spark-hive. Here are my dependencies:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
</dependency>
<dependency>
Hi Kevin,
If you're using CDH, I'd recommend using the CDH repo [1], and also
the CDH version when building your app.
[1]
http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_vd_cdh5_maven_repo.html
On Wed, Mar 4, 2015 at 4:34 PM, Kevin Peng kpe...@gmail.com wrote:
I'm sorry, but how do I look at the Mesos logs?
Where are they?
On March 4, 2015, at 6:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
You can check the Mesos logs and see what's really happening.
Thanks
Best Regards
On Wed, Mar 4, 2015 at 3:10 PM, lisendong lisend...@163.com
Hi,
I have a set of machines (say 5) and want to evenly launch a number (say 8) of
Kafka receivers on those machines. In my code I did something like the
following, as suggested in the Spark docs: val streams = (1 to
numReceivers).map(_ => ssc.receiverStream(new MyKafkaReceiver()))
Seems like someone set up m2.mines.com as a mirror in your pom file
or ~/.m2/settings.xml, and it doesn't mirror Spark 1.2 (or does but is
in a messed up state).
On Wed, Mar 4, 2015 at 3:49 PM, kpeng1 kpe...@gmail.com wrote:
Hi All,
I am currently having problem with the maven dependencies for
Also,
I was experiencing another problem which might be related:
Error communicating with MapOutputTracker (see email in the ML today).
I just thought I would mention it in case it is relevant.
On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
1.2.1
Also, I was
Thanks, it's an active project.
Will it be released with Spark 1.3.0?
From: 鹰 [mailto:980548...@qq.com]
Sent: Thursday, March 05, 2015 11:19 AM
To: Haopu Wang; user
Subject: Re: Where can I find more information about the R interface for Spark?
you can
It uses HashPartitioner to distribute the records to different partitions, and
since the key is just an integer, it spreads evenly across output partitions.
From the code, each resulting partition will get a very similar number of
records.
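For illustration, a hedged sketch (not from the thread) showing HashPartitioner spreading sequential integer keys nearly evenly across output partitions:
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair RDD functions on older Spark versions
val pairs = sc.parallelize(1 to 100000).map(i => (i, i))       // integer keys
val repartitioned = pairs.partitionBy(new HashPartitioner(8))  // 8 output partitions
// count records per partition; for sequential integer keys the counts come out near-equal
repartitioned.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size))).collect()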
Thanks.
Zhan Zhang
On Mar 4, 2015, at 3:47 PM, Du Li
When I run Spark 1.2.1, I see this display that wasn't in the previous
releases:
[Stage 12:=================>                  (6 + 1) / 16]
[Stage 12:=======================>            (8 + 1) / 16]
[Stage 12:==
Do you have any update on SparkR?
Marcelo,
Thanks. The one in the CDH repo fixed it :)
On Wed, Mar 4, 2015 at 4:37 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Kevin,
If you're using CDH, I'd recommend using the CDH repo [1], and also
the CDH version when building your app.
[1]
You can search for SparkR on Google or find it on GitHub.
What release are you using?
SPARK-3923 went into the 1.2.0 release.
Cheers
On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber thomas.ger...@radius.com
wrote:
Hello,
sometimes, in the *middle* of a job, the job stops (status is then seen
as FINISHED in the master).
There isn't anything wrong in
Ted,
I am currently using the CDH 5.3 distro, which has Spark 1.2.0, so I am not too
sure about the compatibility issues between 1.2.0 and 1.2.1; that is why I
want to stick to 1.2.0.
On Wed, Mar 4, 2015 at 4:25 PM, Ted Yu yuzhih...@gmail.com wrote:
Kevin:
You can try with 1.2.1
See
Hi,
My RDDs are created from a Kafka stream. After receiving an RDD, I want to
coalesce/repartition it so that the data will be processed on a set of machines
in parallel, as evenly as possible. The number of processing nodes is larger than
the number of receiving nodes.
My question is how the
I'm using Spark 1.0.0 with Cloudera,
but I want to use the new ALS code, which supports more features such as RDD
cache level (MEMORY_ONLY), checkpointing, and so on.
What is the easiest way to use the new ALS code?
I only need the MLlib ALS code, so maybe I don't need to update all of the
Spark MLlib
1.2.1
Also, I was using the following parameters, which are 10 times the default
ones:
spark.akka.timeout 1000
spark.akka.heartbeat.pauses 6
spark.akka.failure-detector.threshold 3000.0
spark.akka.heartbeat.interval 1
which should have helped *avoid* the problem if I understand
One really interesting thing is that when I'm using the
netty-based spark.shuffle.blockTransferService, there are no OOM error
messages (java.lang.OutOfMemoryError: Java heap space).
Any idea why it's not there?
I'm using Spark 1.2.1.
Jianshi
On Thu, Mar 5, 2015 at 1:56 PM, Jianshi Huang
It depends on your setup but one of the locations is /var/log/mesos
On Wed, Mar 4, 2015 at 19:11 lisendong lisend...@163.com wrote:
I'm sorry, but how do I look at the Mesos logs?
Where are they?
On March 4, 2015, at 6:06 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
You can check in the mesos logs
Figured it out: I need to override the preferredLocation() method in my MyReceiver
class.
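A hedged sketch of that fix (hostnames and the Kafka-consuming logic are assumptions): override preferredLocation so each receiver instance is scheduled on the host you choose.
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
class MyKafkaReceiver(host: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  override def preferredLocation: Option[String] = Some(host) // pin this receiver to `host`
  def onStart(): Unit = { /* start threads that consume Kafka and call store(...) */ }
  def onStop(): Unit = { }
}
// Spread 8 receivers round-robin over 5 known worker hostnames (hostnames assumed):
// val hosts = Seq("host1", "host2", "host3", "host4", "host5")
// val streams = (0 until 8).map(i => ssc.receiverStream(new MyKafkaReceiver(hosts(i % hosts.size))))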
On Wednesday, March 4, 2015 3:35 PM, Du Li l...@yahoo-inc.com.INVALID
wrote:
Hi,
I have a set of machines (say 5) and want to evenly launch a number (say 8) of
kafka receivers on those machines. In
Hi Du,
You could try to sleep for several seconds after creating the streaming context to
let all the executors register; then the receivers can be distributed across the
nodes more evenly. Also, setting the locality is another way, as you mentioned.
Thanks
Jerry
From: Du Li
There's some skew:
6461640 SUCCESS PROCESS_LOCAL 200 / 2015/03/04 23:45:47 1.1 min 6 s 198.6 MB 21.1 GB 240.8 MB
5961590 SUCCESS PROCESS_LOCAL 30 / 2015/03/04 23:45:47 44 s 5 s 200.7 MB 4.8 GB 154.0 MB
But I expect this kind of skewness to be quite common.
Jianshi
On Thu, Mar 5, 2015 at 3:48 PM,
Yes, hostname is enough.
I think it is currently hard for user code to get the worker list from the
standalone master. If you can get the Master object, you could get the worker
list, but AFAIK it may be difficult to get this object. All you can do is
manually get the worker list and
Yes, if one key has too many values, there is still a chance of hitting the OOM.
Thanks
Jerry
From: Jianshi Huang [mailto:jianshi.hu...@gmail.com]
Sent: Thursday, March 5, 2015 3:49 PM
To: Shao, Saisai
Cc: Cheng, Hao; user
Subject: Re: Having lots of FetchFailedException in join
I see. I'm using
Hi,
On Thu, Mar 5, 2015 at 12:20 AM, Imran Rashid iras...@cloudera.com wrote:
This doesn't involve Spark at all; I think this is entirely an issue with
how Scala deals with primitives and boxing. Often it can hide the details
for you, but IMO it just leads to far more confusing errors when
Hi,
In our project, we use standalone dual masters + ZooKeeper for
the HA of the Spark master.
Now the problem is: how do we know which master is the currently alive
master?
We tried to read the info that the master stores in ZooKeeper, but we
found there is no information to
Which Spark version did you use? I tried spark-1.2.1 and didn't hit this
problem.
scala> val m = hiveContext.sql("select * from testtable where value like
'%Restaurant%'")
15/03/05 02:02:30 INFO ParseDriver: Parsing command: select * from testtable
where value like '%Restaurant%'
15/03/05
Please follow SPARK-5654
On Wed, Mar 4, 2015 at 7:22 PM, Haopu Wang hw...@qilinsoft.com wrote:
Thanks, it's an active project.
Will it be released with Spark 1.3.0?
--
*From:* 鹰 [mailto:980548...@qq.com]
*Sent:* Thursday, March 05, 2015 11:19 AM
*To:*
Replace
val sqlContext = new SQLContext(sparkContext)
with
@transient
val sqlContext = new SQLContext(sparkContext)
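A hedged sketch of the same idea in a wrapper class (class and member names are assumptions): marking the contexts @transient keeps them out of serialized closures while still usable on the driver.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
class DriverSideHelper(@transient val sparkContext: SparkContext) extends Serializable {
  @transient val sqlContext = new SQLContext(sparkContext)
  // driver-side methods that use sparkContext / sqlContext go here
}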
-Original Message-
From: kpeng1 [mailto:kpe...@gmail.com]
Sent: 04 March 2015 23:39
To: user@spark.apache.org
Subject: Passing around SparkContext with in the Driver
Hi
I am trying to read an Avro RDD, transform it, and write it out.
I am able to run it fine locally, but when I run it on the cluster I see issues
with Avro.
export SPARK_HOME=/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1
export SPARK_YARN_USER_ENV=CLASSPATH=/apache/hadoop/conf
export
Hi Jerry,
Thanks for your response.
Is there a way to get the list of currently registered/live workers? Even in
order to provide preferredLocation, it would be safer to know which workers are
active. I guess I only need to provide the hostname, right?
Thanks, Du
On Wednesday, March 4, 2015
Hi Jianshi,
From my understanding, it may not be a problem with NIO or Netty. Looking at
your stack trace, the OOM occurred in EAOM (ExternalAppendOnlyMap);
theoretically EAOM can spill the data to disk if memory is not enough, but
there might be some issues when the join key is skewed or a key
Hi Saisai,
What are your suggested settings for monitoring shuffle? I've
enabled -XX:+PrintGCDetails -XX:+PrintGCTimeStamps for GC logging.
I found SPARK-3461 (Support external groupByKey using
repartitionAndSortWithinPartitions), which wants to make groupByKey use external
storage. It's still open
I think what you could do is monitor through the web UI to see if there's any
skew or other symptoms in shuffle writes and reads. For GC you could use the
configuration below, as you mentioned.
On the Spark core side, all the shuffle-related operations can spill the data
to disk, and there is no need to
I see. I'm using core's join. The data might have some skewness (checking).
I understand shuffle can spill data to disk, but when consuming it, say in
cogroup or groupByKey, it still needs to read a whole group's elements,
right? I guess the OOM happened there when reading very large groups.
Jianshi
Hi,
I am trying to run Spark 1.2.1 with Hadoop 1.0.4 on a cluster and configure it to
run with log4j2.
The problem is that spark-assembly.jar contains log4j and slf4j classes
compatible with log4j 1.2, and so it detects that it should use log4j 1.2 (
Friends,
I'm trying to parse JSON-formatted Kafka messages and then send them back to
Cassandra. I have two problems:
1. I got the exception below. How do I check for an empty RDD?
Exception in thread "main" java.lang.UnsupportedOperationException: empty
collection
at
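A hedged sketch (the stream variable and the processing body are assumptions) of the usual guard for this exception: check the RDD for emptiness before calling first() or reduce() on it.
stream.foreachRDD { rdd =>
  if (rdd.take(1).nonEmpty) {   // cheap emptiness check; rdd.isEmpty() only exists in newer Spark
    // parse the JSON messages and write them to Cassandra here
  }
}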
Generally the logs are located in /var/log/mesos, but the definitive
configuration can be found via the /etc/mesos-master/... configuration
files. There should be a configuration file labeled log_dir.
ps -ax | grep mesos should also show the configuration if it
is set.
I changed spark.shuffle.blockTransferService to nio and now I'm getting OOM
errors, I'm doing a big join operation.
15/03/04 19:04:07 ERROR Executor: Exception in task 107.0 in stage 2.0 (TID
6207)
java.lang.OutOfMemoryError: Java heap space
at
14/06/19 15:03:36 WARN LoadSnappy: Snappy native library not loaded
The problem is that the Snappy library is not loaded on the workers. This is because
you probably called System.loadLibrary outside the map function, so it is
not shipped to the workers.
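A hedged sketch of that fix (processRecord and rdd are assumptions; the native library must still be available on every worker): perform the load inside the task so it runs in each executor JVM, not only on the driver.
rdd.mapPartitions { iter =>
  System.loadLibrary("snappy")   // runs on the worker for this partition
  iter.map(processRecord)        // processRecord is a hypothetical per-record function
}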
Regards
Jishnu Prathap
I've also tried the following:
Configuration hadoopConfiguration = new Configuration();
hadoopConfiguration.set("multilinejsoninputformat.member", "itemSet");
JavaStreamingContext ssc =
JavaStreamingContext.getOrCreate(checkpointDirectory, hadoopConfiguration,
factory, false);
but I
When you say multiple directories, make sure those directories are
available and Spark has permission to write to them. You can
look at the worker logs to see the exact reason for the failure.
Thanks
Best Regards
On Tue, Mar 3, 2015 at 6:45 PM, lisendong lisend...@163.com wrote:
As
We have gotten it to work...
--- Original Message ---
From: nitinkak001 nitinkak...@gmail.com
Sent: March 3, 2015 7:46 AM
To: user@spark.apache.org
Subject: Re: Running Spark jobs via oozie
I am also starting to work on this one. Did you get any solution to this
issue?
Thanks very much, I used it and it works fine for me.
On 4 March 2015 at 11:56, Arush Kharbanda ar...@sigmoidanalytics.com
wrote:
For java You can use hive-jdbc connectivity jars to connect to Spark-SQL.
The driver is inside the hive-jdbc Jar.
15/03/04 09:26:36 INFO ClientCnxn: Client session timed out, have not heard
from server in 26679ms for sessionid 0x34bbf3313a8001b, closing socket
connection and attempting reconnect
15/03/04 09:26:36 INFO ConnectionStateManager: State change: SUSPENDED
15/03/04 09:26:36 INFO
You can check the Mesos logs and see what's really happening.
Thanks
Best Regards
On Wed, Mar 4, 2015 at 3:10 PM, lisendong lisend...@163.com wrote:
15/03/04 09:26:36 INFO ClientCnxn: Client session timed out, have not heard
from server in 26679ms for sessionid 0x34bbf3313a8001b, closing
That could be a corner case bug. How do you add the 3rd party library to
the class path of the driver? Through spark-submit? Could you give the
command you used?
TD
On Wed, Mar 4, 2015 at 12:42 AM, Emre Sevinc emre.sev...@gmail.com wrote:
I've also tried the following:
Configuration
You can look at the following
- spark.akka.timeout
- spark.akka.heartbeat.pauses
from http://spark.apache.org/docs/1.2.0/configuration.html
Thanks
Best Regards
On Tue, Mar 3, 2015 at 4:46 PM, twinkle sachdeva twinkle.sachd...@gmail.com
wrote:
Hi,
Is there any relation between removing
Hm, what do you mean? You can control, to some extent, the number of
partitions when you read the data, and can repartition if needed.
You can set the default parallelism too so that it takes effect for most
ops that create an RDD. One number of partitions is usually about right for all
work (2x or so
Hi, in the roadmap of Spark in 2015 (link:
http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx),
I saw SchemaRDD is designed to be the basis of BOTH Spark
Streaming and Spark SQL.
My question is: what's the typical usage of SchemaRDD in a Spark
Streaming application?
The file stream does not use a receiver. Maybe that was not clear in the
programming guide; I am updating it for the 1.3 release right now, and I will make
it clearer.
And the file stream has full reliability. Read about this in the programming guide.
I am a beginner to Spark and have some simple questions regarding the use of
RDDs in Python.
Suppose I have a matrix called data_matrix. I pass it to an RDD using
RDD_matrix = sc.parallelize(data_matrix)
but I will have a problem if I want to know the dimensions of the matrix in
Spark, because Spark
Hi,
I have a function with signature
def aggFun1(rdd: RDD[(Long, (Long, Double))]):
RDD[(Long, Any)]
and one with
def aggFun2[_Key: ClassTag, _Index](rdd: RDD[(_Key, (_Index, Double))]):
RDD[(_Key, Double)]
where all Double classes involved are scala.Double classes (according
to
Is FileInputDStream returned by fileStream method a reliable receiver?
In the Spark Streaming Guide it says:
There can be two kinds of data sources based on their *reliability*.
Sources (like Kafka and Flume) allow the transferred data to be
acknowledged. If the system receiving data from
Hi Sean,
If you know a stage needs unusually high parallelism, for example, you can
repartition further for that stage.
The problem is we may not know whether high parallelism is needed, e.g.
for the join operator, high parallelism may only be necessary for some
datasets where lots of data can
I'm adding this 3rd party library to my Maven pom.xml file so that it's
embedded into the JAR I send to spark-submit:
<dependency>
  <groupId>json-mapreduce</groupId>
  <artifactId>json-mapreduce</artifactId>
  <version>1.0-SNAPSHOT</version>
  <exclusions>
    <exclusion>
You may look at https://issues.apache.org/jira/browse/SPARK-4516
Thanks
Best Regards
On Wed, Mar 4, 2015 at 12:25 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I got this error message:
15/03/03 10:22:41 ERROR OneForOneBlockFetcher: Failed while starting block
fetches
For java You can use hive-jdbc connectivity jars to connect to Spark-SQL.
The driver is inside the hive-jdbc Jar.
http://hive.apache.org/javadocs/r0.11.0/api/org/apache/hadoop/hive/jdbc/HiveDriver.html
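A hedged sketch of using it from Scala (host, port, and table name are assumptions; the Spark SQL Thrift server speaks the HiveServer2 protocol, so the driver class is org.apache.hive.jdbc.HiveDriver rather than the older HiveServer1 driver linked above):
import java.sql.DriverManager
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
try {
  val rs = conn.createStatement().executeQuery("SELECT count(*) FROM my_table")
  while (rs.next()) println(rs.getLong(1))
} finally {
  conn.close()
}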
From the lines pointed to in the exception log, I figured out that my code is
unable to get the Spark context. To isolate the problem, I've written a
small piece of code as below:
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
public class Test {
public static void
Parallelism doesn't really affect the throughput as long as it's:
- not less than the number of available execution slots,
- ... and probably some low multiple of them to even out task size effects
- not so high that the bookkeeping overhead dominates
Although you may need to select different
Hi,
It seems that SparkSQL, even with HiveContext, does not support SQL
statements like: SELECT category, count(1) AS cnt FROM products GROUP BY
category HAVING cnt > 10;
I get this exception:
Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
Unresolved attributes:
Hi,
I guess the toDF() API is in Spark 1.3, which requires building from source
code?
Patcharee
On 3 March 2015 13:42, Cheng, Hao wrote:
Using the SchemaRDD / DataFrame API via HiveContext
Assume you're using the latest code, something probably like:
val hc = new HiveContext(sc)
import
Hi,
I have a cluster running CDH 5.2.1 and I have a Mesos cluster (version
0.18.1). Through an Oozie Java action I want to submit a Spark job to the
Mesos cluster. Before configuring it as an Oozie job, I'm testing the Java
action from the command line and getting the exception below. While running I'm
I think the problem is that you are using an alias in the HAVING clause. I am
not able to try it just now, but see if HAVING count(*) > 2 works (i.e. don't use cnt).
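A hedged sketch of that workaround, with the table and column names taken from the original question: repeat the aggregate expression in HAVING instead of referencing the alias.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("SELECT category, count(1) AS cnt FROM products GROUP BY category HAVING count(1) > 10")
  .collect()
  .foreach(println)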
Original message from: shahab
Why don't you build the query string before you pass it to the hql function
(by appending strings)? Also, the hql function is deprecated; you should use sql.
http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
On Wed, Mar 4, 2015 at 6:15 AM, Anusha Shamanur
Looks like you have two Netty jars in the classpath.
Thanks
Best Regards
On Wed, Mar 4, 2015 at 5:14 PM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
From the lines pointed in the exception log, I figured out that my code is
unable to get the spark context. To isolate
You can try increasing the Akka timeouts in the config; you can set the
following in your config:
spark.core.connection.ack.wait.timeout: 600
spark.akka.timeout: 1000 (in secs)
spark.akka.frameSize:50
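A hedged sketch of applying the same settings programmatically on a SparkConf (values copied from above):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .set("spark.core.connection.ack.wait.timeout", "600")
  .set("spark.akka.timeout", "1000")  // seconds
  .set("spark.akka.frameSize", "50")  // MB
val sc = new SparkContext(conf)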
On Wed, Mar 4, 2015 at 5:14 PM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Sorry for the confusion.
All are running Hadoop services. Node 1 is the namenode whereas Nodes 2 and
3 are datanodes.
Best,
Guillaume Guy
* +1 919 - 972 - 8750*
On Sat, Feb 28, 2015 at 1:09 AM, Sean Owen so...@cloudera.com wrote:
Is machine 1 the only one running an HDFS data node? You