One approach would be to write pure MapReduce and Spark jobs (e.g. word count,
filter, join, groupBy, etc.) and benchmark them. Another would be
to pick something that runs on top of MapReduce/Spark and benchmark on it
(e.g. benchmark against Hive and Spark SQL).
Thanks
Best Regards
On Mon, Oct
Hi Ken,
AFAIK, you can specify the following in the spark-env.sh file across the
cluster
export HIVE_HOME=/path/to/hive/
export HIVE_CONF_DIR=/path/to/hive/conf/
And it is not necessary for the worker nodes to have access to Hive's
metastore directory.
Hey Cheng,
Right now we aren't using stable APIs to communicate with the Hive
Metastore. We didn't want to drop support for Hive 0.12, so right now
we are using a shim layer to support compiling for 0.12 and 0.13. This
is very costly to maintain.
If Hive has a stable meta-data API for talking to
You can check this project out
https://github.com/sigmoidanalytics/spork-streaming/ (it is a bit
outdated, but works). It is basically the integration of Pig on
Spark Streaming. You can write Pig scripts and under the hood they are executed
as Spark Streaming jobs. To get you started quickly, have a
It might kind of work, but you are effectively making all of your
workers into mini, separate Spark drivers in their own right. This
might cause snags down the line as this isn't the normal thing to do.
On Tue, Oct 28, 2014 at 12:11 AM, Localhost shell
universal.localh...@gmail.com wrote:
Hey
You can use Spark Streaming to get the transactions from those TCP connections
periodically, and you can push the data into HBase accordingly. Now,
regarding the querying part, you can use a database like Redis, which
does the key-value storing for you. You can use the RDDs to query
Hi,
I have a standalone Spark setup, where the executor is set to have 6.3 GB memory.
As I am using two workers, in total there is 12.6 GB memory and 4 cores.
I am trying to cache an RDD with an approximate size of 3.2 GB, but apparently
it is not cached, as I can neither see BlockManagerMasterActor:
What setting are you using for
persist() or cache()?
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
On Tue, Oct 28, 2014 at 6:18 PM, shahab shahab.mok...@gmail.com wrote:
Hi,
I have a standalone spark , where the executor is set to have 6.3 G memory
, as I am
Oops, the reference for the above code:
http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945
On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu chengi.liu...@gmail.com
wrote:
Hi,
I have three rdds.. X,y and p
X is matrix rdd (mXn), y
Hi,
I have three RDDs: X, y and p.
X is a matrix RDD (m x n), y is an (m x 1) vector
and p is an (m x 1) probability vector.
Now, I am trying to sample k rows from X and the corresponding entries in y
based on the probability vector p.
Here is the Python implementation:
import random
from bisect import bisect
Did you just call cache()? By itself it does nothing, but once an action
requires it to be computed, it should become cached.
On Oct 28, 2014 8:19 AM, shahab shahab.mok...@gmail.com wrote:
Hi,
I have a standalone spark , where the executor is set to have 6.3 G memory
, as I am using two workers
_cumm = [p[0]]
for i in range(1, len(p)):
    _cumm.append(_cumm[-1] + p[i])
index = set([bisect(_cumm, random.random()) for i in range(k)])
chosed_x = X.zipWithIndex().filter(lambda (v, i): i in index).map(lambda (v, i): v)
chosed_y = [v for i, v
Is there an equivalent way of doing the following:
a = [1,2,3,4]
reduce(lambda x, y: x+[x[-1]+y], a, [0])[1:]
??
The issue with the above suggestion is that the population is a hefty data
structure :-/
On Tue, Oct 28, 2014 at 12:42 AM, Davies Liu dav...@databricks.com wrote:
_cumm = [p[0]]
Dear Sir/Madam,
This is Songtao, living in Singapore and doing some research on big data projects
at NUS.
I want to be an organiser for the Singapore Meetup.
Thanks.
Songtao
I used cache followed by a count on the RDD to ensure that caching is
performed.
val rdd = srdd.flatMap(mapProfile_To_Sessions).cache
val count = rdd.count
//so at this point the RDD should be cached, right?
On Tue, Oct 28, 2014 at 8:35 AM, Sean Owen so...@cloudera.com wrote:
Did you just call
Hi,
I am submitting a Spark application in the following fashion:
bin/spark-submit --class NetworkCount --master spark://abc.test.com:7077
try/simple-project/target/simple-project-1.0-jar-with-dependencies.jar
But is there any other way to submit a Spark application through code?
Like for
Hi TD, is it possible to run Spark 24/7? I am using updateStateByKey and I
am streaming 3 lakh records in half an hour, and I am not getting the correct result.
Also, I am not able to run Spark Streaming 24/7: after a few hours I get an
array out of bounds exception even if I am not streaming anything. BTW
I am having the same issue. I am using updateStateByKey, and over a period a
set of data will not change; I would like to save it and delete it from the state.
Have you found the answer? Please share your views. Thanks for your time.
How about directly running it?
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(5),
  "/home/akhld/mobi/localclusterxx/spark-1")
val lines = ssc.socketTextStream("localhost", 12345)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x,
Hello
I am trying to reduce the number of Java threads (about 80 on my system) to as
few as possible.
What settings can be changed in spark-1.1.0/conf/spark-env.sh (or in other places
as well)?
I am also using Hadoop for storing data on HDFS.
Thank you,
Wanda
I mean updating the spark conf not only in the driver, but also in the
Spark Workers.
Because the driver configurations cannot be read by the Executors, they
still use the default spark.io.compression.codec to deserialize the tasks.
Best Regards,
Shixiong Zhu
2014-10-28 16:39 GMT+08:00 buring
I seem to recall there were some specific requirements on how to import the
implicits.
Here is the issue:
scala> import org.apache.spark.mllib.rdd.RDDFunctions._
<console>:10: error: object RDDFunctions in package rdd cannot be accessed
in package org.apache.spark.mllib.rdd
import
Although nobody has answered, as I tested, Row, MutableValue and their
subclasses are not registered by default,
which I think they should be, since they will absolutely show up in Spark SQL.
2014-10-26 23:43 GMT+08:00 Fengyun RAO raofeng...@gmail.com:
In Tuning Spark
Hi
I know we can create a streaming context with new JavaStreamingContext(master,
appName, batchDuration, sparkHome, jarFile),
but to run the application we will have to use
spark-home/spark-submit --class NetworkCount.
I want to skip submitting manually; I want to invoke this Spark app when a
What is the motivation behind this ?
You can start with master as local[NO_OF_THREADS]. Reducing the threads at
all other places can have unexpected results. Take a look at this.
http://spark.apache.org/docs/latest/configuration.html.
Prashant Sharma
On Tue, Oct 28, 2014 at 2:08 PM, Wanda
I am trying to get a software trace and I need to get the number of active
threads as low as I can in order to inspect the active part of the workload
From: Prashant Sharma scrapco...@gmail.com
To: Wanda Hawk wanda_haw...@yahoo.com
Cc: user@spark.apache.org
Because org.apache.spark.mllib.rdd.RDDFunctions is a class private to mllib,
it can only be called by functions within mllib.
2014-10-28 17:09 GMT+08:00 Stephen Boesch java...@gmail.com:
I seem to recall there were some specific requirements on how to import
the implicits.
Here is the issue:
Hi,
I am currently struggling with how to properly set up Spark to perform only one
map, flatMap, etc. at a time. In other words, my map uses a multi-core algorithm, so I
would like to have only one map task running in order to use all the machine's cores.
Thank you in advance for advice and replies.
Jan
Hi Yanbo,
That is not the issue: notice that importing the object is fine:
scala> import org.apache.spark.mllib.rdd.RDDFunctions
import org.apache.spark.mllib.rdd.RDDFunctions
scala> import org.apache.spark.mllib.rdd.RDDFunctions._
<console>:11: error: object RDDFunctions in package rdd cannot be
I had an offline discussion with Akhil, but this issue is still not resolved.
2014-10-24 0:18 GMT-07:00 Akhil Das ak...@sigmoidanalytics.com:
Make sure the guava jar
http://mvnrepository.com/artifact/com.google.guava/guava/12.0 is
present in the classpath.
Thanks
Best Regards
On Thu, Oct 23, 2014
Hi friends,
I use Spark SQL (Spark 1.1) to operate on data in Hive 0.12, and the job fails
when the data is large. How should I tune it?
spark-defaults.conf:
spark.shuffle.consolidateFiles true
spark.shuffle.manager SORT
spark.akka.threads 4
spark.sql.inMemoryColumnarStorage.compressed
Currently we are using Hive in some products; however, it seems Spark SQL
may be a better choice. Is there any official comparison between them? Thanks a
lot!
Hi Paolo,
The custom classes and jars are distributed across the Spark cluster via an
HTTP server on the master when the absolute path of the application fat jar is
specified in the spark-submit script. The Advanced Dependency Management
section on
Which version of Spark and Hadoop are you using? Could you please provide
the full stack trace of the exception?
On Tue, Oct 28, 2014 at 5:48 AM, Du Li l...@yahoo-inc.com.invalid wrote:
Hi,
I was trying to set up Spark SQL on a private cluster. I configured a
hive-site.xml under
Hi,
I am running a standalone Spark cluster, with 2 workers that each have 2 cores.
I submit one Spark application to the cluster, and I monitor the execution
process via the UI (both worker-ip:8081 and master-ip:4040).
There I can see that the application is handled by many executors; in my
case one worker
You can refer to comparisons between different SQL-on-Hadoop solutions such as
Hive, Spark SQL, Shark, Impala and so on.
There are two main benchmarks, which may not be entirely objective, for your
reference:
Cloudera benchmark:
Hi,
I got the following exceptions when using Spray client to write to OpenTSDB
using its REST API.
Exception in thread pool-10-thread-2 java.lang.NoSuchMethodError:
akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext;
It worked locally in my Intellij but failed when I
Hi,
I have downloaded the binary spark distribution.
When building the package with sbt package I get the following:
[root@nlvora157 ~]# sbt package
[info] Set current project to Simple Project (in build file:/root/)
[info] Updating {file:/root/}root...
[info] Resolving
Maybe you have a wrong sbt proxy configuration.
2014-10-28 18:27 GMT+08:00 nl19856 hanspeter.sl...@gmail.com:
Hi,
I have downloaded the binary spark distribution.
When building the package with sbt package I get the following:
[root@nlvora157 ~]# sbt package
[info] Set current project to
Sigh!
Sorry I did not read the error message properly.
2014-10-28 11:39 GMT+01:00 Yanbo Liang [via Apache Spark User List]
ml-node+s1001560n17478...@n3.nabble.com:
Maybe you had wrong configuration of sbt proxy.
2014-10-28 18:27 GMT+08:00 nl19856 [hidden email]
Your proxy/DNS could be blocking it.
Thanks
Best Regards
On Tue, Oct 28, 2014 at 4:06 PM, Yanbo Liang yanboha...@gmail.com wrote:
Maybe you had wrong configuration of sbt proxy.
2014-10-28 18:27 GMT+08:00 nl19856 hanspeter.sl...@gmail.com:
Hi,
I have downloaded the binary spark
Add my message.
On Tue, Oct 28, 2014 at 3:22 PM, Sasi [via Apache Spark User List]
ml-node+s1001560n17471...@n3.nabble.com wrote:
Thank you Akhil. You are correct, it's about overlapping Thrift libraries.
We have taken reference from
The number of tasks is decided by the number of input partitions.
If you want only one map or flatMap at a time, just call coalesce() or
repartition() to consolidate the data into one partition.
However, this is not recommended because it will not execute in parallel
efficiently.
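For concreteness, here is a minimal, self-contained sketch of that suggestion (the data and the map function are just illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object SinglePartitionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SinglePartitionExample"))
    val data = sc.parallelize(1 to 1000, 8)   // starts out with 8 partitions
    // coalesce(1) merges everything into one partition, so only one map task runs at a time;
    // repartition(1) does the same but performs a full shuffle.
    val onePartition = data.coalesce(1)
    println(s"partitions = ${onePartition.partitions.length}, count = ${onePartition.map(_ * 2).count()}")
    sc.stop()
  }
}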
2014-10-28 17:27 GMT+08:00
Try to increase the driver memory.
2014-10-28 17:33 GMT+08:00 Zhanfeng Huo huozhanf...@gmail.com:
Hi,friends:
I use spark(spark 1.1) sql operate data in hive-0.12, and the job fails
when data is large. So how to tune it ?
spark-defaults.conf:
spark.shuffle.consolidateFiles true
Hi Arpit,
Try this:
val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions =
numPartitions, edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
Best,
Yifan LI
On 28 Oct 2014, at 11:17, Arpit Kumar arp8...@gmail.com
But I guess that this creates only one task over all the cluster's nodes. I would
like to run several tasks, but I would like Spark not to run more than one map
at each of my nodes at a time. That means I would like to, let's say, have 4
different tasks and 2 nodes where each node has 2 cores.
Hi Yifan LI,
I am currently working on Spark 1.0, in which we can't pass edgeStorageLevel
as a parameter. It implicitly caches the edges, so I am looking for a
workaround.
http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.graphx.GraphLoader$
Regards,
Arpit
On Tue, Oct 28,
Hi,
I have cloned Spark as:
git clone g...@github.com:apache/spark.git
cd spark
sbt/sbt compile
Everything seems to go smooth until :
[info] downloading
https://repo1.maven.org/maven2/org/ow2/asm/asm-tree/5.0.3/asm-tree-5.0.3.jar
...
[info] [SUCCESSFUL ]
It's not very difficult to implement by properly setting the application's
parameters.
Some basic knowledge you should know:
an application can have only one executor on each machine or container
(YARN).
So just set executor-cores to 1, and then each executor will run only one
task at a time.
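A minimal sketch of that idea (whether these properties are honoured depends on the cluster manager and Spark version, so treat this as an assumption rather than a recipe):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: limit each executor to one concurrent task and cap the total cores.
val conf = new SparkConf()
  .setAppName("OneTaskPerExecutor")
  .set("spark.executor.cores", "1")   // at most one task runs at a time per executor
  .set("spark.cores.max", "4")        // total cores for the application (standalone mode)
val sc = new SparkContext(conf)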
2014-10-28
Hi,
Did anyone try to replace a GigaSpaces master/worker implementation with a
Spark standalone or Hadoop driven implementation?
I guess I am looking to find out what the pros and cons are, and whether people
have tried it on the production side (grid or Hadoop).
Regards,
--
Aleksandar Kacanski
An application can have only one executor on each machine or container
(YARN).
How many threads each executor has is determined by the parameter
executor-cores.
There is also another way of setting this: you can specify total-executor-cores,
and each executor's cores will be determined
Yes, it can import org.apache.spark.mllib.rdd.RDDFunctions, but you cannot
use any method in this class or even construct an object of this class. So I
infer that if you import org.apache.spark.mllib.rdd.RDDFunctions._, it may
call some method of that object.
2014-10-28 17:29 GMT+08:00 Stephen Boesch
Hi Flavio,
Doing batch += ... shouldn't work. It will create a new batch for each
element in myRDD (also, val initializes an immutable variable; var is
for mutable variables). You can use something like accumulators:
http://spark.apache.org/docs/latest/programming-guide.html#accumulators.
val
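A minimal, self-contained sketch of the accumulator approach (the RDD contents and the condition are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AccumulatorExample"))
    val myRDD = sc.parallelize(1 to 100)      // stand-in for the real data
    val matched = sc.accumulator(0)           // Spark 1.x accumulator API
    myRDD.foreach { x =>
      if (x % 2 == 0) matched += 1            // safe to increment inside the closure
    }
    println(s"matched = ${matched.value}")    // read the value only on the driver
    sc.stop()
  }
}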
I am not sure if it will work on Spark 1.0, but give it a try.
Or, maybe you can try:
1) constructing the edge and vertex RDDs with the desired storage
level, and
2) then obtaining a graph by using Graph(verticesRDD, edgesRDD), as sketched below.
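A minimal sketch of that suggestion (the vertex and edge data are just toy values):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.storage.StorageLevel

object GraphFromRDDs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphFromRDDs"))
    // Build the vertex and edge RDDs yourself, persisted at the level you want.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
      .persist(StorageLevel.MEMORY_AND_DISK)
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
      .persist(StorageLevel.MEMORY_AND_DISK)
    val graph = Graph(vertices, edges)   // graph built from the pre-persisted RDDs
    println(graph.triplets.count())
    sc.stop()
  }
}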
Best,
Yifan LI
On 28 Oct 2014, at 12:10, Arpit Kumar
Dear All,
I am using the connected components function of GraphX (on Spark 1.0.2) on a
graph. However, for some reason it fails with a StackOverflowError. The graph
is not too big; it contains 1 vertices and 50 edges. Can anyone
help me avoid this error? Below is the output of Spark:
Is this what you are looking for?
In Shark, default reducer number is 1 and is controlled by the property
mapred.reduce.tasks. Spark SQL deprecates this property in favor
of spark.sql.shuffle.partitions, whose default value is 200. Users may customize
this property via SET:
SET
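For example, a minimal sketch of setting it from Spark SQL (the table name is hypothetical, and a HiveContext is assumed):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("ShufflePartitionsExample"))
val sqlContext = new HiveContext(sc)
sqlContext.sql("SET spark.sql.shuffle.partitions=10")   // default is 200
// Subsequent queries that shuffle (joins, aggregations) use 10 reduce partitions.
val counts = sqlContext.sql("SELECT key, count(*) FROM my_table GROUP BY key")
counts.collect().foreach(println)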
I am trying to set up Apache Spark on a small standalone cluster (1 master
node and 8 slave nodes). I have installed the pre-built version of Spark
1.1.0 built on top of Hadoop 2.4. I have set up passwordless SSH between
nodes and exported a few necessary environment variables. One of these
Spark can definitely very quickly answer queries like "give me all
transactions with property x", and you can put an HTTP query server in front
of it and run queries concurrently.
But Spark does not support inserts, updates, or fast random-access lookups.
This is because RDDs are immutable and
Thanks for the useful comment. But I guess this setting applies only when I
use Spark SQL, right? Is there any similar setting for Spark?
best,
/Shahab
On Tue, Oct 28, 2014 at 2:38 PM, Wanda Hawk wanda_haw...@yahoo.com wrote:
Is this what are you looking for ?
In Shark, default reducer
Can you please elaborate? I didn't get what you intended for me to read in
that link.
Regards.
On Mon, Oct 20, 2014 at 7:03 PM, Saurabh Wadhawan
saurabh.wadha...@guavus.com wrote:
What about:
Hi there, I'm trying to use window operations on streaming, but every time
I perform a windowed computation, I stop getting results.
For example:
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
Will print the output to stdout at each 'batch duration' interval. Now if I
replace it
We have some very large datasets where the calculations converge on a result.
Our current implementation allows us to track how quickly the calculations
are converging and end the processing early. This can significantly speed up
some of our processing.
Is there a way to do the same thing is
Dear Spark Community,
Is it possible to convert text files (.log or .txt files) into
sequencefiles in Python?
Using PySpark I can create a parallelized file with
rdd=sc.parallelize([('key1', 1.0)]) and I can save it as a sequencefile
with rdd.saveAsSequenceFile(). But how can I put the whole
Thanks Akhil,
So there is no tool that I can use, right? My program overloads some
operators for some operations on images. I need the result to be accurate.
I will try the approach you offered.
Thanks.
Is there a repo or some kind of instructions on how to install sbt for CentOS?
Thanks,
Hello,
I have a Spark app which I run with master local[3]. When running without
any persist calls, it seems to work fine, but as soon as I add persist calls
(at default storage level), it fails at the first persist call with the message
below. Unfortunately, I can't post the code.
Hello,
Given the following example customers.json file:
{
  "name": "Sherlock Holmes",
  "customerNumber": 12345,
  "address": {
    "street": "221b Baker Street",
    "city": "London",
    "zipcode": "NW1 6XE",
    "country": "United Kingdom"
  }
},
{
  "name": "Big Bird",
  "customerNumber": 10001,
  "address": {
    "street": "123 Sesame Street",
    "city":
Hi all - I've simplified the code so now I'm literally feeding in 200
million ratings directly to ALS.train. Nothing else is happening in the
program.
I've also tried with both the regular serializer and the KryoSerializer.
With Kryo, I get the same ArrayIndex exceptions.
With the regular
You should use foreachPartition, and take care to open and close your
connection following the pattern described in:
http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E
Within a partition, you iterate over
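A minimal sketch of that pattern (the Connection trait and createConnection are hypothetical stand-ins for whatever client library you use):

import org.apache.spark.rdd.RDD

// Hypothetical client API:
trait Connection { def send(record: String): Unit; def close(): Unit }
def createConnection(): Connection = ???   // open a connection to the external store

def saveToStore(rdd: RDD[String]): Unit =
  rdd.foreachPartition { partition =>
    val connection = createConnection()     // one connection per partition, opened on the worker
    partition.foreach(record => connection.send(record))
    connection.close()                      // closed once the whole partition has been written
  }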
Have you read this ?
http://lancegatlin.org/tech/centos-6-install-sbt
On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto rpagli...@appcomsci.com
wrote:
Is there a repo or some kind of instruction about how to install sbt for
centos?
Thanks,
If you're just calling sbt from within the spark/sbt folder, it should
download and install automatically.
Nick
On Tuesday, October 28, 2014, Ted Yu yuzhih...@gmail.com wrote:
Have you read this ?
http://lancegatlin.org/tech/centos-6-install-sbt
On Tue, Oct 28, 2014 at 7:54 AM, Pagliari, Roberto
sbt is just a jar file. So you really don't need to install anything. Once
you run the jar file (sbt-launch.jar) it can download the required
dependencies.
I use an executable script called sbt that has the following contents.
SBT_OPTS=-Xms1024M -Xmx2048M -Xss1M -XX:+CMSClassUnloadingEnabled
Hi all,
I'm having trouble troubleshooting this particular block of code for Spark
Streaming and saving to Cassandra:
val lines = ssc.socketTextStream(args(0), args(1).toInt,
  StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x,
The following line of code indicates the constructor is not defined. The
only examples of JdbcRDD usage I can find are Scala examples. Does this work
in Java? Are there any examples? Thanks.
JdbcRDD<Integer> rdd = new JdbcRDD<Integer>(sp, () ->
ods.getConnection(), sql,
Looks like you're having some classpath issues.
Are you providing your spark-cassandra-driver classes to your job?
sparkConf.setJars(Seq(jars...)) ?
On Tue, Oct 28, 2014 at 5:34 PM, Harold Nguyen har...@nexgate.com wrote:
Hi all,
I'm having trouble troubleshooting this particular block of
Hi Csaba,
It sounds like the API you are looking for is sc.wholeTextFiles :)
Cheers,
Holden :)
On Tuesday, October 28, 2014, Csaba Ragany rag...@gmail.com wrote:
Dear Spark Community,
Is it possible to convert text files (.log or .txt files) into
sequencefiles in Python?
Using PySpark I
In Spark, certain functions have an optional parameter to determine the
number of partitions (distinct, textFile, etc.). You can also use the
coalesce() or repartition() functions to change the number of partitions
for your RDD. Thanks.
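A brief sketch of both options (the path is a placeholder, and a SparkContext named sc is assumed):

// Many operations take an optional number of partitions:
val lines    = sc.textFile("/path/to/data", 8)   // at least 8 input partitions
val distinct = lines.distinct(4)                 // result held in 4 partitions
// Or change the partitioning afterwards:
val fewer = lines.coalesce(2)                    // narrow dependency, no shuffle
val more  = lines.repartition(16)                // full shuffle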
On Oct 28, 2014 9:58 AM, shahab shahab.mok...@gmail.com
So, I'm using IntelliJ 13.x and Scala Spark jobs.
Make sure you have singletons (objects, not classes), then simply debug the
main function. You’ll need to set your master to some derivation of “local”,
but that's it. Spark Streaming is kinda wonky when debugging, but data-at-rest
behaves
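A minimal sketch of such a debuggable entry point (the job body is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// An object (not a class) whose main method runs entirely inside the IDE's JVM,
// so ordinary breakpoints work.
object DebugJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DebugJob").setMaster("local[3]"))
    val counts = sc.parallelize(Seq("a", "b", "a")).map(x => (x, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)   // set a breakpoint here and step through
    sc.stop()
  }
}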
Also, I'm following two master's students at the University of Liège (one
computing conditional probability density on massive data and the other
implementing a Markov chain method on georasters). I proposed that they use
the Spark-Notebook to learn the framework, and they're quite happy with it (so
far at
Doing cleanup in an iterator like that assumes the iterator always gets
fully read, which is not necessarily the case (for example, RDD.take does not).
Instead I would use mapPartitionsWithContext, in which case you can write a
function of the form
f: (TaskContext, Iterator[T]) => Iterator[U]
now
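A rough sketch of that shape (openResource and process are hypothetical, and the exact completion-callback API, addOnCompleteCallback here, varies between Spark versions, so treat it as an assumption):

import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

// Hypothetical resource used inside each task:
trait Resource { def close(): Unit }
def openResource(): Resource = ???
def process(r: Resource, record: String): String = ???

def withCleanup(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitionsWithContext { (context: TaskContext, iter: Iterator[String]) =>
    val resource = openResource()
    // register cleanup with the task context so it runs even if the iterator is not fully read
    context.addOnCompleteCallback(() => resource.close())
    iter.map(record => process(resource, record))
  }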
I’ve been puzzled by this lately. I too would like to use the thrift server to
provide JDBC style access to datasets via SparkSQL. Is this possible? The
examples show temp tables created during the lifetime of a SparkContext. I
assume I can use SparkSQL to query those tables while the
Can this be done? Can I just spin up a SparkContext programmatically, point
this to my yarn-cluster and this works like spark-submit?? Doesn’t (at least)
the application JAR need to be distributed to the workers via HDFS or the like
for the jobs to run?
mn
On Oct 28, 2014, at 2:29 AM,
The Spark tutorial shows that we can create a stream that reads new files
from a directory.
That seems to have some lag time, as we have to write the data to a file first
and then wait until Spark Streaming picks it up.
What is the best way to implement real 'real-time' streaming for analysis in
A real-time stream, by definition, delivers data every X seconds. You can
easily do this with Spark. Roughly, here is the way to create a stream
gobbler and attach a Spark app to read its data every X seconds:
- Write a Runnable thread which reads data from a source. Test that it
works
Thanks Jay. Do you think Spark is a good fit for streaming and
analyzing videos in real time? In this case, we're streaming 30 frames per
second, and each frame is an image (size: roughly 500 KB - 1 MB). We need to
analyze every frame and return the analysis result instantly in real
That declaration looks OK for Java 8, at least when I tried it just
now against master. The only thing I see wrong here is that getInt throws an
exception, which means the lambda has to be more complicated than this.
This is Java code calling the constructor, so yes, it can work fine
from Java (8).
On
http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf
Here's the answer I got from Akka's user ML.
This looks like a binary incompatibility issue. As far as I know Spark is
using a custom built Akka and Scala for various reasons.
You should ask this on the Spark mailing list, Akka is binary compatible
between major versions (2.3.6 is compatible
Hi all,
The following works fine when submitting dependency jars through
Spark-Shell:
./bin/spark-shell --master spark://ip-172-31-38-112:7077 --jars
Hey Jim,
There are some experimental (unstable) APIs that support running jobs
which might short-circuit:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1126
This can be used for doing online aggregations like you are
describing. And in one
I haven't learned Scala yet, so as you might imagine I'm having challenges
working with Spark from the Java API. For one thing, it seems very limited in
comparison to Scala. I ran into a problem really quickly. I need to hydrate an
RDD from JDBC/Oracle, so I wanted to use the JdbcRDD. But that
I am using Hadoop 2.5.0.3 and Spark 1.1. My local Hive version is 0.12.3, whose
hcatalog.jar is included in the path. The stack trace is as follows:
14/10/28 18:24:24 WARN ipc.Client: Exception encountered while connecting to
the server :
Hi Oliver,
Thanks for the answer. I don't have the information about all keys beforehand.
The reason I want to have multiple tables is that, based on my information about a
known key, I will apply different queries to get the results for that particular
key. I don't want to touch the unknown ones; I'll save that
If I put all the jar files from my local Hive at the front of the Spark classpath,
a different error is reported, as follows:
14/10/28 18:29:40 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: PLAIN auth failed: null
at
Hi Ron,
Whatever API you have in Scala, you can possibly use from Java. Scala is
interoperable with Java and vice versa. Scala, being both object-oriented
and functional, will make your job easier on the JVM, and it is more concise than
Java. Take it as an opportunity and start learning Scala ;).
Any suggestions guys??
Hi Harold,
It seems like, based on your previous post, you are using one version of the
connector as a dependency yet building the assembly jar from master? You were
using 1.1.0-alpha3 (you can upgrade to alpha4, beta coming this week) yet your
assembly is
A pretty large fraction of users use Java, but a few features are still not
available in it. JdbcRDD is one of them -- this functionality will likely be
superseded by Spark SQL when we add JDBC as a data source. In the meantime, to
use it, I'd recommend writing a class in Scala that has
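A hedged sketch of the kind of Scala helper that reply describes (the JDBC URL, query and row mapping are placeholders; the query must contain two '?' placeholders for the partition bounds):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

// Wraps JdbcRDD's Scala function arguments behind a plain method that Java code can call.
object JdbcHelper {
  def loadInts(sc: SparkContext, url: String, sql: String,
               lower: Long, upper: Long, numPartitions: Int): JdbcRDD[Int] =
    new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url),
      sql,
      lower, upper, numPartitions,
      (rs: ResultSet) => rs.getInt(1))
}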
I interpret this to mean you have to learn Scala in order to work with Spark in
Scala (goes without saying) and also to work with Spark in Java (since you have
to jump through some hoops for basic functionality).
The best path here is to take this as a learning opportunity and sit down and
The overridable methods of RDD are marked as @DeveloperApi, which means that
these are internal APIs used by people that might want to extend Spark, but are
not guaranteed to remain stable across Spark versions (unlike Spark's public
APIs).
BTW, if you want a way to do this that does not