Yes, it is possible. You need to use the jsonFile method on the SQLContext and
then create a DataFrame from the RDD. Then register it as a table. It should be
3 lines of code, thanks to Spark.
You may also find a few YouTube videos, especially on unifying pipelines.
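For illustration, a minimal sketch of those three lines, assuming the Spark 1.3-era API and an existing SparkContext sc (the path and table name are placeholders):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// jsonFile expects line-delimited JSON and returns a DataFrame directly
val df = sqlContext.jsonFile("/shared/path/data.json")
df.registerTempTable("events")
// The query then runs as a distributed job across the cluster
sqlContext.sql("SELECT count(*) FROM events").show()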
On 3 May 2015 19:02, Jai jai4l...@gmail.com wrote:
Hi,
I need to use Spark to load 500 GB of data from Hadoop on a standalone-mode
cluster. What are the minimum hardware requirements, given that it will be
used for advanced analysis (social network analysis)?
Hi,
I am a noob to Spark and related technologies.
I have JSON stored at the same location on all worker nodes of the Spark
cluster. I am looking to load this JSON data set on those nodes and run SQL
queries against it, like distributed SQL.
Is it possible to achieve this?
Right now, the master submits the task to one node only.
I don't know the full context of what you're doing, but serialization
errors usually mean you're attempting to serialize something that can't be
serialized, like the SparkContext. Kryo won't help there.
The arguments to spark-submit you posted previously look good:
2) --num-executors 96
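To make the usual failure mode concrete, a hypothetical sketch (class and field names are invented): capturing the SparkContext, directly or through an enclosing object, in a closure that ships to executors is what throws, and no serializer setting fixes it.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class Pipeline(sc: SparkContext) {
  // Fails: the closure captures `this`, which holds the non-serializable SparkContext
  def bad(rdd: RDD[Int]): RDD[Int] =
    rdd.map(x => x + sc.defaultParallelism)

  // Works: copy the needed value to a local val so the closure captures only an Int
  def good(rdd: RDD[Int]): RDD[Int] = {
    val p = sc.defaultParallelism
    rdd.map(x => x + p)
  }
}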
I'm currently trying to join two large tables (order 1B rows each) using
Spark SQL (1.3.0) and am running into long GC pauses which bring the job to
a halt.
I'm reading in both tables using a HiveContext with the underlying files
stored as Parquet Files. I'm using something along the lines of
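Presumably the setup is roughly like this sketch (table and column names are invented for illustration):

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
// Both inputs are Hive tables backed by Parquet files
val orders = hc.table("warehouse.orders")
val items = hc.table("warehouse.line_items")
// A 1B x 1B equi-join; the shuffle buffers here are what drive GC pressure
val joined = orders.join(items, orders("order_id") === items("order_id"))
joined.saveAsParquetFile("/tmp/joined")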
great thx
On Sat, May 2, 2015 at 23:58, Ted Yu yuzhih...@gmail.com wrote:
This is coming in 1.4.0
https://issues.apache.org/jira/browse/SPARK-7280
On May 2, 2015, at 2:27 PM, Olivier Girardot ssab...@gmail.com wrote:
Sounds like a patch for a drop method...
On Sat, May 2, 2015 at 21:03,
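For reference, a sketch of the difference (df and the column name are illustrative):

// Spark 1.3 workaround: select every column except the unwanted one
val trimmed13 = df.select(df.columns.filter(_ != "unwanted").map(df.col): _*)
// Spark 1.4+ (SPARK-7280): drop it directly
val trimmed14 = df.drop("unwanted")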
Yes, that's it. If a partition is lost, some steps will need to be re-executed
to recompute it, perhaps including the map function in which you update the
accumulator.
I think you can do it more safely in a transformation near the action,
where it is less likely that an error will occur (not always
Yes, correct.
However, note that when an accumulator operation is *idempotent*, meaning
that repeated application to the same data behaves exactly like a single
application, then that accumulator can be safely updated in transformation
steps (non-actions), too.
For example, max and min tracking. Just
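A minimal sketch of such an idempotent accumulator using the Spark 1.x AccumulatorParam API (assuming an RDD[Long]; re-applying the same update cannot change a running maximum):

import org.apache.spark.AccumulatorParam

object MaxParam extends AccumulatorParam[Long] {
  def zero(initial: Long): Long = Long.MinValue
  // max is idempotent: merging the same value twice gives the same result
  def addInPlace(a: Long, b: Long): Long = math.max(a, b)
}

val maxSeen = sc.accumulator(Long.MinValue)(MaxParam)
// Safe inside a transformation even if the stage is re-executed
val doubled = rdd.map { x => maxSeen += x; x * 2 }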
How big is the data you're returning to the driver with collectAsMap? You
are probably running out of memory trying to copy too much data back to it.
If you're trying to force a map-side join, Spark can do that for you in
some cases within the regular DataFrame/RDD context. See
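When the small side genuinely fits in memory, a hand-rolled map-side join looks roughly like this sketch (the pair-RDD names and types are illustrative):

// Collect the small side once, then broadcast it to every executor
val small: Map[Long, String] = smallRdd.collectAsMap().toMap
val smallBc = sc.broadcast(small)
// The join happens map-side: no shuffle of the big RDD, and nothing
// large is copied back to the driver
val joined = bigRdd.flatMap { case (k, v) =>
  smallBc.value.get(k).map(s => (k, (v, s)))
}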
IMHO, you are trying waaay too hard to optimize work on what is really a
small data set. 25G, even 250G, is not that much data, especially if you've
spent a month trying to get something to work that should be simple. All
these errors are from optimization attempts.
Kryo is great, but if it's not
See https://issues.apache.org/jira/browse/SPARK-5492, but I think
you'll need to share the stack trace, as I'm not sure how this can
happen: a NoSuchMethodError (not NoSuchMethodException)
indicates a call in the bytecode failed to link, but there is only a
call by reflection here.
On Fri, May 1,
The official documentation says: "In transformations, users should be aware of
that each task's update may be applied more than once if tasks or job stages
are re-executed."
I don't quite understand what this means. Does it mean that if I use the
accumulator in transformations (i.e., a map() operation), this
Given the lazy nature of an RDD, if you use an accumulator inside a map()
and then call count() and saveAsTextFile() on that RDD, the accumulator
update will be applied twice. IMHO, accumulators are a bit nondeterministic;
you need to be sure when to read them to avoid unexpected re-executions.
On 3/5/2015 2:09 p.
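A concrete sketch of that double count (names are illustrative):

val acc = sc.accumulator(0)
val mapped = rdd.map { x => acc += 1; x }

mapped.count()                    // runs the map: acc == rdd.count()
mapped.saveAsTextFile("/tmp/out") // re-runs the map (nothing cached): acc doubles
// Calling mapped.cache() before the first action avoids the recomputation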
Looking at SQLContext.scala (in master branch), jsonFile() returns
DataFrame directly:
def jsonFile(path: String, samplingRatio: Double): DataFrame =
FYI
On Sun, May 3, 2015 at 2:14 AM, ayan guha guha.a...@gmail.com wrote:
Yes, it is possible. You need to use the jsonFile method on the SQLContext
Hi,
I am running several jobs in standalone mode, and I notice this error in the
log files on some of my nodes at the start of my jobs:
INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for
[TERM, HUP, INT]
INFO spark.SecurityManager: Changing view acls to: root
INFO
“For accumulator updates performed inside actions only, Spark guarantees that
each task’s update to the accumulator will only be applied once, i.e.
restarted tasks will not update the value. In transformations, users should
be aware of that each task’s update may be applied more than once if tasks
Thanks Andrew. What version of HS2 is the SparkSQL thrift server using?
What would be involved in updating? Is it a simple case of bumping the
dependency version in one of the project POMs?
Cheers,
~N
On Sat, May 2, 2015 at 11:38 AM, Andrew Lee alee...@hotmail.com wrote:
Hi N,
See:
Hello Dean and others,
Thanks for your suggestions.
I have two data sets, and all I want to do is a simple equi-join. I have a 10G
limit, and as my dataset_1 exceeded that, it was throwing an OOM error. Hence I
switched back to using the .join() API instead of a map-side broadcast join.
I am repartitioning the data
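For reference, a minimal sketch of that reduce-side path (key and value types are invented; the partition count is just a knob, not a recommendation):

import org.apache.spark.rdd.RDD

// Plain shuffle (reduce-side) equi-join on pair RDDs keyed by the join key
def equiJoin(left: RDD[(Long, String)],
             right: RDD[(Long, String)]): RDD[(Long, (String, String))] =
  // An explicit partition count spreads the shuffle and lowers per-task memory
  left.join(right, 400)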
Note that each JSON object has to be on a single line in the files.
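That is, the input must be line-delimited JSON, as in this sketch (file contents shown in the comments; the path is a placeholder and sqlContext is assumed to exist):

// people.json must contain one complete object per line:
//   {"name": "alice", "age": 30}
//   {"name": "bob", "age": 25}
// A single pretty-printed object spanning several lines will surface as
// _corrupt_record rows instead of parsed columns.
val people = sqlContext.jsonFile("/shared/path/people.json")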
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com
On
Hello Dean,
If I don't use the Kryo serializer I get a serialization error, and hence I am
using it.
If I don't use partitionBy/repartition then the simple join never completed
even after 7 hours, and in fact as the next step I need to run it against 250G,
as that is my full dataset size. Someone here suggested
Friendly reminder on this one. Just wanted to get a confirmation that this
is not by design before I logged a JIRA
Thanks!
Ali
On Tue, Apr 28, 2015 at 9:53 AM, Ali Bajwa ali.ba...@gmail.com wrote:
Hi experts,
Trying to use the slicing functionality in strings as part of a Spark
program
Hi, I am using Spark 1.3.1 to read a directory of about 2000 Avro files. The
Avro files are from a third party, and a few of them are corrupted.
val path = "<my directory of Avro files>"
val sparkConf = new SparkConf().setAppName("avroDemo").setMaster("local")
val sc = new SparkContext(sparkConf)