You can directly write to HBase with Spark. Here's an example of doing
that: https://issues.apache.org/jira/browse/SPARK-944
Thanks
Best Regards
On Sat, Feb 14, 2015 at 2:55 PM, Su She suhsheka...@gmail.com wrote:
Hello Akhil, thank you for your continued help!
1) So, if I can write it in
Hi all,
I am new to Spark and seem to have hit a common newbie obstacle.
I have a pretty simple setup and job but I am unable to get past this error
when executing a job:
TaskSchedulerImpl: Initial job has not accepted any resources; check your
cluster UI to ensure that workers are registered
http://stackoverflow.com/questions/23527941/how-to-write-to-csv-in-spark
Just read this...seems like it should be easily readable. Thanks!
On Sat, Feb 14, 2015 at 1:36 AM, Su She suhsheka...@gmail.com wrote:
Thanks Akhil for the link. Is there a reason why there is a new directory
created
Simplest way would be to merge the output files at the end of your job like:
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
If you want to do it programmatically, then you can use the
FileUtil.copyMerge API, like:
FileUtil.copyMerge(FileSystem of source (hdfs), ...)
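A minimal runnable sketch of that call (the paths here are made up, and the merged file is kept on HDFS):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
// Merge every part file under the source directory into one destination file.
// The boolean controls whether the source directory is deleted afterwards.
FileUtil.copyMerge(fs, new Path("/output/dir/on/hdfs/"),
  fs, new Path("/output/merged/file.txt"),
  false, conf, null)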
We are planning to use servers of varying specs (32 GB, 64 GB, 244 GB RAM or even
higher, and varying core counts) for a standalone deployment of Spark, but we do
not know the spec of the server ahead of time and we need to script up some
logic that will run on the server on boot and automatically set the
AFAIK, this is the expected behavior. You have to make sure that the schema
matches the row. It won't give any error when you apply the schema as it
doesn't validate the nature of the data.
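A small sketch of that behavior (Spark 1.2-era SchemaRDD API; the schema and values are made up): applySchema goes through even though the "age" value is not an Int, and nothing complains until the column is actually evaluated.

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
// "thirty" is a String, which does not match IntegerType.
val rows = sc.parallelize(Seq(Row("alice", "thirty")))
val people = sqlContext.applySchema(rows, schema)   // no error raised here
people.registerTempTable("people")
// The mismatch only surfaces once an operation touches the column:
// sqlContext.sql("SELECT age + 1 FROM people").collect()   // fails at runtime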
Oops! I forgot to excerpt the errors and warnings from that file:
15/02/12 08:02:03 ERROR TaskSchedulerImpl: Lost executor 4 on
compute-0-3.wright: remote Akka client disassociated
15/02/12 08:03:00 WARN TaskSetManager: Lost task 1.0 in stage 28.0 (TID 37,
compute-0-1.wright):
Hello Akhil, thank you for your continued help!
1) So, if I can write it programmatically after every batch, then
technically I should be able to have just the csv files in one directory.
However, can the /desired/output/file.txt be in HDFS? If it can only be local,
I am not sure it will help me
Hi,
I am running a 3-node Spark cluster on EMR. While running a job I see only 1
executor running. Does that mean only 1 of the nodes is being used? (It seems,
from the Spark documentation, that it is running in the default mode (LOCAL).)
When I switch to yarn-client mode the Spark Web UI doesn't open. How do I view
the running job?
Thanks Akhil for the link. Is there a reason why there is a new directory
created for each batch? Is this a format that is easily readable by other
applications such as Hive/Impala?
On Sat, Feb 14, 2015 at 1:28 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You can directly write to HBase
Keep in mind that if you repartition to 1 partition, you are only
using 1 task to write the output, and potentially only 1 task to
compute some parent RDDs. You lose parallelism. The
files-in-a-directory output scheme is standard for Hadoop and for a
reason.
Therefore I would consider separating
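A tiny illustration of the trade-off (rdd and the paths are placeholders):

// Default: one part file per partition, written in parallel.
rdd.saveAsTextFile("hdfs:///output/foo-1001.csv")

// Single output file, but only one task does the write (and it may pull
// parent computation into that one task as well).
rdd.repartition(1).saveAsTextFile("hdfs:///output/foo-1001-single.csv")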
I'm getting low performance while parsing JSON data. My cluster setup is
Spark 1.2.0 with 10 nodes, each having 15 GB of memory and 4 cores.
I tried both scala.util.parsing.json.JSON and FasterXML's Jackson
parser.
This is basically what I do:
// Approach 1:
val jsonStream =
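A simplified sketch of the two approaches (lines stands in for the input DStream of JSON strings; the imports assume jackson-module-scala 2.x):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

// Approach 1: the built-in Scala parser.
val parsed1 = lines.map(record => scala.util.parsing.json.JSON.parseFull(record))

// Approach 2: Jackson with the Scala module (mapper built per record here).
val parsed2 = lines.map { record =>
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  mapper.readValue[Map[String, Any]](record)
}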
I don't think this is very specific to Spark. Spark is not a desktop
user application. While it is perfectly possible to run it on such a
distro, I would think a server distro is more appropriate.
On Sat, Feb 14, 2015 at 3:05 PM, Joanne Contact joannenetw...@gmail.com wrote:
Hi gurus,
I am
I see. I'd really benchmark how the parsing performs outside Spark (in a
tight loop or something). If *that* is slow, you know it's the parsing. If
not, it's not the parsing.
Another thing you want to look at is CPU usage. If the actual parsing
really is the bottleneck, you should see very high
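For example, a quick-and-dirty loop like this (file path and parser choice are placeholders) would time the parsing by itself:

import scala.io.Source

val records = Source.fromFile("/tmp/sample.json").getLines().toArray
val start = System.nanoTime()
records.foreach(r => scala.util.parsing.json.JSON.parseFull(r))
val elapsedMs = (System.nanoTime() - start) / 1e6
println(s"${records.length} records in $elapsedMs ms " +
  s"(~${elapsedMs / records.length} ms per record)")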
On YARN you need to first go to the ResourceManager UI, find your job, and
click the link for the application UI there.
From: Puneet Kumar Ojha puneet.ku...@pubmatic.com
Sent: Saturday, February 14, 2015 5:25 AM
To: user@spark.apache.org
Hi,
I am running 3
To get past this you can move the mapper creation code down into the closure.
It's then created on the worker node, so it doesn't need to be serialized.
// Parse it into a specific case class. We use flatMap to handle errors
// by returning an empty list (None) if we encounter an issue and a
//
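A rough sketch of what that can look like (the case class and field names are invented; lines is the input of JSON strings):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import scala.util.Try

case class Event(id: String, value: Double)   // invented record shape

val events = lines.mapPartitions { iter =>
  // Built inside the closure, on the worker, once per partition:
  // nothing needs to be serialized from the driver.
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  // flatMap over Option: records that fail to parse are simply dropped.
  iter.flatMap(record => Try(mapper.readValue[Event](record)).toOption)
}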
Ah my bad, it works without the serializable exception. But there isn't much of a
performance difference though.
Thanks
Best Regards
On Sat, Feb 14, 2015 at 7:45 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Thanks for the suggestion, but doing that gives me this exception:
Huh, that would come to 6.5 ms per JSON record. That does feel like a lot, but
if your JSON file is big enough, I guess you could get that sort of
processing time.
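(52 seconds for 8,000 records is 52,000 ms / 8,000 ≈ 6.5 ms per record.)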
Jackson is more or less the most efficient JSON parser out there, so unless
the Scala API is somehow affecting it, I don't see any better
Hi gurus,
I am trying to set up a real Linux machine (not a VM) where I will install Spark
and also Hadoop. I plan on learning about clusters.
I found Ubuntu has desktop and server versions. Does it matter?
Thanks!!
J
(adding back user)
Fair enough. Regarding the serialization exception, the hack I use is to have an
object with a transient lazy field, like so:
object Holder extends Serializable {
@transient lazy val mapper = new ObjectMapper()
}
This way, the ObjectMapper will be instantiated lazily on each worker the first
time it is used, so it never has to be serialized with the closure.
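Using it from a transformation might look like this (lines is a placeholder input):

// Holder itself is tiny and serializable; the transient lazy mapper is
// re-created on each executor the first time it is touched.
val parsed = lines.map(record => Holder.mapper.readTree(record))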
Thanks for the suggestion, but doing that gives me this exception:
http://pastebin.com/ni80NqKn
Over this piece of code:
object Holder extends Serializable {
@transient lazy val mapper = new ObjectMapper() with ScalaObjectMapper
mapper.registerModule(DefaultScalaModule)
}
Thanks again!
It's with the parser only; I just tried the parser
(https://gist.github.com/akhld/3948a5d91d218eaf809d) without Spark, and it
took 52 seconds to process 8k JSON records. Not sure if there's a more efficient
way to do this in Spark; I know if I use Spark SQL with SchemaRDD and all it
will be
For a beginner, Ubuntu Desktop is recommended as it includes a GUI and is easier
to install. Also refer to the ServerFaq page on the Ubuntu Community Help Wiki
(Frequently Asked Questions about the Ubuntu Server Edition).
Yes. Though for good performance it is usually important to make sure that
you have statistics for the smaller dimension tables. Today that can be
done by creating them in the Hive metastore and running ANALYZE TABLE
<tableName> COMPUTE STATISTICS noscan.
In Spark 1.3 this will happen automatically
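For example, something along these lines (the table name is made up; this assumes the dimension table is registered in the Hive metastore):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Gather basic statistics so the planner can pick a broadcast join
// for the small dimension table.
hiveContext.sql("ANALYZE TABLE small_dim COMPUTE STATISTICS noscan")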
I double-checked the 1.2 feature list and found out that the new sort-based
shuffle manager has nothing to do with HashPartitioner. Sorry for the
misinformation.
On the other hand, this may explain the increase in shuffle spill as a side effect
of the new shuffle manager; let me revert
Same problem here: shuffle write increased from 10 GB to over 64 GB. Since I'm
running on Amazon EC2, this always causes the temporary folder to consume all the
disk space. Still looking for a solution.
BTW, the 64 GB shuffle write is encountered when shuffling a pair RDD with
HashPartitioner, so it's not
Doing runtime type checking is very expensive, so we only do it when
necessary (i.e., when you perform an operation like adding two columns together).
On Sat, Feb 14, 2015 at 2:19 AM, nitin nitin2go...@gmail.com wrote:
AFAIK, this is the expected behavior. You have to make sure that the schema
Hi,
I could not reproduce this. What kind of JDK are you using for the zinc
server?
Regards,
Olivier.
2015-02-11 5:08 GMT+01:00 Yi Tian tianyi.asiai...@gmail.com:
Hi, all
I got an ERROR when I build spark master branch with maven (commit:
2d1e916730492f5d61b97da6c483d3223ca44315)
Would it make sense to add an optional validate parameter to applySchema()
which defaults to False, both to give users the option to check the schema
immediately and to make the default behavior clearer?
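In the meantime, a hand-rolled eager check is possible; a hypothetical helper (not a Spark API) could look like:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql._

// Fail fast if any Row has the wrong number of columns.
// (Per-field type checks could be added the same way, at extra cost.)
def validateRows(rows: RDD[Row], schema: StructType): Unit = {
  val width = schema.fields.length
  rows.foreach { row =>
    require(row.length == width,
      s"Expected $width columns but got ${row.length}: $row")
  }
}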
On Sat Feb 14 2015 at 9:18:59 AM Michael Armbrust mich...@databricks.com
wrote:
Doing
Thanks Sean and Akhil! I will take out the repartition(1). Please let me
know if I understood this correctly, Spark Streaming writes data like this:
foo-1001.csv/part-x, part-x
foo-1002.csv/part-x, part-x
When I see this on Hue, the CSVs appear to me as *directories*,
Okay, got it, thanks for the help Sean!
On Sat, Feb 14, 2015 at 1:08 PM, Sean Owen so...@cloudera.com wrote:
No, they appear as directories + files to everything. Lots of tools
are used to taking an input that is a directory of part files though.
You can certainly point MR, Hive, etc at a
No, they appear as directories + files to everything. Lots of tools
are used to taking an input that is a directory of part files though.
You can certainly point MR, Hive, etc at a directory of these files.
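For example, pointing Spark back at one of those directories treats all the part files inside as a single dataset (the path is made up):

// A directory of part-* files behaves like one input.
val merged = sc.textFile("hdfs:///output/foo-1001.csv")
merged.count()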
On Sat, Feb 14, 2015 at 9:05 PM, Su She suhsheka...@gmail.com wrote:
Thanks Sean and