I am trying to import the spark-csv package while using the Scala Spark
shell. Spark 1.4.1, Scala 2.11.
I am starting the shell with:
bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0 --jars
../sjars/spark-csv_2.11-1.1.0.jar --master local
I then try to run
and get the
The command you ran and the error you got were not visible.
Mind sending them again?
Cheers
On Sun, Aug 2, 2015 at 8:33 PM, billchambers wchamb...@ischool.berkeley.edu
wrote:
I am trying to import the spark-csv package while using the Scala Spark
shell. Spark 1.4.1, Scala 2.11.
I am
Sure, the commands are:
scala> val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true").load("cars.csv")
and get the following error:
java.lang.RuntimeException: Failed to load class for data source:
com.databricks.spark.csv
at
I tried the following command on master branch:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 --jars
../spark-csv_2.10-1.0.3.jar --master local
I didn't reproduce the error with your command.
FYI
On Sun, Aug 2, 2015 at 8:57 PM, Bill Chambers
wchamb...@ischool.berkeley.edu
Hi,
I'm writing a Streaming application in Spark 1.3. After running for some
time, I get the following exception. I'm sure that no other process is
modifying the HDFS file. Any idea what might be the cause of this?
15/08/02 21:24:13 ERROR scheduler.DAGSchedulerEventProcessLoop:
Yes, I forgot to mention
I chose a prime number as the modulo for the hash function because my keys
are usually strings, and Spark computes a key's partition from the key hash
(see HashPartitioner.scala). So, to avoid a big number of collisions (many
keys landing in few partitions), it is common to use
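For illustration, a minimal sketch of that idea (the class name and the
choice of 127 are mine, not from the thread); it mirrors what
HashPartitioner does, just with a prime partition count:

    import org.apache.spark.Partitioner

    // Spread string keys across a prime number of partitions to reduce
    // collisions; keep the result non-negative, as HashPartitioner does.
    class PrimeHashPartitioner(partitions: Int) extends Partitioner {
      require(partitions > 0)
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val mod = key.hashCode % numPartitions
        if (mod < 0) mod + numPartitions else mod
      }
    }

    // e.g. pairRdd.partitionBy(new PrimeHashPartitioner(127))  // 127 is prime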
Hi,
reducing spark.storage.memoryFraction did the trick for me. The heap doesn't
fill up because that fraction is reserved.
My reasoning is:
I give the executor all the memory I can give it, so that sets the boundary.
From there I try to make the best use of the memory I can.
storage.memoryFraction is in a sense
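To make the knobs concrete, a sketch with illustrative values (assuming
Spark 1.x, where spark.storage.memoryFraction defaults to 0.6):

    import org.apache.spark.{SparkConf, SparkContext}

    // Fix the executor heap as the boundary, then shrink the fraction
    // reserved for cached blocks so the rest is usable for execution.
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")          // illustrative boundary
      .set("spark.storage.memoryFraction", "0.3")  // down from the 0.6 default
    val sc = new SparkContext(conf)  // master/app name come from spark-submit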
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA
to add this feature, and maybe in a future release it could be added.
Thanks
Best Regards
On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly moreill...@qub.ac.uk
wrote:
Hi,
I am currently working on the latest version of
Hi Barak,
It is OK with Spark 1.3.0; the problem is with Spark 1.4.1.
I don't think spark.storage.memoryFraction will make any difference, because
it still applies to heap memory.
------ Original Message ------
From: Barak Gitsis <bar...@similarweb.com>
Sent:
I guess it goes through those 500k files
(https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L193)
the first time and then uses a filter from the next time on.
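If that first scan hurts, fileStream also accepts a custom filter; a sketch
(directory, batch interval, and the filter itself are illustrative):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(30))
    // Only consider new files, and skip hidden/temporary ones.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///incoming",
      (path: Path) => !path.getName.startsWith("."),
      true)  // newFilesOnly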
Thanks
Best Regards
On Fri, Jul 31, 2015 at 4:39 AM, Tathagata Das
LOL Brandon!
@ziqiu See http://spark.apache.org/community.html
You need to send an email to user-unsubscr...@spark.apache.org
Thanks
Best Regards
On Fri, Jul 31, 2015 at 2:06 AM, Brandon White bwwintheho...@gmail.com
wrote:
https://www.youtube.com/watch?v=JncgoPKklVE
On Thu, Jul 30, 2015
Spark uses a lot more than heap memory; it is the expected behavior.
In 1.4, off-heap memory usage is supposed to grow in comparison to 1.3.
Better to use as little memory as you can for the heap, and since you are not
utilizing it already, it is safe for you to reduce it.
memoryFraction helps you
Hi community,
I have run my k-means Spark application on 1 million data points. The
program works, but no output is generated in HDFS. When it runs on
10,000 points, an output is written.
Maybe someone has an idea?
Best regards,
Paul
Can you provide some more detail:
- release of Spark you're using
- were you running in standalone or YARN cluster mode
- have you checked the driver log?
Cheers
On Sun, Aug 2, 2015 at 7:04 AM, Pa Rö paul.roewer1...@googlemail.com
wrote:
Hi community,
I have run my k-means Spark application on
I agree with Ted. Could you please post the log file?
On Aug 2, 2015 10:13 AM, Ted Yu yuzhih...@gmail.com wrote:
Can you provide some more detail:
- release of Spark you're using
- were you running in standalone or YARN cluster mode
- have you checked the driver log?
Cheers
On Sun, Aug 2, 2015 at
I think your use case can already be implemented with HDFS encryption and/or
SealedObject, if you are looking for something like Altibase.
If you create a JIRA you may want to set the bar a little bit higher and
propose something like MIT's CryptDB: https://css.csail.mit.edu/cryptdb/
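For a sense of what the SealedObject route looks like, a rough sketch (the
AES setup is illustrative, and key distribution is not handled here):

    import javax.crypto.{Cipher, KeyGenerator, SealedObject}

    val keyGen = KeyGenerator.getInstance("AES")
    keyGen.init(128)
    val secretKey = keyGen.generateKey()

    // Encrypt a record at serialization time; decrypt later with
    // sealed.getObject(secretKey).
    def seal(record: String): SealedObject = {
      val cipher = Cipher.getInstance("AES")
      cipher.init(Cipher.ENCRYPT_MODE, secretKey)
      new SealedObject(record, cipher)
    }

    // e.g. rdd.map(seal).saveAsObjectFile("hdfs:///secure/out")  // hypothetical path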
On Fri, Jul 31, 2015 at 10:17 AM,
spark.storage.memoryFraction applies to heap memory, but my situation is that
the memory used is more than the heap!
Is anyone else using Spark 1.4.1 in production?
------ Original Message ------
From: Ted Yu <yuzhih...@gmail.com>
Sent: Sunday, August 2, 2015, 5:45
To:
What kind of cluster? How many cores on each worker? Is there a config for
the HTTP Solr client? I remember the standard HttpClient has a limit per
route/host.
On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote:
This may seem like a silly question… but in following Mark’s link, the
presentation talks about the TPC-DS benchmark.
Here’s my question… what benchmark results?
If you go over to the TPC.org (http://tpc.org/) website, they have no TPC-DS
benchmarks listed (either audited or unaudited).
So
I'm trying to process a bunch of large JSON log files with Spark, but it
fails every time with `scala.MatchError`, whether I give it a schema or not.
I just want to skip lines that do not match the schema, but I can't find how
in the Spark docs.
I know I could write a JSON parser and map it over the JSON file RDD
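One possible workaround along those lines (a sketch, not from the docs;
assumes Spark 1.4 and the json4s library Spark already bundles, path
hypothetical):

    import scala.util.Try
    import org.json4s.jackson.JsonMethods.parse

    // Drop lines that fail to parse as JSON before handing the rest
    // to read.json, so one bad record doesn't fail the whole job.
    val raw = sc.textFile("hdfs:///logs/*.json")
    val valid = raw.filter(line => Try(parse(line)).isSuccess)
    val df = sqlContext.read.json(valid)

Note this guards against malformed JSON rather than schema mismatches, but
the same filter-before-read shape applies.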
No one has any ideas?
Is there some more information I should provide?
I am looking for ways to increase the parallelism among workers. Currently
I just see the number of simultaneous connections to Solr equal to the number
of workers. My number of partitions is (2.5x) larger than the number of
workers,
So how many cores do you configure per node?
Do you have something like --total-executor-cores or maybe a --num-executors
config (I'm not sure what kind of cluster the Databricks platform provides;
if it's standalone then the first option should be used)? If you have 4 cores
in total, then even though you have
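For a standalone cluster the cap would look something like this (master URL
and values illustrative):

    bin/spark-submit --master spark://master:7077 \
      --total-executor-cores 16 \
      --executor-memory 48g \
      your-app.jar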
Hi Igor,
The cluster is a Databricks Spark cluster. It consists of 1 master + 4
workers, each worker has 60GB RAM and 4 CPUs. The original mail has some
more details (also the reference to the HttpSolrClient in there should be
HttpSolrServer, sorry about that, mistake while writing the email).
Can you share the transformations up to the foreachPartition?
From: Sujit Pal <sujitatgt...@gmail.com>
Sent: 8/2/2015 4:42 PM
To: Igor Berman <igor.ber...@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: How to increase parallelism of a Spark
Hi,
This might be a long shot, but has anybody run into very poor predictive
performance using RandomForest with MLlib? Here is what I'm doing (a rough
Scala sketch follows the list):
- Spark 1.4.1 with PySpark
- Python 3.4.2
- ~30,000 Tweets of text
- 12289 1s and 15956 0s
- Whitespace tokenization and then hashing trick for feature
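A rough Scala equivalent of that pipeline (the poster used PySpark; the
input file, its label format, and every parameter value here are
illustrative, not theirs):

    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest

    val tf = new HashingTF(1 << 16)  // hashing trick: 65k feature buckets
    val data = sc.textFile("labeled_tweets.tsv").map { line =>
      val Array(label, text) = line.split("\t", 2)  // "label<TAB>text" assumed
      LabeledPoint(label.toDouble, tf.transform(text.split("\\s+").toSeq))
    }
    // numClasses=2, no categorical features, 50 trees, auto feature subsets,
    // gini impurity, depth 8, 32 bins
    val model = RandomForest.trainClassifier(data, 2, Map[Int, Int](),
      50, "auto", "gini", 8, 32)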
I don't know if your assertion/expectation that workers will process multiple
partitions in parallel is really valid, or if having more partitions than
workers will necessarily help (unless you are memory bound, so partitions are
essentially helping your work size rather than
Spark uses a lot more than heap memory; it is the expected behavior. It
didn't exist in Spark 1.3.x.
What does "a lot more than" mean? It means that I lose control of it!
I gave it 31g, but it still grows to 55g and continues to grow! That is the
point!
I have tried setting memoryFraction to
What do the master logs show?
Best Regards,
Sonal
Founder, Nube Technologies
http://www.nubetech.co
Check out
On 1 Aug 2015, at 18:26, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:
If your network is bandwidth-bound, you may see that setting jumbo frames
(MTU 9000) increases bandwidth by up to ~20%.
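On Linux that is usually set per interface (interface name illustrative;
every hop, including the switch, must support MTU 9000 end to end):

    sudo ip link set dev eth0 mtu 9000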
On 2 Aug 2015, at 13:42, Sujit Pal <sujitatgt...@gmail.com> wrote:
There is no additional configuration on the external Solr host from my code;
I am using the default HttpClient provided by HttpSolrServer. According to
the Javadocs, you can pass in an HttpClient
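Something like the following should let you raise the per-route limit (a
sketch assuming SolrJ 4.x and Apache HttpClient 4.x; the URL and the limits
are illustrative):

    import org.apache.http.impl.client.DefaultHttpClient
    import org.apache.http.impl.conn.PoolingClientConnectionManager
    import org.apache.solr.client.solrj.impl.HttpSolrServer

    // HttpClient's default per-route limit is small, which can cap the
    // number of concurrent connections to a single Solr host.
    val cm = new PoolingClientConnectionManager()
    cm.setMaxTotal(100)
    cm.setDefaultMaxPerRoute(32)
    val solr = new HttpSolrServer("http://solr-host:8983/solr/collection1",
      new DefaultHttpClient(cm))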