Hello Manolis,
I'm a new subscriber to this mailing list as well, and I read on the Apache
web page that one can begin by following these mailing lists and helping out
other new users by pointing them to the right documentation, or maybe by going
through some documentation yourself in order to answer
I don't think it's just memory overhead. It might be better to use an executor
with less heap space (30 GB?). 46 GB would mean more data loaded into memory and
more GC, which can cause issues.
Also, have you tried to persist data in any way? If so, then that might be
causing an issue.
Lastly, I
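(For reference, a minimal sketch of that sizing advice as a spark-shell invocation; the numbers are illustrative assumptions, not recommendations:

./spark-shell --executor-memory 30g --executor-cores 4 --num-executors 12

i.e. more, smaller executors rather than one 46 GB heap, so each JVM has less to garbage-collect.)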
Hi all,
My name is Manolis Gemeliaris, I'm a software engineering student and I'm
willing to contribute to the Apache Spark Project. I don't have any prior
experience with contributing to open source.
I have prior experience with Java, R (just a little) and Python (just a
little) and
We have an 8-node Cassandra cluster. Replication strategy: 3. Consistency
level: QUORUM. Data spread: I can let you know once I get access to our
production cluster.
The use case for a simple count is more for internal use than, say, end
clients/customers; however, there are many use cases from customers
I am not sure what use case you want to demonstrate with select count in
general. Maybe you can elaborate more on what your use case is.
Aside from this: this is a Cassandra issue. What is the setup of Cassandra?
Dedicated nodes? How many? Replication strategy? Consistency configuration? How
is
In the above setup my executors start one Docker container per task. Some of
these containers grow in memory as data is piped. Eventually there is not
enough memory on the machine for the Docker containers to run (since YARN
has already started its containers), and everything starts failing.
The way I'm
So if the process you're communicating with from Spark isn't launched inside
of its YARN container then it shouldn't be an issue - although it sounds
like you may have multiple resource managers on the same machine, which
can sometimes lead to interesting/difficult states.
On Thu, Nov 24, 2016 at
Ok, that makes sense for processes directly launched via fork or exec from
the task.
However, in my case it is the Docker daemon that starts the new
process. This process runs in a Docker container. Will the container use
memory from the YARN executor memory overhead as well? How will YARN know
Hi, I'm trying to broadcast a map of 2.6 GB but I'm getting a weird Kryo
exception.
I tried to set -XX:hashCode=0 in the executor and driver, following this
comment:
https://github.com/broadinstitute/gatk/issues/1524#issuecomment-189368808
But it didn't change anything.
Are you aware of this
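(For reference, a minimal sketch of how those settings can be passed at launch time; whether they help here is unclear, and the Kryo buffer setting at the end is simply another knob that is often relevant for multi-GB payloads, with an illustrative value:

./spark-shell \
  --conf "spark.driver.extraJavaOptions=-XX:hashCode=0" \
  --conf "spark.executor.extraJavaOptions=-XX:hashCode=0" \
  --conf spark.kryoserializer.buffer.max=2047m

Driver JVM options have to be supplied before the driver starts, i.e. on the command line or in spark-defaults.conf, not from inside the running application.)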
Some accurate numbers here: it took me 1 hr 30 mins to count 698705723
rows (~700 million),
and my code is just this:
sc.cassandraTable("cuneiform", "blocks").cassandraCount
On Thu, Nov 24, 2016 at 10:48 AM, kant kodali wrote:
> Take a look at this
Try setting spark.yarn.executor.memoryOverhead 1
On Thu, Nov 24, 2016 at 11:16 AM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:
> Hi Spark users
>
> I am running a job that does join of a huge dataset (7 TB+) and the
> executors keep crashing randomly, eventually causing the job to
Take a look at this https://github.com/brianmhess/cassandra-count
Now it is just a matter of incorporating it into spark-cassandra-connector, I
guess.
On Thu, Nov 24, 2016 at 1:01 AM, kant kodali wrote:
> According to this link https://github.com/datastax/
>
Dataset/dataframes will use direct/raw/off-heap memory in the most
efficient columnar fashion. Trying to fit the same amount of data in heap
memory would likely increase your memory requirement and decrease the
speed.
So, in short, don't worry about it and increase overhead. You can also set
a
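(A minimal sketch of the knobs mentioned above, as launch-time options; the sizes are illustrative assumptions:

./spark-shell \
  --conf spark.yarn.executor.memoryOverhead=6144 \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=4g

memoryOverhead is the extra per-container allowance YARN grants beyond the JVM heap; the offHeap settings let Spark's own execution/storage memory live outside the heap.)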
Hi, the source file I have is on my local machine and it's pretty huge, like 150
GB. How do I go about it?
On Sun, Nov 20, 2016 at 8:52 AM, Steve Loughran
wrote:
>
> On 19 Nov 2016, at 17:21, vr spark wrote:
>
> Hi,
> I am looking for scala or python
Here is the Scala code I use to get the best model; I have never used Java:
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
val cvModel = cv.fit(data)                                    // runs cross-validation over the param grid
val plmodel = cvModel.bestModel.asInstanceOf[PipelineModel]   // best pipeline found
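A small usage sketch to go with that, assuming a held-out DataFrame called testData (hypothetical name):

val predictions = plmodel.transform(testData)   // apply the winning pipeline to new data
predictions.show(5)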
Scala code is also fine for me, if there is some solution.
On Friday, November 25, 2016 1:27 AM, Zhiliang Zhu
wrote:
Hi All,
Here I want to print the specific tree or forest structure from the pipeline model.
However, it seems that I met more issues about
Hi All,
Here I want to print the specific tree or forest structure from the pipeline model.
However, it seems that I met more issues about XXXClassifier and
XXXClassificationModel,
as in the code below:
...
GBTClassifier gbtModel = new GBTClassifier();
ParamMap[] grid = new
Hi Xiaomeng,
Thanks very much for your comment, which is helpful for me.
However, it seems that I met more issues about XXXClassifier and
XXXClassificationModel, as in the code below:
...
GBTClassifier gbtModel = new GBTClassifier();
ParamMap[] grid = new ParamGridBuilder()
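Not a definitive answer, but a minimal Scala sketch of one way to get at the tree structure, assuming the fitted pipeline's last stage is the GBT model (the stage position and variable names are assumptions):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.GBTClassificationModel

val best = cvModel.bestModel.asInstanceOf[PipelineModel]
val gbt  = best.stages.last.asInstanceOf[GBTClassificationModel]
println(gbt.toDebugString)                       // full description of every tree in the ensemble
gbt.trees.zipWithIndex.foreach { case (t, i) =>
  println(s"Tree $i:\n" + t.toDebugString)       // per-tree structure
}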
Hi Spark users
I am running a job that does join of a huge dataset (7 TB+) and the
executors keep crashing randomly, eventually causing the job to crash.
There are no out of memory exceptions in the log and looking at the dmesg
output, it seems like the OS killed the JVM because of high memory
Hi,
Not sure whether this is the right place to discuss this issue.
I am running the following Hive query multiple times, with the execution engine as
Hive on Spark and as Hive on MapReduce.
With Hive on Spark: the result (count) was different on every execution.
With Hive on MapReduce: the result (count) was the same on
Thank you, Dale, I've realized in what situation this bug would be activated.
Actually, it seems that any user-defined class with dynamic fields (such as Map,
List...) cannot be used as a message, or it will be lost in the next supersteps. To
figure this out, I tried to deep-copy a new message object
I have two users (etl, dev) that start a Spark Thrift Server on the same machine. I
connected by beeline to the etl STS to execute a command, and it threw
org.apache.hadoop.security.AccessControlException. I don't know why it is performed as the
dev user, not etl.
Is it a Spark bug? I am using Spark 2.0.2.
Caused by:
The drop() function is in Scala, an attribute of Array, not in Spark.
Greetings,
I am using Spark 2.0.2 with Scala 2.11.7 and Hadoop 2.7.3. When I run
spark-submit local mode, I get a netty exception like the following. The
code runs fine with Spark 1.6.3, Scala 2.10.x and Hadoop 2.7.3.
16/11/24 08:18:24 ERROR server.TransportRequestHandler: Error sending result
I love working with the Python community & I've heard similar requests in
the past few months, so it's good to have a solid reason to try and add this
functionality :)
Just to be clear though, I'm not a Spark committer, so when I work on stuff,
getting it in is very much dependent on me finding a
thank u so much for this! Great to see that u listen to the community.
On Thu, Nov 24, 2016 at 12:10 PM, Holden Karau wrote:
> https://issues.apache.org/jira/browse/SPARK-18576
>
> On Thu, Nov 24, 2016 at 2:05 AM, Holden Karau
> wrote:
>
>> Cool -
https://issues.apache.org/jira/browse/SPARK-18576
On Thu, Nov 24, 2016 at 2:05 AM, Holden Karau wrote:
> Cool - thanks. I'll circle back with the JIRA number once I've got it
> created - will probably take awhile before it lands in a Spark release
> (since 2.1 has already
Cool - thanks. I'll circle back with the JIRA number once I've got it
created - will probably take a while before it lands in a Spark release
(since 2.1 has already branched) but better debugging information for
Python users is certainly important/useful.
On Thu, Nov 24, 2016 at 2:03 AM, Ofer
Since we can't work with log4j in PySpark executors, we built our own
logging infrastructure (based on logstash/elastic/kibana).
It would help to have the TID in the logs, so we can drill down accordingly.
On Thu, Nov 24, 2016 at 11:48 AM, Holden Karau wrote:
> Hi,
>
> The
YARN will kill your processes if the child processes you start via PIPE
consume too much memory; you can configure the amount of memory Spark
leaves aside for other processes besides the JVM in the YARN containers
with spark.yarn.executor.memoryOverhead.
On Wed, Nov 23, 2016 at 10:38 PM, Sameer
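A minimal sketch of where that setting goes relative to a piped child process; the overhead value and the piped command are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// Reserve extra non-JVM memory in each YARN container for child processes
// started with pipe(). The 4096 MB figure is an assumption, not a recommendation.
val conf = new SparkConf()
  .setAppName("pipe-overhead-example")
  .set("spark.yarn.executor.memoryOverhead", "4096")
val sc = new SparkContext(conf)

// The external command runs inside the same YARN container, so its memory is
// counted against executor memory + memoryOverhead.
val out = sc.parallelize(Seq("a", "b", "c")).pipe("cat").collect()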
Hi,
The TaskContext isn't currently exposed in PySpark but I've been meaning to
look at exposing at least some of TaskContext for parity in PySpark. Is
there a particular use case which you want this for? It would help with
crafting the JIRA :)
Cheers,
Holden :)
On Thu, Nov 24, 2016 at 1:39 AM,
Hi,
Is there a way to get in PySpark something like the TaskContext from code
running on an executor, like in Scala Spark?
If not, how can I know my task ID from inside the executors?
Thanks!
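For comparison, a minimal sketch of what the Scala side exposes today (the thread above is about adding an equivalent to PySpark); the RDD name is hypothetical:

import org.apache.spark.TaskContext

// Read task identifiers from inside an executor; 'rdd' is a hypothetical RDD[String].
val tagged = rdd.mapPartitions { iter =>
  val tc = TaskContext.get()
  iter.map(x => s"stage=${tc.stageId()} partition=${tc.partitionId()} task=${tc.taskAttemptId()} " + x)
}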
According to this link
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md
I tried the following but it still looks like it is taking forever
sc.cassandraTable(keyspace, table).cassandraCount
On Thu, Nov 24, 2016 at 12:56 AM, kant kodali
I would be glad if SELECT COUNT(*) FROM hello could return any value for that
size :) I can say for sure it didn't return anything for 30 mins and I
probably need to build more patience to sit for a few more hours after that!
Cassandra recommends using ColumnFamilyStats via nodetool cfstats, which
How fast is Cassandra without Spark on the count operation?
cqlsh> SELECT COUNT(*) FROM hello
(this is not equivalent to what you are doing, but it might help you find the
root cause)
On Thu, Nov 24, 2016 at 9:03 AM, kant kodali wrote:
> I have the following code
>
> I
I have the following code
I invoke spark-shell as follows
./spark-shell --conf spark.cassandra.connection.host=170.99.99.134
--executor-memory 15G --executor-cores 12 --conf
spark.cassandra.input.split.size_in_mb=67108864
code
scala> val df = spark.sql("SELECT test from hello") //