Re: New Contributor

2016-11-24 Thread K. Omair Muhi
Hello Manolis, I'm a new subscriber to this mailing list as well, and I read on the Apache web page that one can begin by following these mailing lists and helping out other new users: pointing them to the right documentation, or maybe going through some documentation yourself in order to answer

RE: OS killing Executor due to high (possibly off heap) memory usage

2016-11-24 Thread Shreya Agarwal
I don't think it's just memory overhead. It might be better to use an executor with less heap space (30 GB?). 46 GB would mean more data loaded into memory and more GC, which can cause issues. Also, have you tried to persist data in any way? If so, then that might be causing an issue. Lastly, I

New Contributor

2016-11-24 Thread Manolis Gemeliaris
Hi all, my name is Manolis Gemeliaris. I'm a software engineering student and I'm willing to contribute to the Apache Spark project. I don't have any prior experience with contributing to open source. I have prior experience with Java, R (just a little) and Python (just a little) and

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
We have an 8-node Cassandra cluster. Replication strategy: 3. Consistency level: QUORUM. Data spread: I can let you know once I get access to our production cluster. The use case for a simple count is more for internal use than for end clients/customers; however, there are many use cases from customers

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread Jörn Franke
I am not sure what use case you want to demonstrate with select count in general. Maybe you can elaborate more on what your use case is. Aside from this: this is a Cassandra issue. What is the setup of Cassandra? Dedicated nodes? How many? Replication strategy? Consistency configuration? How is

Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Sameer Choudhary
In the above setup my executors start one Docker container per task. Some of these containers grow in memory as data is piped. Eventually there is not enough memory on the machine for the Docker containers to run (since YARN has already started its containers), and everything starts failing. The way I'm

Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Holden Karau
So if the process you're communicating with from Spark isn't launched inside of its YARN container then it shouldn't be an issue - although it sounds like you may have multiple resource managers on the same machine, which can sometimes lead to interesting/difficult states. On Thu, Nov 24, 2016 at

Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Sameer Choudhary
Ok, that makes sense for processes directly launched via fork or exec from the task. However, in my case the Docker daemon, not the task, starts the new process. This process runs in a Docker container. Will the container use memory from the YARN executor memory overhead as well? How will YARN know

Kryo Exception: NegativeArraySizeException

2016-11-24 Thread Pedro Tuero
Hi, I'm trying to broadcast a map of 2.6 GB but I'm getting a weird Kryo exception. I tried to set -XX:hashCode=0 on the executors and the driver, following this comment: https://github.com/broadinstitute/gatk/issues/1524#issuecomment-189368808 But it didn't change anything. Are you aware of this
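
Kryo serializes into a byte-array-backed buffer, and Java arrays are capped at about 2 GB, so a 2.6 GB payload can overflow the buffer size into a negative number (spark.kryoserializer.buffer.max itself cannot be raised past 2048m). A minimal workaround sketch, assuming the payload (here bigMap, a hypothetical Map[String, Long]) can be split into several smaller broadcasts:

  // Split one oversized broadcast into shards, each comfortably under
  // Kryo's ~2 GB serialized-array limit (numShards = 4 is illustrative).
  val numShards = 4
  val shardSize = math.ceil(bigMap.size.toDouble / numShards).toInt
  val broadcasts = bigMap.toSeq.grouped(shardSize).map(s => sc.broadcast(s.toMap)).toSeq

  // Executor-side lookup across the shards.
  def lookup(key: String): Option[Long] =
    broadcasts.flatMap(_.value.get(key)).headOption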

[no subject]

2016-11-24 Thread Rostyslav Sotnychenko

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
Some accurate numbers here: it took me 1 hr 30 min to count 698705723 rows (~700 million), and my code is just this: sc.cassandraTable("cuneiform", "blocks").cassandraCount On Thu, Nov 24, 2016 at 10:48 AM, kant kodali wrote: > Take a look at this
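
For readers following along, a self-contained sketch of that pushed-down count with the imports it needs; it assumes the spark-cassandra-connector JAR is on the classpath, and reuses the connection host quoted later in this thread:

  import org.apache.spark.{SparkConf, SparkContext}
  import com.datastax.spark.connector._  // brings in cassandraTable() and cassandraCount()

  val conf = new SparkConf()
    .setAppName("cassandra-count")
    .set("spark.cassandra.connection.host", "170.99.99.134")
  val sc = new SparkContext(conf)

  // cassandraCount() asks Cassandra to count per token range instead of
  // pulling every row into Spark and counting there.
  val n: Long = sc.cassandraTable("cuneiform", "blocks").cassandraCount()
  println(s"rows: $n")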

Re: OS killing Executor due to high (possibly off heap) memory usage

2016-11-24 Thread Rodrick Brown
Try setting spark.yarn.executor.memoryOverhead 1 On Thu, Nov 24, 2016 at 11:16 AM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Hi Spark users > > I am running a job that does join of a huge dataset (7 TB+) and the > executors keep crashing randomly, eventually causing the job to

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
Take a look at this: https://github.com/brianmhess/cassandra-count Now it is just a matter of incorporating it into spark-cassandra-connector, I guess. On Thu, Nov 24, 2016 at 1:01 AM, kant kodali wrote: > According to this link https://github.com/datastax/ >

Re: spark sql jobs heap memory

2016-11-24 Thread Rohit Karlupia
Datasets/DataFrames will use direct/raw/off-heap memory in the most efficient columnar fashion. Trying to fit the same amount of data in heap memory would likely increase your memory requirement and decrease the speed. So, in short, don't worry about it and increase the overhead. You can also set a

Re: convert local tsv file to orc file on distributed cloud storage (openstack).

2016-11-24 Thread vr spark
Hi, the source file I have is on a local machine and it's pretty huge, like 150 GB. How do I go about it? On Sun, Nov 20, 2016 at 8:52 AM, Steve Loughran wrote: > > On 19 Nov 2016, at 17:21, vr spark wrote: > > Hi, > I am looking for scala or python
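
A minimal sketch of the conversion itself, assuming Spark 2.x; the input path, header option, and the swift:// output URL are placeholders (a local file is only visible to executors that run on the same machine, e.g. in local mode):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("tsv-to-orc").getOrCreate()

  // Read the tab-separated file; schema inference and the header row are assumptions.
  val df = spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv("file:///data/huge.tsv")

  // Write ORC out to the object store (placeholder URL).
  df.write.orc("swift://container.provider/huge-orc")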

Re: how to see Pipeline model information

2016-11-24 Thread Xiaomeng Wan
Here is the Scala code I use to get the best model (I never used Java): val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new RegressionEvaluator).setEstimatorParamMaps(paramGrid) val cvModel = cv.fit(data) val plmodel = cvModel.bestModel.asInstanceOf[PipelineModel]
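
The same snippet with the imports it needs and one extra line to inspect the result; pipeline, paramGrid and data are assumed to be defined as in the message:

  import org.apache.spark.ml.PipelineModel
  import org.apache.spark.ml.evaluation.RegressionEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new RegressionEvaluator)
    .setEstimatorParamMaps(paramGrid)
  val cvModel = cv.fit(data)

  // bestModel is typed Model[_]; cast back to PipelineModel to reach its stages.
  val plmodel = cvModel.bestModel.asInstanceOf[PipelineModel]
  plmodel.stages.foreach(stage => println(stage.getClass.getSimpleName))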

Re: get specific tree or forest structure from pipeline model

2016-11-24 Thread Zhiliang Zhu
Scala code is also fine for me, if there is some solution. On Friday, November 25, 2016 1:27 AM, Zhiliang Zhu wrote: Hi All, I want to print the specific tree or forest structure from a pipeline model. However, it seems that I have run into more issues with

get specific tree or forest structure from pipeline model

2016-11-24 Thread Zhiliang Zhu
Hi All, I want to print the specific tree or forest structure from a pipeline model. However, it seems that I have run into more issues with XXXClassifier and XXXClassificationModel, as in the code below: ...        GBTClassifier gbtModel = new GBTClassifier();        ParamMap[] grid = new
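
A hedged sketch, in Scala like the other snippets in this thread, of one way to get at the trees once training succeeds; the assumption that the GBT stage is the last stage of the fitted PipelineModel (cvModel as in the earlier snippets) is mine:

  import org.apache.spark.ml.PipelineModel
  import org.apache.spark.ml.classification.GBTClassificationModel

  val pm = cvModel.bestModel.asInstanceOf[PipelineModel]
  val gbt = pm.stages.last.asInstanceOf[GBTClassificationModel]

  // Full forest: every tree with its splits and leaf predictions.
  println(gbt.toDebugString)

  // Or walk the individual trees and their weights.
  gbt.trees.zipWithIndex.foreach { case (tree, i) =>
    println(s"tree $i: weight=${gbt.treeWeights(i)}, depth=${tree.depth}, nodes=${tree.numNodes}")
  }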

Re: how to see Pipeline model information

2016-11-24 Thread Zhiliang Zhu
Hi Xiaomeng, Thanks very much for your comment, which is helpful for me. However, it seems that I have run into more issues with XXXClassifier and XXXClassificationModel, as in the code below: ...        GBTClassifier gbtModel = new GBTClassifier();        ParamMap[] grid = new ParamGridBuilder()

OS killing Executor due to high (possibly off heap) memory usage

2016-11-24 Thread Aniket Bhatnagar
Hi Spark users, I am running a job that does a join of a huge dataset (7 TB+) and the executors keep crashing randomly, eventually causing the job to crash. There are no out-of-memory exceptions in the log, and looking at the dmesg output, it seems like the OS killed the JVM because of high memory

Hive on Spark is not populating correct records

2016-11-24 Thread Vikash Pareek
Hi, not sure whether this is the right place to discuss this issue. I am running the following Hive query multiple times with the execution engine as Hive on Spark and as Hive on MapReduce. With Hive on Spark: results (counts) were different on every execution. With Hive on MapReduce: results (counts) were the same on

Re: GraphX Pregel not update vertex state properly, cause messages loss

2016-11-24 Thread 吴 郎
Thank you, Dale, I've realized in what situation this bug would be activated. Actually, it seems that any user-defined class with dynamic fields (such as Map, List...) cannot be used as a message, or it'll be lost in the next supersteps. To work around this, I tried to deep-copy a new message object
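
To illustrate the workaround being described, a sketch where the Pregel message type is an immutable Map, so a reused message object can never be mutated between supersteps; the graph (a Graph[Map[VertexId, Int], Int]) and the aggregation logic are hypothetical:

  import org.apache.spark.graphx._

  val result = graph.pregel(Map.empty[VertexId, Int], maxIterations = 10)(
    // vertex program: ++ builds a new map rather than mutating state in place
    (id, attr, msg) => attr ++ msg,
    // send program: emit a fresh immutable message per edge
    triplet => Iterator((triplet.dstId, Map(triplet.srcId -> 1))),
    // merge program: combining two messages also allocates a new map
    (a, b) => a ++ b
  )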

multiple Spark Thrift Servers running in the same machine throws org.apache.hadoop.security.AccessControlException

2016-11-24 Thread 谭 成灶
I have two users (etl, dev) that start a Spark Thrift Server on the same machine. I connected via beeline to the etl STS to execute a command, and it threw org.apache.hadoop.security.AccessControlException. I don't know why it was performed as the dev user, not etl. Is it a Spark bug? I am using Spark 2.0.2. Caused by:

Re: Fwd: Spark SQL: ArrayIndexOutofBoundsException

2016-11-24 Thread cossy
The drop() function is from Scala, a method on Array; it is not from Spark.

io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError:

2016-11-24 Thread Karthik Shyamsunder
Greetings, I am using Spark 2.0.2 with Scala 2.11.7 and Hadoop 2.7.3. When I run spark-submit in local mode, I get a netty exception like the following. The code runs fine with Spark 1.6.3, Scala 2.10.x and Hadoop 2.7.3. 16/11/24 08:18:24 ERROR server.TransportRequestHandler: Error sending result

Re: PySpark TaskContext

2016-11-24 Thread Holden Karau
I love working with the Python community & I've heard similar requests in the past few months, so it's good to have a solid reason to try and add this functionality :) Just to be clear though, I'm not a Spark committer, so when I work on stuff, getting it in is very much dependent on me finding a

Re: PySpark TaskContext

2016-11-24 Thread Ofer Eliassaf
Thank you so much for this! Great to see that you listen to the community. On Thu, Nov 24, 2016 at 12:10 PM, Holden Karau wrote: > https://issues.apache.org/jira/browse/SPARK-18576 > > On Thu, Nov 24, 2016 at 2:05 AM, Holden Karau > wrote: > >> Cool -

Re: PySpark TaskContext

2016-11-24 Thread Holden Karau
https://issues.apache.org/jira/browse/SPARK-18576 On Thu, Nov 24, 2016 at 2:05 AM, Holden Karau wrote: > Cool - thanks. I'll circle back with the JIRA number once I've got it > created - will probably take a while before it lands in a Spark release > (since 2.1 has already

Re: PySpark TaskContext

2016-11-24 Thread Holden Karau
Cool - thanks. I'll circle back with the JIRA number once I've got it created - it will probably take a while before it lands in a Spark release (since 2.1 has already branched), but better debugging information for Python users is certainly important/useful. On Thu, Nov 24, 2016 at 2:03 AM, Ofer

Re: PySpark TaskContext

2016-11-24 Thread Ofer Eliassaf
Since we can't work with log4j in PySpark executors, we built our own logging infrastructure (based on Logstash/Elastic/Kibana). It would help to have the TID in the logs, so we can drill down accordingly. On Thu, Nov 24, 2016 at 11:48 AM, Holden Karau wrote: > Hi, > > The

Re: Yarn resource utilization with Spark pipe()

2016-11-24 Thread Holden Karau
YARN will kill your processes if the child processes you start via pipe() consume too much memory. You can configure the amount of memory Spark leaves aside for other processes besides the JVM in the YARN containers with spark.yarn.executor.memoryOverhead. On Wed, Nov 23, 2016 at 10:38 PM, Sameer
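
A concrete example of that setting, mirroring the spark-shell invocations elsewhere in this digest; the numbers are purely illustrative, not a recommendation (memoryOverhead is in MiB, and the YARN container ends up at roughly executor-memory plus overhead):

  # Leave headroom in each container for the piped child processes.
  spark-submit \
    --master yarn \
    --executor-memory 8G \
    --conf spark.yarn.executor.memoryOverhead=4096 \
    your-job.jar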

Re: PySpark TaskContext

2016-11-24 Thread Holden Karau
Hi, the TaskContext isn't currently exposed in PySpark, but I've been meaning to look at exposing at least some of TaskContext for parity in PySpark. Is there a particular use case you want this for? It would help with crafting the JIRA :) Cheers, Holden :) On Thu, Nov 24, 2016 at 1:39 AM,

PySpark TaskContext

2016-11-24 Thread ofer
Hi, is there a way in PySpark to get something like TaskContext from code running on an executor, like in Scala Spark? If not, how can I know my task ID from inside the executors? Thanks!
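
For reference, a minimal sketch of the Scala-side API the question is asking for parity with (rdd here is any hypothetical RDD); TaskContext.get() is only valid inside task code running on executors:

  import org.apache.spark.TaskContext

  val tagged = rdd.mapPartitions { iter =>
    val tc = TaskContext.get() // non-null inside a running task
    val tag = s"stage=${tc.stageId()} partition=${tc.partitionId()} attempt=${tc.taskAttemptId()}"
    iter.map(x => s"$tag value=$x")
  }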

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
According to this link https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md I tried the following, but it still looks like it is taking forever: sc.cassandraTable(keyspace, table).cassandraCount On Thu, Nov 24, 2016 at 12:56 AM, kant kodali

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
I would be glad if SELECT COUNT(*) FROM hello could return any value for that size :) I can say for sure it didn't return anything for 30 mins, and I probably need to build more patience to sit for a few more hours after that! Cassandra recommends using ColumnFamilyStats via nodetool cfstats, which

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread Anastasios Zouzias
How fast is Cassandra without Spark on the count operation? cqlsh> SELECT COUNT(*) FROM hello (this is not equivalent to what you are doing, but it might help you find the root cause) On Thu, Nov 24, 2016 at 9:03 AM, kant kodali wrote: > I have the following code > > I

Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
I have the following code. I invoke spark-shell as follows: ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864 Code: scala> val df = spark.sql("SELECT test from hello") //
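
One thing worth double-checking in that invocation: if I'm reading the connector docs right, spark.cassandra.input.split.size_in_mb is interpreted in megabytes (the connector's default is 64), so 67108864 would request splits of roughly 64 TB and collapse the scan into very few tasks. A sketch of a more conventional invocation, values illustrative:

  ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 \
    --executor-memory 15G --executor-cores 12 \
    --conf spark.cassandra.input.split.size_in_mb=64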