Unsubscribe

2016-08-15 Thread 何琪

Re: Spark Yarn executor container memory

2016-08-15 Thread Jörn Franke
Both are part of the heap. > On 16 Aug 2016, at 04:26, Lan Jiang wrote: > > Hello, > > My understanding is that YARN executor container memory is based on > "spark.executor.memory" + “spark.yarn.executor.memoryOverhead”. The first one > is for heap memory and second one is

Unsubscribe

2016-08-15 Thread Sarath Chandra

Re: [ANNOUNCE] Apache Bahir 2.0.0

2016-08-15 Thread Mridul Muralidharan
Congratulations, great job everyone ! Regards, Mridul On Mon, Aug 15, 2016 at 2:19 PM, Luciano Resende wrote: > The Apache Bahir PMC is pleased to announce the release of Apache Bahir > 2.0.0 which is our first major release and provides the following > extensions for

Re: Apache Spark toDebugString producing different output for python and scala repl

2016-08-15 Thread Saisai Shao
The implementations of the RDD API in Python and Scala are slightly different, so the difference in the RDD lineage you printed is expected. On Tue, Aug 16, 2016 at 10:58 AM, DEEPAK SHARMA wrote: > Hi All, > > > Below is the small piece of code in scala and

Re: Apache Spark toDebugString producing different output for python and scala repl

2016-08-15 Thread DEEPAK SHARMA
Hi All, Below is the small piece of code in the Scala and Python REPL in Apache Spark. However, I am getting different output in both languages when I execute toDebugString. I am using the Cloudera QuickStart VM. PYTHON rdd2 =

Spark Yarn executor container memory

2016-08-15 Thread Lan Jiang
Hello, My understanding is that YARN executor container memory is based on "spark.executor.memory" + "spark.yarn.executor.memoryOverhead". The first one is for heap memory and the second one is for off-heap memory. The spark.executor.memory is used by -Xmx to set the max heap size. Now my question
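
A minimal sketch of the two settings involved, in PySpark (the values are placeholders; on YARN in Spark 1.x/2.0 the overhead defaults to max(384 MB, 10% of spark.executor.memory) when not set explicitly):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Placeholder values: the YARN container request is roughly
    # spark.executor.memory + spark.yarn.executor.memoryOverhead,
    # e.g. 8 GB heap (-Xmx8g) plus 1 GB overhead per executor container.
    conf = (SparkConf()
            .set("spark.executor.memory", "8g")
            .set("spark.yarn.executor.memoryOverhead", "1024"))  # MB

    spark = SparkSession.builder.config(conf=conf).getOrCreate()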

Re: read kafka offset from spark checkpoint

2016-08-15 Thread Cody Koeninger
No, you really shouldn't rely on checkpoints if you can't afford to reprocess from the beginning of your retention (or lose data and start from the latest messages). If you're in a real bind, you might be able to get something out of the serialized data in the checkpoint, but it'd probably be
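
A rough PySpark sketch of managing offsets yourself with the direct stream (this uses the spark-streaming-kafka-0-8 Python API; the topic name, broker and starting offsets are placeholders, and where the offsets get persisted, e.g. ZooKeeper or a database, is left to the application):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

    sc = SparkContext(appName="kafka-offsets-sketch")
    ssc = StreamingContext(sc, 10)

    # Starting offsets loaded from your own store (placeholder values).
    from_offsets = {TopicAndPartition("events", 0): 12345,
                    TopicAndPartition("events", 1): 67890}

    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "broker1:9092"},
        fromOffsets=from_offsets)

    offset_ranges = []

    def store_offset_ranges(rdd):
        # Capture the Kafka offset ranges of each batch.
        global offset_ranges
        offset_ranges = rdd.offsetRanges()
        return rdd

    def save_offsets(rdd):
        for o in offset_ranges:
            # Persist (topic, partition, untilOffset) somewhere durable so the
            # next run can pass them back in via fromOffsets.
            print(o.topic, o.partition, o.fromOffset, o.untilOffset)

    stream.transform(store_offset_ranges).foreachRDD(save_offsets)

    ssc.start()
    ssc.awaitTermination()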

Re: Sum array values by row in new column

2016-08-15 Thread Mike Metzger
Assuming you know the number of elements in the list, this should work: df.withColumn('total', df["_1"].getItem(0) + df["_1"].getItem(1) + df["_1"].getItem(2)) Mike On Mon, Aug 15, 2016 at 12:02 PM, Javier Rey wrote: > Hi everyone, > > I have one dataframe with one column
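
A self-contained PySpark sketch of the same idea (column name and values are made up), with a plain UDF as a fallback when the array length is not known up front:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [([10, 20, 30],), ([40, 50, 60],), ([70, 80, 90],)], ["numbers"])

    # Fixed-length arrays: add the elements positionally.
    fixed = df.withColumn("total", df["numbers"].getItem(0)
                                   + df["numbers"].getItem(1)
                                   + df["numbers"].getItem(2))

    # Variable-length arrays: a plain Python UDF summing each array.
    sum_udf = udf(lambda xs: sum(xs), IntegerType())
    variable = df.withColumn("total", sum_udf(df["numbers"]))

    fixed.show()
    variable.show()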

Re: [ANNOUNCE] Apache Bahir 2.0.0

2016-08-15 Thread Mridul Muralidharan
Congratulations, great job everyone ! Regards Mridul On Monday, August 15, 2016, Luciano Resende wrote: > The Apache Bahir PMC is pleased to announce the release of Apache Bahir > 2.0.0 which is our first major release and provides the following > extensions for Apache

Re: [ANNOUNCE] Apache Bahir 2.0.0

2016-08-15 Thread Chris Mattmann
Great work Luciano! On 8/15/16, 2:19 PM, "Luciano Resende" wrote: The Apache Bahir PMC is pleased to announce the release of Apache Bahir 2.0.0 which is our first major release and provides the following extensions for Apache Spark 2.0.0 : Akka

SizeEstimator for python

2016-08-15 Thread Maurin Lenglart
Hi, Is there a way to estimate the size of a dataframe in python? Something similar to https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/util/SizeEstimator.html ? thanks
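
There is no Python wrapper for SizeEstimator as far as I know; one rough workaround is to reach the JVM class through py4j, with the caveat that it measures the driver-side DataFrame object (essentially the plan), not the distributed data it represents:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 1000)

    # Calls org.apache.spark.util.SizeEstimator on the JVM-side Dataset object.
    # This is a sketch: it estimates the size of the object graph on the driver,
    # not the size of the data held in the executors.
    size_bytes = spark.sparkContext._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
    print(size_bytes)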

[ANNOUNCE] Apache Bahir 2.0.0

2016-08-15 Thread Luciano Resende
The Apache Bahir PMC is pleased to announce the release of Apache Bahir 2.0.0, which is our first major release and provides the following extensions for Apache Spark 2.0.0: Akka Streaming, MQTT Streaming and Structured Streaming, Twitter Streaming, ZeroMQ Streaming. For more information about

read kafka offset from spark checkpoint

2016-08-15 Thread Shifeng Xiao
Hi folks, We are using Kafka + Spark Streaming in our data pipeline, but sometimes we have to clean up the checkpoint from HDFS before we restart the Spark Streaming application, otherwise the application fails to start. That means we lose data when we clean up the checkpoint. Is there a way to read

Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-15 Thread Arun Luthra
I got this OOM error in Spark local mode. The error seems to have been at the start of a stage (all of the stages on the UI showed as complete; there were more stages to do but they had not shown up on the UI yet). There appears to be ~100G of free memory at the time of the error. Spark 2.0.0 200G

Re: How to do nested for-each loops across RDDs ?

2016-08-15 Thread Eric Ho
Thanks Daniel. Do you have any code fragments on using CoGroups or Joins across 2 RDDs ? I don't think that index would help much because this is an N x M operation, examining each cell of each RDD. Each comparison is complex as it needs to peer into a complex JSON On Mon, Aug 15, 2016 at 1:24

Re: How to do nested for-each loops across RDDs ?

2016-08-15 Thread Daniel Imberman
There's no real way of doing nested for-loops with RDDs because the whole idea is that you could have so much data in the RDD that it would be really ugly to store it all in one worker. There are, however, ways to handle what you're asking about. I would personally use something like CoGroup or
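
Two hedged PySpark sketches of what that can look like (the element layout and the 'match' test below are placeholders): a keyed join when the match condition reduces to key equality, and a cartesian product plus filter when every pair really has to be compared:

    from pyspark import SparkContext

    sc = SparkContext(appName="nested-loops-sketch")

    # Each element: (some complex JSON-like dict, some number) - placeholder data.
    rdd_a = sc.parallelize([({"id": 1, "v": "x"}, 10), ({"id": 2, "v": "y"}, 20)])
    rdd_b = sc.parallelize([({"id": 1, "w": "p"}, 30), ({"id": 3, "w": "q"}, 40)])

    # 1) If 'matches' boils down to equality on some key, key both RDDs and join;
    #    this avoids touching every N x M pair.
    joined = rdd_a.keyBy(lambda e: e[0]["id"]).join(rdd_b.keyBy(lambda e: e[0]["id"]))

    # 2) If every pair genuinely has to be examined, cartesian + filter is the
    #    direct translation of the nested loops (expensive: N x M comparisons).
    def matches(a, b):
        # Placeholder match logic.
        return a[1] == b[1] - 20

    pairs = rdd_a.cartesian(rdd_b).filter(lambda ab: matches(ab[0], ab[1]))

    print(joined.collect())
    print(pairs.collect())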

How to do nested for-each loops across RDDs ?

2016-08-15 Thread Eric Ho
I've nested foreach loops like this: for i in A[i] do: for j in B[j] do: append B[j] to some list if B[j] 'matches' A[i] in some fashion. Each element in A or B is some complex structure like: ( some complex JSON, some number ) Question: if A and B were represented as RDDs (e.g.

Re: Change nullable property in Dataset schema

2016-08-15 Thread Koert Kuipers
why do you want the array to have nullable = false? what is the benefit? On Wed, Aug 3, 2016 at 10:45 AM, Kazuaki Ishizaki wrote: > Dear all, > Would it be possible to let me know how to change nullable property in > Dataset? > > When I looked for how to change nullable

Re: Number of tasks on executors become negative after executor failures

2016-08-15 Thread Sean Owen
-dev (this is appropriate for user@) Probably https://issues.apache.org/jira/browse/SPARK-10141 or https://issues.apache.org/jira/browse/SPARK-11334 but those aren't resolved. Feel free to jump in. On Mon, Aug 15, 2016 at 8:13 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote:

Number of tasks on executors become negative after executor failures

2016-08-15 Thread Rachana Srivastava
Summary: I am running Spark 1.5 on CDH5.5.1. Under extreme load intermittently I am getting this connection failure exception and later negative executor in the Spark UI. Exception: TRACE: org.apache.hadoop.hbase.ipc.AbstractRpcClient - Call: Multi, callTime: 76ms INFO :

Re: how to do nested loops over 2 arrays but use Two RDDs instead ?

2016-08-15 Thread Jörn Franke
Depends on the size of the arrays, but is what you want to achieve similar to a join? > On 15 Aug 2016, at 20:12, Eric Ho wrote: > > Hi, > > I've two nested-for loops like this: > > for all elements in Array A do: > > for all elements in Array B do: > > compare

how to do nested loops over 2 arrays but use Two RDDs instead ?

2016-08-15 Thread Eric Ho
Hi, I've two nested-for loops like this: for all elements in Array A do: for all elements in Array B do: compare a[3] with b[4], see if they 'match' and, if they match, return that element; If I were to represent Arrays A and B as 2 separate RDDs, how would my code look? I couldn't find

Sum array values by row in new column

2016-08-15 Thread Javier Rey
Hi everyone, I have one dataframe with one column; this column is an array of numbers. How can I sum each array by row and obtain a new column with the sum, in pyspark? Example:

    +------------+
    |     numbers|
    +------------+
    |[10, 20, 30]|
    |[40, 50, 60]|
    |[70, 80, 90]|
    +------------+

The idea is to obtain

Re: class not found exception Logging while running JavaKMeansExample

2016-08-15 Thread Ted Yu
Logging has become private in 2.0 release: private[spark] trait Logging { On Mon, Aug 15, 2016 at 9:48 AM, subash basnet wrote: > Hello all, > > I am trying to run JavaKMeansExample of the spark example project. I am > getting the classnotfound exception error: > *Exception

class not found exception Logging while running JavaKMeansExample

2016-08-15 Thread subash basnet
Hello all, I am trying to run JavaKMeansExample of the spark example project. I am getting the classnotfound exception error: *Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging* at java.lang.ClassLoader.defineClass1(Native Method) at

Submitting jobs to YARN from outside EMR -- config & S3 impl

2016-08-15 Thread Everett Anderson
Hi, We're currently using an EMR cluster (which uses YARN) but submitting Spark jobs to it using spark-submit from different machines outside the cluster. We haven't had time to investigate using something like Livy yet. We also have a need to use a mix of

Re: call a mysql stored procedure from spark

2016-08-15 Thread Mich Talebzadeh
Well that is not the best way, as you have to wait for the RDBMS to process and populate the temp table. A more sound way would be to write a shell script that talks to the RDBMS first and creates and populates that table. Once ready, the same shell script can kick off the Spark job to read the temp table, which

Re: call a mysql stored procedure from spark

2016-08-15 Thread sujeet jog
Thanks Michael. As Michael and Ayan rightly said, yes, this stored procedure is invoked from the driver; it creates the temporary table in the DB. The reason being I want to load some specific data after processing it; I do not wish to bring it into Spark, instead I want to keep the processing at the DB level, later
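
A rough sketch of that flow from the driver (assuming the mysql-connector-python package is available on the driver and the MySQL JDBC driver is on the Spark classpath; the procedure name, staging table and connection details are all made up):

    import mysql.connector
    from pyspark.sql import SparkSession

    # 1) Ask MySQL to run the stored procedure that builds/populates the staging table.
    conn = mysql.connector.connect(host="dbhost", user="app", password="secret",
                                   database="appdb")
    cur = conn.cursor()
    cur.callproc("build_temp_results")   # hypothetical procedure name
    conn.commit()
    cur.close()
    conn.close()

    # 2) Only then read the now-populated table through the JDBC data source.
    spark = SparkSession.builder.getOrCreate()
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/appdb")
          .option("dbtable", "temp_results")    # hypothetical table name
          .option("user", "app")
          .option("password", "secret")
          .load())
    df.show()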

Re: parallel processing with JDBC

2016-08-15 Thread Madabhattula Rajesh Kumar
Hi Mich, Thank you. Regards, Rajesh On Mon, Aug 15, 2016 at 6:35 PM, Mich Talebzadeh wrote: > Ok Rajesh > > This is standalone. > > In that case it ought to be at least 4 connections as one executor will > use one worker. > > I am hesitant in here as you can see

Re: parallel processing with JDBC

2016-08-15 Thread Mich Talebzadeh
Ok Rajesh. This is standalone. In that case it ought to be at least 4 connections, as one executor will use one worker. I am hesitant here, hence the "(at least)", as with standalone mode you may end up with more executors on each worker. But try it and see whether "numPartitions" -> "4"

Re: Does Spark SQL support indexes?

2016-08-15 Thread Mich Talebzadeh
Brave and wise answer :)

Re: Does Spark SQL support indexes?

2016-08-15 Thread Gourav Sengupta
I think I have stirred up a hornet's nest here. If you are comfortable calling any faster way to access data an index, then that's fine. And everyone is, for the foreseeable future, going to continue to use indexes. When I think about reaching data faster, I just refer to the methods available

Re: Accessing HBase through Spark with Security enabled

2016-08-15 Thread Steve Loughran
On 15 Aug 2016, at 08:29, Aneela Saleem wrote: Thanks Jacek! I have already set the hbase.security.authentication property to kerberos, since HBase with Kerberos is working fine. I tested again after correcting the typo but got the same

Re: parallel processing with JDBC

2016-08-15 Thread Madabhattula Rajesh Kumar
Hi Mich, Thank you for the detailed explanation. One more question: in my cluster, I have one master and 4 workers. In this case, will 4 connections be opened to Oracle? Regards, Rajesh On Mon, Aug 15, 2016 at 3:59 PM, Mich Talebzadeh wrote: > It happens that the

Re: Does Spark SQL support indexes?

2016-08-15 Thread Mich Talebzadeh
My two cents. Indexes in any form and shape are there to speed up the query, whether it is a classical index (B-tree), a store-index (data and stats stored together) like Oracle Exalytics, SAP Hana, Hive ORC tables, or in-memory databases (hash index). Indexes are there to speed up the access path in

Linear regression, weights constraint

2016-08-15 Thread letaiv
Hi all, Is there any approach to add constraints on weights in linear regression? What I need is least squares regression with non-negative constraints on the coefficients/weights. Thanks in advance.

Re: parallel processing with JDBC

2016-08-15 Thread Mich Talebzadeh
It happens that the number of parallel processes opened from Spark to the RDBMS is determined by the number of executors. I just tested this. With Yarn client using two executors I see two connections to the RDBMS EXECUTIONS USERNAME SID SERIAL# USERS_EXECUTING SQL_TEXT -- --

Re: Does Spark SQL support indexes?

2016-08-15 Thread u...@moosheimer.com
So you mean HBase, Cassandra, Hana, Elasticsearch and so on do not use indexes? There might be some very interesting new concepts I've missed? Could you be more precise? ;-) Regards, Uwe On 15.08.2016 at 11:59, Gourav Sengupta wrote: > The world has moved on from indexes, materialized views,

RE: Does Spark SQL support indexes?

2016-08-15 Thread Ashic Mahtab
Guess the good people in the Cassandra world are stuck in the past making indexes, materialized views, etc. better with every release :)

Re: Does Spark SQL support indexes?

2016-08-15 Thread Mich Talebzadeh
Are you sure about that Gourav :)

Re: Does Spark SQL support indexes?

2016-08-15 Thread Gourav Sengupta
The world has moved on from indexes, materialized views, and other single-processor, non-distributed system algorithms. Nice that you are not asking questions regarding hierarchical file systems. Regards, Gourav On Sun, Aug 14, 2016 at 4:03 AM, Taotao.Li wrote: > > hi,

RE: Simulate serialization when running local

2016-08-15 Thread Ashic Mahtab
Thanks Miguel...will have a read. Thanks Jacek...that looks incredibly useful. :) Hi Ashic, Absolutely.

Re: spark ml : auc on extreme distributed data

2016-08-15 Thread Sean Owen
Class imbalance can be an issue for algorithms, but decision forests should in general cope reasonably well with imbalanced classes. By default, positive and negative classes are treated 'equally' however, and that may not reflect reality in some cases. Upsampling the under-represented case is a
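
A small PySpark sketch of that kind of upsampling (label values and the target ratio are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(0.0,)] * 95 + [(1.0,)] * 5, ["label"])

    minority = df.filter(df["label"] == 1.0)
    majority = df.filter(df["label"] == 0.0)

    # Sample the minority class with replacement until it is roughly the same
    # size as the majority class (fraction > 1 is allowed with replacement).
    ratio = majority.count() / float(minority.count())
    upsampled = minority.sample(withReplacement=True, fraction=ratio, seed=42)

    # union() on Spark 2.0; use unionAll() on 1.x.
    balanced = majority.union(upsampled)
    balanced.groupBy("label").count().show()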

Can not find usage of classTag variable defined in abstract class AtomicType in spark project

2016-08-15 Thread Andy Zhao
When I read the Spark source code, I found an abstract class AtomicType. It's defined like this:

    protected[sql] abstract class AtomicType extends DataType {
      private[sql] type InternalType
      private[sql] val tag: TypeTag[InternalType]
      private[sql] val ordering: Ordering[InternalType]
      @transient

Re: parallel processing with JDBC

2016-08-15 Thread Mich Talebzadeh
Hi. This is a very good question. I did some tests on this. If you are joining two tables then you are creating a result set based on some conditions. In this case what I normally do is to specify an ID column from either table and base my partitioning on that ID column. This is pretty
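
A hedged PySpark sketch of that ID-based partitioning (connection string, tables, column and bounds are placeholders, and the appropriate JDBC driver is assumed to be on the classpath; the bounds only control how the ID range is split across numPartitions queries, they do not filter rows):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.jdbc(
        url="jdbc:oracle:thin:@dbhost:1521:ORCL",
        table="(select t1.id, t1.name, t2.course from t1, t2 where t1.id = t2.id) q",
        column="ID",          # numeric column the range split is based on
        lowerBound=1,         # min(id) in the result set, or an estimate
        upperBound=1000000,   # max(id) in the result set, or an estimate
        numPartitions=4,      # four concurrent JDBC connections/queries
        properties={"user": "app", "password": "secret"})

    print(df.rdd.getNumPartitions())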

Re: Accessing HBase through Spark with Security enabled

2016-08-15 Thread Aneela Saleem
Thanks Jacek! I have already set the hbase.security.authentication property to kerberos, since HBase with Kerberos is working fine. I tested again after correcting the typo but got the same error. Following is the code, please have a look: System.setProperty("java.security.krb5.conf",

Re: parallel processing with JDBC

2016-08-15 Thread ayan guha
Hi, I would suggest you look at Sqoop as well. Essentially, you can provide a splitBy/partitionBy column, using which the data will be distributed among your stated number of mappers. On Mon, Aug 15, 2016 at 5:07 PM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi Mich, > > I have a

Re: parallel processing with JDBC

2016-08-15 Thread Madabhattula Rajesh Kumar
Hi Mich, I have a question below. I want to join two tables and return the result based on the input value. In this case, how do we need to specify the lower bound and upper bound values? select t1.id, t1.name, t2.course, t2.qualification from t1, t2 where t1.transactionid = 1 and t1.id = t2.id

Re: parallel processing with JDBC

2016-08-15 Thread Mich Talebzadeh
If you have your RDBMS table partitioned, then you need to consider how much data you want to extract, in other words the result set returned by the JDBC call. If you want all the data, then the number of partitions specified in the JDBC call should be equal to the number of partitions in your