NoClassDefFoundError after setting spark.eventLog.enabled=true

2016-09-02 Thread C. Josephson
I use Spark 1.6.2 with Java, and after I set spark.eventLog.enabled=true, Spark crashes with this exception: Exception in thread "main" java.lang.NoClassDefFoundError: org/json4s/jackson/JsonMethods$ at org.apache.spark.scheduler.EventLoggingListener$.initEventLog(EventLoggingListener.scala:257)
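This error usually means the json4s-jackson classes Spark itself depends on are missing from, or shadowed on, the application classpath (for example by a different json4s version bundled into an uber-jar). A minimal sbt sketch of the usual fix; the version shown is an assumption, so check the json4s jar your Spark 1.6.x distribution actually ships:

    // Pin json4s-jackson to the version bundled with your Spark distribution
    // (3.2.10 is an assumption for Spark 1.6.x -- verify against its jars):
    libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.2.10"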

Re: Spark SQL Tables on top of HBase Tables

2016-09-02 Thread Mich Talebzadeh
Hi, You can create a Hive external table on top of an existing HBase table using the STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' clause. Example: hive> show create table hbase_table; OK CREATE TABLE `hbase_table`( `key` int COMMENT '', `value1` string COMMENT '', `value2`
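A minimal Scala sketch of how this looks from the Spark side, assuming the external table has already been created in the Hive CLI as above (table, column family and column names are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // DDL assumed to have been run in the Hive CLI first, e.g.:
    //   CREATE EXTERNAL TABLE hbase_table (key INT, value1 STRING, value2 STRING)
    //   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    //   WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:value1,cf:value2')
    //   TBLPROPERTIES ('hbase.table.name' = 'my_hbase_table');
    // Spark then sees the table through the shared metastore:
    val sc = new SparkContext(new SparkConf().setAppName("hbase-via-hive"))
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT key, value1 FROM hbase_table LIMIT 10").show()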

Re: Spark SQL Tables on top of HBase Tables

2016-09-02 Thread ayan guha
You can either read HBase into an RDD and then turn it into a DataFrame, or expose HBase tables via Hive and read from Hive, or use Phoenix. On 3 Sep 2016 08:08, "KhajaAsmath Mohammed" wrote: > Hi Kim, > > I am also looking for the same information. Just got the same requirement > today. >
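A hedged sketch of the first route (read HBase into an RDD, then convert to a DataFrame); table, column family and column names are hypothetical, and sc/sqlContext are assumed to be in scope as in the spark-shell:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_hbase_table")

    // Read (rowkey, row) pairs out of HBase via the MapReduce input format:
    val hbaseRdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // Project the cells you need and give the result a schema:
    val rows = hbaseRdd.map { case (key, result) =>
      (Bytes.toString(key.get()),
       Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("value1"))))
    }
    val df = sqlContext.createDataFrame(rows).toDF("key", "value1")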

Re: Spark SQL Tables on top of HBase Tables

2016-09-02 Thread KhajaAsmath Mohammed
Hi Kim, I am also looking for the same information. Just got the same requirement today. Thanks, Asmath On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim wrote: > I was wondering if anyone has tried to create Spark SQL tables on top of > HBase tables so that data in HBase can be

Spark SQL Tables on top of HBase Tables

2016-09-02 Thread Benjamin Kim
I was wondering if anyone has tried to create Spark SQL tables on top of HBase tables so that data in HBase can be accessed using the Spark Thriftserver with SQL statements? This is similar to what can be done using Hive. Thanks, Ben

Re: Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-02 Thread sagarcasual .
Hi Cody, thanks for the reply. I am using Spark 1.6.1 with Kafka 0.9. When I want to stop streaming, stopping the context sounds OK, but for temporarily excluding partitions, is there any way I can dynamically supply topic-partition info at the beginning of every pull? Will

Re: Re[2]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-02 Thread Mich Talebzadeh
Since you are using the Spark Thrift Server (which in turn uses the Hive Thrift Server), I suspect it uses the Hive optimiser, which would mean that stats do matter. However, that may be just an assumption. Have you partitioned these parquet tables? Is it worth logging into Hive and running the same
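If the Hive optimiser is indeed in play, a hedged one-line sketch of collecting table-level stats (the table name is hypothetical):

    // Compute basic table statistics so the optimiser has something to work with:
    spark.sql("ANALYZE TABLE my_parquet_table COMPUTE STATISTICS")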

Re: Is cache() still necessary for Spark DataFrames?

2016-09-02 Thread Mich Talebzadeh
Hi, As I understand it, Spark's memory allocation is split between execution memory and storage memory. The sum is deterministic (the total memory allocated, in its simplest form), so by using the storage cache you impact that sum. Now: 1. cache() is an alias for persist(MEMORY_ONLY). 2. Caching is only done once. 3.
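A small sketch of point 1, with one caveat worth noting: the MEMORY_ONLY alias holds for RDDs, while DataFrame.cache() defaults to MEMORY_AND_DISK since recomputing a DataFrame partition can be expensive (an rdd is assumed to be in scope):

    import org.apache.spark.storage.StorageLevel

    rdd.cache()                              // shorthand for...
    rdd.persist(StorageLevel.MEMORY_ONLY)    // ...this explicit form (same level, so no conflict)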

how to pass trustStore path into pyspark ?

2016-09-02 Thread Eric Ho
I'm trying to pass a trustStore pathname into pyspark. What env variable and/or config file or script do I need to change to do this? I've tried setting the JAVA_OPTS env var, but to no avail... any pointer much appreciated... thx -- -eric ho
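A hedged sketch of one way to do this: the trustStore is a plain JVM system property, so it can go into the driver/executor extra Java options (the path below is hypothetical). Since the driver JVM is already running by the time application code executes, in practice these are usually passed via spark-submit --conf or spark-defaults.conf rather than set programmatically:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.extraJavaOptions",
        "-Djavax.net.ssl.trustStore=/path/to/truststore.jks")
      .set("spark.executor.extraJavaOptions",
        "-Djavax.net.ssl.trustStore=/path/to/truststore.jks")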

Re: Is cache() still necessary for Spark DataFrames?

2016-09-02 Thread Davies Liu
Caching an RDD/DataFrame always has some cost. In this case, I'd suggest not caching the DataFrame; first() is usually fast enough (it only computes the partitions it needs). On Fri, Sep 2, 2016 at 1:05 PM, apu wrote: > When I first learnt Spark, I was told that

Is cache() still necessary for Spark DataFrames?

2016-09-02 Thread apu
When I first learnt Spark, I was told that *cache()* is desirable anytime one performs more than one Action on an RDD or DataFrame. For example, consider the PySpark toy example below; it shows two approaches to doing the same thing. # Approach 1 (bad?) df2 = someTransformation(df1) a =
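A Scala sketch of the same two approaches (someTransformation is a hypothetical stand-in and df1 an existing DataFrame):

    import org.apache.spark.sql.DataFrame

    def someTransformation(df: DataFrame): DataFrame = df.filter("value > 0")  // stand-in

    // Approach 1: every action recomputes df2 from scratch.
    val df2 = someTransformation(df1)
    val a = df2.count()
    val b = df2.first()

    // Approach 2: cache df2 so the second action reuses the materialized data.
    val df2c = someTransformation(df1).cache()
    val a2 = df2c.count()
    val b2 = df2c.first()
    df2c.unpersist()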

Reset auto.offset.reset in Kafka 0.10 integ

2016-09-02 Thread Srikanth
Hi, Upon restarting my Spark Streaming app, it fails with the error: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 6, localhost):
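For reference, a hedged sketch of how the consumer is configured in the 0.10 integration (broker address and group id are hypothetical); auto.offset.reset only applies when the group has no committed offset for a partition, or the committed offset is out of range:

    import org.apache.kafka.common.serialization.StringDeserializer

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-streaming-app",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )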

Re: Dataset encoder for java.time.LocalDate?

2016-09-02 Thread Jakob Odersky
Spark currently requires at least Java 1.7, so adding a Java 1.8-specific encoder will not be straightforward without affecting requirements. I can think of two solutions: 1. add a Java 1.8 build profile which includes such encoders (this may be useful for Scala 2.12 support in the future as
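In the meantime, a hedged workaround sketch: a binary (kryo) encoder can stand in for java.time.LocalDate, at the cost of the column being stored as opaque bytes that SQL expressions cannot see into:

    import java.time.LocalDate
    import org.apache.spark.sql.{Encoder, Encoders}

    // Serialize LocalDate with kryo; for a case class containing a LocalDate you
    // would need a kryo encoder for the whole case class instead.
    implicit val localDateEncoder: Encoder[LocalDate] = Encoders.kryo[LocalDate]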

Re: Spark scheduling mode

2016-09-02 Thread Mark Hamstra
And, no, Spark's scheduler will not preempt already running Tasks. In fact, just killing running Tasks for any reason is trickier than we'd like it to be, so it isn't done by default: https://issues.apache.org/jira/browse/SPARK-17064 On Fri, Sep 2, 2016 at 11:34 AM, Mark Hamstra

Re: Spark scheduling mode

2016-09-02 Thread Mark Hamstra
The comparator is used in `Pool#getSortedTaskSetQueue`. The `TaskSchedulerImpl` calls that on the rootPool when the TaskScheduler needs to handle `resourceOffers` for available Executor cores. Creation of the `sortedTaskSets` is a recursive, nested sorting of the `Schedulable` entities -- you
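For context, a hedged sketch of where the scheduling mode and pools come from on the user side (pool name and file path are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread are assigned to the named pool:
    sc.setLocalProperty("spark.scheduler.pool", "highPriority")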

Re: Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-02 Thread Cody Koeninger
If you just want to pause the whole stream, stop the app and then restart it when you're ready. If you want to do some type of per-partition manipulation, you're going to need to write some code. The 0.10 integration makes the underlying kafka consumer pluggable, so you may be able to wrap
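A one-line sketch of the stop-and-restart option (assumes a StreamingContext named ssc): stop gracefully so already-received data is processed, then relaunch the app when ready.

    ssc.stop(stopSparkContext = true, stopGracefully = true)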

Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-02 Thread sagarcasual .
Hello, this is for "Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch". I have the following code that creates a direct stream using the Kafka connector for Spark. final JavaInputDStream msgRecords =
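For comparison, a hedged Scala sketch against the Spark 1.6 direct API (broker, topic and offsets are hypothetical; ssc is an existing StreamingContext). The fromOffsets variant fixes which topic-partitions the stream reads at creation time, which is the closest the 1.6 connector gets to including/excluding partitions; changing the set per batch would mean stopping and recreating the stream:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Only the partitions listed here are consumed; omit those to exclude.
    val fromOffsets = Map(
      TopicAndPartition("my-topic", 0) -> 0L,
      TopicAndPartition("my-topic", 1) -> 0L
    )

    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))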

Does Spark support Partition Pruning with Parquet Files

2016-09-02 Thread Lost Overflow
Hello Everyone, Can anyone answer this: http://stackoverflow.com/q/37180073/6022341? Thanks in advance.
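One hedged sketch of the directory-based partition pruning Spark does support (paths and column names are hypothetical; df and sqlContext are assumed to be in scope): data written with partitionBy is laid out as .../date=2016-09-02/... directories, and a filter on the partition column prunes whole directories at read time instead of scanning them:

    import org.apache.spark.sql.functions.col

    df.write.partitionBy("date").parquet("/data/events")
    val pruned = sqlContext.read.parquet("/data/events")
      .filter(col("date") === "2016-09-02")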

Re: Scala Vs Python

2016-09-02 Thread darren
I politely disagree. The JVM is one VM; Python has another. It's less about preference and more about where the skills of the industry are going for data analysis, BI, etc. No one cares about JVM vs. PVM. They do care about time. So if the time to prototype is 10x faster (in calendar days) but the

Dataset encoder for java.time.LocalDate?

2016-09-02 Thread Daniel Siegmann
It seems Spark can handle case classes with java.sql.Date, but not java.time.LocalDate. It complains there's no encoder. Are there any plans to add an encoder for LocalDate (and other classes in the new Java 8 Time and Date API), or is there an existing library I can use that provides encoders?

Re: Scala Vs Python

2016-09-02 Thread Mich Talebzadeh
No offence taken. Glad that it was rectified. Cheers, Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com

Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
I apologize for my harsh tone. You are right, it was unnecessary and discourteous. On Fri, Sep 2, 2016 at 11:01 AM Mich Talebzadeh wrote: > Hi, > > You made such statement: > > "That's complete nonsense." > > That is a strong language and void of any courtesy. Only

Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
You made a specific claim -- that Spark will move away from Python -- which I responded to with clear references and data. How on earth is that a "religious argument"? I'm not saying that Python is better than Scala or anything like that. I'm just addressing your specific claim about its future

Re: Scala Vs Python

2016-09-02 Thread andy petrella
Looking at the examples, indeed they make no sense :D On Fri, 2 Sep 2016 16:48 Mich Talebzadeh, wrote: > Right so. We are back into religious arguments. Best of luck > > > > Dr Mich Talebzadeh > > > > LinkedIn * >

Re: Scala Vs Python

2016-09-02 Thread Mich Talebzadeh
Right so. We are back into religious arguments. Best of luck. Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: BinaryClassificationMetrics - get raw tp/fp/tn/fn stats per threshold?

2016-09-02 Thread Sean Owen
Given recall by threshold, you can compute true positive count per threshold by just multiplying through by the count of elements where label = 1. From that you can get false negatives by subtracting from that same count. Given precision by threshold, and true positives count by threshold, you
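A sketch of that arithmetic in Scala (scoreAndLabels, an RDD[(Double, Double)] of (score, label) pairs, is assumed to be in scope):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val positives = scoreAndLabels.filter(_._2 == 1.0).count().toDouble
    val negatives = scoreAndLabels.count().toDouble - positives

    // recall = tp / (tp + fn) and (tp + fn) = positives:
    val tp = metrics.recallByThreshold().mapValues(_ * positives)
    val fn = tp.mapValues(positives - _)

    // precision = tp / (tp + fp), so fp = tp / precision - tp:
    val fp = metrics.precisionByThreshold().join(tp)
      .mapValues { case (p, t) => t / p - t }
    val tn = fp.mapValues(negatives - _)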

Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh wrote: > I believe as we progress in time Spark is going to move away from Python. If > you look at 2014 Databricks code examples, they were mostly in Python. Now > they are mostly in Scala for a reason. > That's complete

BinaryClassificationMetrics - get raw tp/fp/tn/fn stats per threshold?

2016-09-02 Thread Spencer, Alex (Santander)
Hi, BinaryClassificationMetrics exposes recall and precision byThreshold. Is there a way to get true negatives / false negatives etc. per threshold? I have weighted my genuines and would like the adjusted precision / FPR. (Unless there is an option that I've missed, although I have been over the

Passing Custom App Id for consumption in History Server

2016-09-02 Thread Amit Shanker
Currently Spark sets the current time in milliseconds as the app id. Is there a way one can pass the app id in to the Spark job, so that it uses this provided app id instead of generating one using the time? Let's take the following scenario: I have a system application which schedules Spark jobs, and

Re: PySpark: preference for Python 2.7 or Python 3.5?

2016-09-02 Thread Ian Stokes Rees
On 9/2/16 3:47 AM, Felix Cheung wrote: There is an Anaconda parcel one could readily install on CDH https://docs.continuum.io/anaconda/cloudera As Sean says it is Python 2.7.x. Spark should work for both 2.7 and 3.5. Yes, I'm actually an engineer at Continuum, so I know the Anaconda parcel

Re: Grouping on bucketed and sorted columns

2016-09-02 Thread Fridtjof Sander
I managed to do some experimental evaluation, and it seems I understood the code correctly: a partition that consists of Hive buckets is read bucket-file by bucket-file, which leads to the loss of internal sorting. Does anyone have an opinion about my alternative idea of reading from

Re[2]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-02 Thread Сергей Романов
Hi Mich, Column x29 does not seem to be special in any way. It's a newly created table and I did not calculate stats for any columns. Actually, I can sum a single column several times in a query and face a dramatic performance hit at some "magic" point. Setting "set

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-02 Thread Aseem Bansal
Hi, Thanks for all the details. I was able to convert from ml.NaiveBayesModel to mllib.NaiveBayesModel and get it done. It is fast for our use case. Just one question: before mllib is removed, can the ml package be expected to reach feature parity with mllib? On Thu, Sep 1, 2016 at 7:12 PM, Sean Owen

Re: Spark scheduling mode

2016-09-02 Thread enrico d'urso
Thank you. May I know when that comparator is called? It looks like the Spark scheduler does not have any form of preemption, am I right? Thank you From: Mark Hamstra Sent: Thursday, September 1, 2016 8:44:10 PM To: enrico d'urso Cc:

Re: Fwd: Need some help

2016-09-02 Thread Aakash Basu
Hi Shashank/All, Yes, you got it right, that's what I need to do. Can I get some help with this? I've no clue what it is or how to work on it. Thanks, Aakash. On Fri, Sep 2, 2016 at 1:48 AM, Shashank Mandil wrote: > Hi Aakash, > > I think what it generally means is that

Re: Scala Vs Python

2016-09-02 Thread Sivakumaran S
Whatever benefits you may accrue from rapid prototyping and coding in Python will be offset by the time taken to convert it to run inside the JVM. This of course depends on the complexity of the DAG. I guess it is a matter of language preference. Regards, Sivakumaran S > On

Re: Scala Vs Python

2016-09-02 Thread Mich Talebzadeh
From an outsider's point of view, nobody likes change :) However, it appears to me that Scala is a rising star, and if one learns it, it is another iron in the fire, so to speak. I believe as we progress in time Spark is going to move away from Python. If you look at 2014 Databricks code examples,

Re: PySpark: preference for Python 2.7 or Python 3.5?

2016-09-02 Thread Felix Cheung
There is an Anaconda parcel one could readily install on CDH: https://docs.continuum.io/anaconda/cloudera As Sean says, it is Python 2.7.x. Spark should work for both 2.7 and 3.5. From: Sean Owen Sent: Friday,

Re: PySpark: preference for Python 2.7 or Python 3.5?

2016-09-02 Thread Sean Owen
Spark should work fine with Python 3. I'm not a Python person, but all else equal I'd use 3.5 too. I assume the issue could be libraries you want that don't support Python 3. I don't think that changes with CDH. It includes a version of Anaconda from Continuum, but that lays down Python 2.7.11. I

Re: [Error:]while read s3 buckets in Spark 1.6 in spark -submit

2016-09-02 Thread Divya Gehlot
Hi Steve, I am trying to read from s3n://"bucket" and have already included aws-java-sdk 1.7.4 in my classpath. My machine is AWS EMR with Hadoop 2.7.2 and Spark 1.6.1 installed. As per the post below, there is an issue with EMR Hadoop 2.7.2
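Since the underlying error is not shown here, one hedged possibility is missing credentials for the s3n filesystem; a sketch with placeholder values (never hard-code real keys; on EMR, instance-profile credentials normally make this unnecessary):

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access-key>")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secret-key>")
    val lines = sc.textFile("s3n://my-bucket/path/")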

Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
Forgot to answer your question about feature parity of Python w.r.t. Spark's different components. I mostly work with Scala, so I can't say for sure, but I think that all pre-2.0 features (that's basically everything except Structured Streaming) are on par. Structured Streaming is a pretty new

Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
As you point out, often the reason that Python support lags behind is that functionality is implemented in Scala, so the API in that language is "free", whereas Python support needs to be added explicitly. Nevertheless, Python bindings are an important part of Spark and are used by many people (this

RE: Scala Vs Python

2016-09-02 Thread Santoshakhilesh
I have seen a talk by Brian Clapper at NE-SCALA 2016: RDDs, DataFrames and Datasets in Apache Spark. At 15:00 there is a slide showing a comparison of aggregating 10 million integer pairs using RDD and DataFrame with different language bindings (Scala, Python, R). As per

Re: Scala Vs Python

2016-09-02 Thread ayan guha
Tal: I think, by the nature of the project itself, Python APIs are developed after Scala and Java, and it is a fair trade-off for speed of getting stuff to market. And as this discussion progresses, I see not much issue in terms of feature parity. Coming back to performance, Darren

Re: Custom return code

2016-09-02 Thread Pierre Villard
Any hint? 2016-08-31 20:40 GMT+02:00 Pierre Villard : > Hi, > > I am using Spark 1.5.2 and I am submitting a job (jar file) using the > spark-submit command in yarn cluster mode. I'd like the command to return > a custom return code. > > In the run method, if I do: >
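A hedged sketch of the yarn-cluster caveat: the driver runs inside the YARN application master, so a numeric code passed to System.exit there is not what the spark-submit process on the client returns. One common pattern is to fail the application explicitly and use spark-submit's zero/non-zero exit status as the signal (runJob is a hypothetical job body):

    object MyApp {
      def runJob(): Boolean = { /* actual work */ true }

      def main(args: Array[String]): Unit = {
        if (!runJob()) {
          // YARN marks the application FAILED and spark-submit exits non-zero;
          // the specific custom code is not propagated through YARN.
          throw new RuntimeException("job failed")
        }
      }
    }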

Re: Scala Vs Python

2016-09-02 Thread Tal Grynbaum
On Fri, Sep 2, 2016 at 1:15 AM, darren wrote: > This topic is a concern for us as well. In the data science world no one > uses native scala or java by choice. It's R and Python. And python is > growing. Yet in spark, python is 3rd in line for feature support, if at all. > >