Re: Flatten JSON to multiple columns in Spark

2017-07-19 Thread Chetan Khatri
Thank you, Damji and all, for the guidance. I made a schema according to my JSON; can you check whether it is correct? *JSON String* [{"cert":[{ "authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa", "certUUID":"03ea5a1a-5530-4fa3-8871-9d1ebac627c4",
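
A minimal sketch of what such a schema and the flattening step could look like, assuming each record is parsed as a single object holding the "cert" array (the leading array bracket in the sample would need to be handled at read time) and that df carries the raw JSON in a string column named jsonString; only authSbmtr and certUUID come from the fragment above, everything else is assumed:

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions.{col, explode, from_json}

    // Hypothetical schema for the fragment above: a "cert" array whose
    // elements carry authSbmtr and certUUID string fields.
    val certElement = StructType(Seq(
      StructField("authSbmtr", StringType),
      StructField("certUUID", StringType)))
    val schema = StructType(Seq(StructField("cert", ArrayType(certElement))))

    // Parse the JSON column, explode the array, and promote the nested
    // fields to top-level columns.
    val flattened = df
      .withColumn("parsed", from_json(col("jsonString"), schema))
      .withColumn("cert", explode(col("parsed.cert")))
      .select(col("cert.authSbmtr"), col("cert.certUUID"))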

Re: to_json not working with selectExpr

2017-07-19 Thread Matthew cao
Another question: how can I find the version in which a function was added into SQL? > On July 20, 2017, at 10:43, Hyukjin Kwon wrote: > > Yes, I guess it is. > > 2017-07-20 11:31 GMT+09:00 Matthew cao >: > Ah, I get it. So

Re: How to insert a dataframe as a static partition to a partitioned table

2017-07-19 Thread Ryan
Not sure about the writer API, but you could always register a temp table for that dataframe and execute an insert HQL statement. On Thu, Jul 20, 2017 at 6:13 AM, ctang wrote: > I wonder if there are any easy ways (or APIs) to insert a dataframe (or > DataSet), which does not contain the
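
A minimal sketch of that temp-table approach, assuming a Hive-backed target table partitioned by partcol and a DataFrame df with columns (col1, col2) as in the question; the table name and partition value are illustrative:

    // Register the DataFrame so it is visible to SQL, then insert it into a
    // static partition of the target table.
    df.createOrReplaceTempView("staging")
    spark.sql("""
      INSERT INTO TABLE target PARTITION (partcol = 'value1')
      SELECT col1, col2 FROM staging
    """)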

Re: to_json not working with selectExpr

2017-07-19 Thread Matthew cao
Thank you so much! That confused me for quite a while! > On July 20, 2017, at 10:43, Hyukjin Kwon wrote: > > Yes, I guess it is. > > 2017-07-20 11:31 GMT+09:00 Matthew cao >: > Ah, I get it. So that’s why I get the not-registered

Re: to_json not working with selectExpr

2017-07-19 Thread Hyukjin Kwon
Yes, I guess it is. 2017-07-20 11:31 GMT+09:00 Matthew cao : > Ah, I get it. So that’s why I get the not-registered error? Because it was not added > into SQL in 2.1.0? > > On July 19, 2017, at 22:35, Hyukjin Kwon wrote: > > Yea, but it was added into SQL from Spark
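
A short sketch of the distinction discussed in this thread: to_json has been available as a DataFrame function since 2.1, but only from 2.2 is it registered as a SQL function, so only there can selectExpr resolve it. The column handling below is illustrative:

    import org.apache.spark.sql.functions.{col, struct, to_json}

    // Works on Spark 2.1+: to_json used through the function API.
    val viaApi = df.select(to_json(struct(df.columns.map(col): _*)).as("json"))

    // Works only on Spark 2.2+, where to_json is registered in SQL.
    val viaSql = df.selectExpr("to_json(struct(*)) AS json")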

Re: to_json not working with selectExpr

2017-07-19 Thread Matthew cao
Ah, I get it. So that’s why I get the not-registered error? Because it was not added into SQL in 2.1.0? > On July 19, 2017, at 22:35, Hyukjin Kwon > wrote: > > Yea, but it was added into SQL from Spark 2.2.0 > > 2017-07-19 23:02 GMT+09:00 Matthew cao

Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
Hmm... that's not enough info and logs are intentionally kept silent to avoid flooding, but if you enable DEBUG level logging for org.apache.spark.network.crypto in both YARN and the Spark app, that might provide more info. On Wed, Jul 19, 2017 at 2:58 PM, Udit Mehrotra
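
For reference, one way to turn on that logging with the stock log4j.properties (assuming the default log4j setup that Spark and the YARN node managers ship with):

    log4j.logger.org.apache.spark.network.crypto=DEBUG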

How to insert a dataframe as a static partition to a partitioned table

2017-07-19 Thread ctang
I wonder if there are any easy ways (or APIs) to insert a dataframe (or DataSet), which does not contain the partition columns, as a static partition into the table. For example, the DataSet with columns (col1, col2) will be inserted into a table (col1, col2) partitioned by column partcol as a

ClassNotFoundException for Workers

2017-07-19 Thread Noppanit Charassinvichai
I have this Spark job which uses an S3 client inside mapPartitions, and I get this error: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 74, ip-10-90-78-177.ec2.internal, executor 11): java.lang.NoClassDefFoundError: Could not
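
A NoClassDefFoundError on the executors often means the S3/AWS SDK classes are available where the job was built but are not shipped to the cluster. One hedged sketch of a fix is to pass the dependency explicitly at submit time; the artifact version, main class, and jar name below are assumptions:

    spark-submit \
      --packages com.amazonaws:aws-java-sdk-s3:1.11.160 \
      --class com.example.SparkS3Job \
      target/spark-s3-job.jar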

Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Udit Mehrotra
So I added these settings in yarn-site.xml as well. Now I get a completely different error, but at least it seems like it is using the crypto library: ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with

Re: about aggregateByKey of pairrdd.

2017-07-19 Thread Sumedh Wale
On Wednesday 19 July 2017 06:20 PM, qihuagao wrote: Java pair RDDs have aggregateByKey, which can avoid a full shuffle and so has impressive performance. The aggregateByKey function requires 3 parameters: # An initial ‘zero’ value that will
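
A small illustration of those three arguments, computing a per-key (sum, count) pair so an average can be derived afterwards (data and names are made up):

    // zeroValue = (0, 0); seqOp folds one value into a partition-local
    // accumulator; combOp merges accumulators from different partitions.
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 5)))
    val sumCount = pairs.aggregateByKey((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b)   => (a._1 + b._1, a._2 + b._2))
    val avgByKey = sumCount.mapValues { case (sum, count) => sum.toDouble / count }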

Re: how does spark handle compressed files

2017-07-19 Thread Jörn Franke
Spark uses the Hadoop API to access files. This means they are transparently decompressed. However, gzip can only be decompressed in a single thread per file, and bzip2 is very slow. The best is either to have multiple files (each one at least the size of an HDFS block) or, better, to use a modern
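
As a concrete consequence of the single-thread limitation: a gzip file arrives as one partition, so a repartition after the read is a common way to restore parallelism for the downstream stages (the path and partition count below are illustrative):

    val raw = spark.sparkContext.textFile("hdfs:///data/events.json.gz")
    // Only one task reads the .gz file; spread the data out before heavy work.
    val parallel = raw.repartition(64)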

how does spark handle compressed files

2017-07-19 Thread Ashok Kumar
Hi, How does Spark handle compressed files? Are they optimizable in terms of using multiple RDDs against the file, or does one need to uncompress them beforehand, say for bz2-type files? Thanks

Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Udit Mehrotra
So I am using the standard documentation here for configuring the external shuffle service: https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service I can see on the cluster that the following jars are present as well:

Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
On Wed, Jul 19, 2017 at 1:10 PM, Udit Mehrotra wrote: > Is there any additional configuration I need for external shuffle besides > setting the following: > spark.network.crypto.enabled true > spark.network.crypto.saslFallback false > spark.authenticate

Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
Well, how did you install the Spark shuffle service on YARN? It's not part of YARN. If you really have the Spark 2.2 shuffle service jar deployed in your YARN service, then perhaps you didn't configure it correctly to use the new auth mechanism. On Wed, Jul 19, 2017 at 12:47 PM, Udit Mehrotra

Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Udit Mehrotra
Sorry about that. Will keep the list in my replies. So, just to clarify, I am not using an older version of Spark's shuffle service. This is a brand new cluster with just Spark 2.2.0 installed alongside Hadoop 2.7.3. Could there be anything else I am missing, or anything I can try differently? Thanks!

Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
Please include the list on your replies, so others can benefit from the discussion too. On Wed, Jul 19, 2017 at 11:43 AM, Udit Mehrotra wrote: > Hi Marcelo, > > Thanks a lot for confirming that. Can you explain what you mean by upgrading > the version of shuffle

Re: DataFrame --- join / groupBy-agg question...

2017-07-19 Thread Muthu Jayakumar
The problem with 'spark.sql.shuffle.partitions' is that it needs to be set before the Spark session is created (I guess?). But ideally, I want to partition by column during a join / group-by (something roughly like repartitionBy(partitionExpression: Column*) from
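
For the column-based repartitioning part, a hedged sketch using Dataset.repartition with a column expression, assuming both sides of the join share a key column named "id" (the names and partition count are illustrative):

    import org.apache.spark.sql.functions.col

    val leftByKey  = left.repartition(200, col("id"))
    val rightByKey = right.repartition(200, col("id"))
    val joined     = leftByKey.join(rightByKey, "id")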

Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Udit Mehrotra
Hi Spark Devs, I am trying out Spark’s new internal authentication mechanism based on AES encryption, https://issues.apache.org/jira/browse/SPARK-19139, which has come up in Spark 2.2.0. I set the following properties in my spark-defaults: spark.network.crypto.enabled true

Re: Question regarding Sparks new Internal authentication mechanism

2017-07-19 Thread Marcelo Vanzin
On Wed, Jul 19, 2017 at 11:19 AM, Udit Mehrotra wrote: > spark.network.crypto.saslFallback false > spark.authenticate true > > This seems to work fine with internal shuffle service of Spark. However, > when in I try it with Yarn’s external shuffle service

Re: Spark history server running on Mongo

2017-07-19 Thread Ivan Sadikov
Yes, you are absolutely right, though the UI does not change often, and it potentially allows one to iterate faster, IMHO, which is why I started working on this. For me, it felt like this functionality could easily be outsourced to a separate project. And, as you pointed out, I did add some small fixes to

Re: Spark history server running on Mongo

2017-07-19 Thread Marcelo Vanzin
On Tue, Jul 18, 2017 at 7:21 PM, Ivan Sadikov wrote: > Repository that I linked to does not require rebuilding Spark and could be > used with current distribution, which is preferable in my case. Fair enough, although that means that you're re-implementing the Spark UI,

Re: DataFrame --- join / groupBy-agg question...

2017-07-19 Thread ayan guha
You can use spark.sql.shuffle.partitions to adjust the amount of parallelism. On Wed, Jul 19, 2017 at 11:41 PM, muthu wrote: > Hello there, > > Thank you for looking into the question. > > >Is the partition count of df depending on fields of groupby? > Absolute partition number
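
As a sketch, spark.sql.shuffle.partitions is a runtime SQL conf, so it can also be adjusted on an existing session before the join or aggregation runs:

    // Illustrative value; tune to the data volume.
    spark.conf.set("spark.sql.shuffle.partitions", "400")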

MPEG files optimisation with Spark

2017-07-19 Thread Mich Talebzadeh
Hi, let us assume that we want to use Spark on large MPEG files. I believe these files can be stored as sequence files in HDFS. However, how can Spark optimize the processing of these types of large files? Regards, Mich
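
Since MPEG files cannot be split on arbitrary byte boundaries, one hedged option is to keep each file whole with binaryFiles; the path below is hypothetical and this is only a sketch, not a recommendation for a specific storage format:

    // Each element is (path, PortableDataStream), so a whole file stays on
    // one task; here we just measure its size as a placeholder computation.
    val videos = spark.sparkContext.binaryFiles("hdfs:///data/mpeg/*.mpg")
    val sizes  = videos.mapValues(stream => stream.toArray().length)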

Re: to_json not working with selectExpr

2017-07-19 Thread Hyukjin Kwon
Yea, but it was added into SQL from Spark 2.2.0. 2017-07-19 23:02 GMT+09:00 Matthew cao : > I am using version 2.1.1. As far as I can remember, this function was added > in 2.1.0. > > On July 17, 2017, at 12:05, Burak Yavuz wrote: > > Hi Matthew, > > Which Spark

Re: Slow response on Solr Cloud with Spark

2017-07-19 Thread Anastasios Zouzias
Hi Imran, It seems that you do not cache your underlying DataFrame. I would suggest forcing a cache with tweets.cache() and then tweets.count(). Let us know if your problem persists. Best, Anastasios On Wed, Jul 19, 2017 at 2:49 PM, Imran Rajjad wrote: > Greetings, > > We
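
Spelled out, the suggestion is simply to cache and then trigger an action so the cache is materialized before the joins run (the DataFrame name is taken from the thread):

    tweets.cache()
    tweets.count()   // forces evaluation so later queries hit the cache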

Re: Regarding Logistic Regression changes in Spark 2.2.0

2017-07-19 Thread Nick Pentreath
L-BFGS is the default optimization method since the initial ML package implementation. The OWLQN variant is used only when L1 regularization is specified (via the elasticNetParam). 2.2 adds the box constraints (optimized using the LBFGS-B variant). So no, no upgrade is required to use L-BFGS - if
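
Illustrative settings only: with elasticNetParam = 0.0 the regularization is pure L2 and plain L-BFGS is used, while any L1 mixture switches the optimizer to the OWLQN variant described above:

    import org.apache.spark.ml.classification.LogisticRegression

    val lrL2 = new LogisticRegression().setRegParam(0.1).setElasticNetParam(0.0)
    val lrL1 = new LogisticRegression().setRegParam(0.1).setElasticNetParam(1.0)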

Regarding Logistic Regression changes in Spark 2.2.0

2017-07-19 Thread Aseem Bansal
Hi, I was reading the API of Spark 2.2.0 and noticed a change compared to 2.1.0. Compared to https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression the 2.2.0 docs at

Re: to_json not working with selectExpr

2017-07-19 Thread Matthew cao
I am using version 2.1.1. As far as I can remember, this function was added in 2.1.0. > On July 17, 2017, at 12:05, Burak Yavuz > wrote: > > Hi Matthew, > > Which Spark version are you using? The expression `to_json` was added in 2.2 > with this commit:

Re: DataFrame --- join / groupBy-agg question...

2017-07-19 Thread muthu
Hello there, Thank you for looking into the question. >Is the partition count of the df dependent on the groupBy fields? Either an absolute partition number or a column value to determine the partition count would be fine for me (which is similar to repartition() I suppose) >Also is the performance of

Re: DataFrame --- join / groupBy-agg question...

2017-07-19 Thread qihuagao
Also interested in this. Is the partition count of the df dependent on the groupBy fields? Also, is the performance of groupBy-agg comparable to reduceByKey/aggregateByKey?

Slow response on Solr Cloud with Spark

2017-07-19 Thread Imran Rajjad
Greetings, We are trying out Spark 2 + ThriftServer to join multiple collections from a Solr Cloud (6.4.x). I have followed this blog https://lucidworks.com/2015/08/20/solr-spark-sql-datasource/ I understand that initially spark populates the temporary table with 18633014 records and takes its

about aggregateByKey of pairrdd.

2017-07-19 Thread qihuagao
Java pair RDDs have aggregateByKey, which can avoid a full shuffle and so has impressive performance. The aggregateByKey function requires 3 parameters: # An initial ‘zero’ value that will not affect the total values to be collected # A combining function accepting two

Feature Generation for Large datasets composed of many time series

2017-07-19 Thread julio . cesare
Hello, I want to create a lib which generates features for potentially very large datasets. Each file 'F' of my dataset is composed of at least: an id (string or int), a timestamp (or a long value), and a value (int or string). I want my tool to: compute aggregate functions for many
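
One possible shape for such aggregations, grouping by id and a time window; this is a hedged sketch, and the column names (id, ts, value) and the window length are assumptions:

    import org.apache.spark.sql.functions.{avg, col, max, window}

    val features = df
      .groupBy(col("id"), window(col("ts"), "15 minutes"))
      .agg(avg(col("value")).as("avg_value"), max(col("value")).as("max_value"))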

Re: Spark | Window Function |

2017-07-19 Thread Julien CHAMP
Hi and thanks a lot for your example! OK, I've found my problem. There was too much data (1000 ids / 1000 timestamps) for my test, and it does not seem to work in such cases :/ It does not seem to scale linearly with the number of ids. With a small example, with 1000 timestamps per id: -
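
For reference, a sketch of the kind of per-id, time-range window frame being tested above; the column names and the 600-unit range are assumptions, and the ordering column is taken to be a numeric timestamp so rangeBetween applies:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, sum}

    val w = Window
      .partitionBy(col("id"))
      .orderBy(col("timestamp"))   // e.g. epoch seconds stored as a long
      .rangeBetween(-600, 0)       // values within the previous 600 units

    val withRollingSum = df.withColumn("rolling_sum", sum(col("value")).over(w))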

Re: Flatten JSON to multiple columns in Spark

2017-07-19 Thread Chetan Khatri
As I am a beginner, if someone could give pseudocode it would be highly appreciated. On Tue, Jul 18, 2017 at 11:43 PM, lucas.g...@gmail.com wrote: > That's a great link Michael, thanks! > > For us it was around attempting to provide for dynamic schemas which is a > bit of an

Structured Streaming: Row differences, e.g., with Window and lag()

2017-07-19 Thread Karamba
Hi, I am looking for approaches to compare a row with the next one to determine, e.g., differences in event times/timestamps. I just found a couple of solutions that use the Window class, but that does not seem to work on streaming data, such as
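
For context, the lag()-based approach those solutions use looks roughly like the sketch below on a static DataFrame; on a streaming DataFrame this kind of non-time-based window is not supported, which is exactly the limitation being asked about. Column names are assumptions:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag, unix_timestamp}

    val w = Window.partitionBy(col("device")).orderBy(col("eventTime"))
    val withDiff = staticDf.withColumn(
      "secondsSincePrev",
      unix_timestamp(col("eventTime")) - unix_timestamp(lag(col("eventTime"), 1).over(w)))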