RE: Window Functions with SQLContext

2016-08-31 Thread Saurabh Dubey
Hi Ayan, even with a SQL query like: SQLContext sqlContext = new SQLContext(jsc); DataFrame df_2_sql = sqlContext.sql("select assetId, row_number() over ( partition by " + "assetId order by assetId) as " + "serial from

Re: Window Functions with SQLContext

2016-08-31 Thread ayan guha
I think you can write the SQL query and run it using sqlContext, like: select *, row_number() over (partition by assetid order by assetid) rn from t On Thu, Sep 1, 2016 at 3:16 PM, saurabh3d wrote: > Hi All, > > As per SPARK-11001
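
A minimal Scala sketch of the SQL-based approach described above, assuming a DataFrame df with an assetId column (as in this thread) that gets registered as a temp table named t; note that on Spark 1.x, window functions may require a HiveContext rather than a plain SQLContext:

    import org.apache.spark.sql.{DataFrame, SQLContext}

    // Register the DataFrame as a temp table, then run the window function as plain SQL.
    def addSerial(sqlContext: SQLContext, df: DataFrame): DataFrame = {
      df.registerTempTable("t")
      sqlContext.sql(
        """select assetId,
          |       row_number() over (partition by assetId order by assetId) as serial
          |from t""".stripMargin)
    }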

RE: Window Functions with SQLContext

2016-08-31 Thread Saurabh Dubey
Hi Adline, rowNumber and row_number are the same function: @scala.deprecated("Use row_number. This will be removed in Spark 2.0.") def rowNumber() : org.apache.spark.sql.Column = { /* compiled code */ } def row_number() : org.apache.spark.sql.Column = { /* compiled code */ } but the issue here is
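
For reference, a minimal sketch of the DataFrame-API form using the non-deprecated row_number() mentioned above; df and the assetId column are assumptions carried over from this thread:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    // Define the window spec once, then add the row number as a new column.
    val w = Window.partitionBy("assetId").orderBy("assetId")
    val withSerial = df.withColumn("serial", row_number().over(w))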

RE: Window Functions with SQLContext

2016-08-31 Thread Adline Dsilva
Hi, Use function rowNumber instead of row_number df1.withColumn("row_number", rowNumber.over(w)); Regards, Adline From: saurabh3d [saurabh.s.du...@oracle.com] Sent: 01 September 2016 13:16 To: user@spark.apache.org Subject: Window Functions with

RE: Scala Vs Python

2016-08-31 Thread Santoshakhilesh
Hi, I would prefer Scala if you are starting afresh, considering ease of use, features, performance and support. You will find numerous examples & support with Scala, which might not be true for any other language. I had personally developed the first version of my App using

KeyManager exception in Spark 1.6.2

2016-08-31 Thread Eric Ho
I was trying to enable SSL in Spark 1.6.2 and got this exception. Not sure if I'm missing something or my keystore / truststore files got bad, although keytool showed that both files are fine... 16/09/01 04:01:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your

Re: Spark build 1.6.2 error

2016-08-31 Thread Divya Gehlot
Which Java version are you using? On 31 August 2016 at 04:30, Diwakar Dhanuskodi wrote: > Hi, > > While building Spark 1.6.2, I'm getting the below error in spark-sql. Would much > appreciate any help. > > [ERROR] missing or invalid dependency detected while loading class

[Error:]while read s3 buckets in Spark 1.6 in spark -submit

2016-08-31 Thread Divya Gehlot
Hi, I am using Spark 1.6.1 on an EMR machine. I am trying to read S3 buckets in my Spark job. When I read them through the Spark shell I am able to, but when I package the job and run it with spark-submit I get the below error 16/08/31 07:36:38 INFO ApplicationMaster: Registered signal

Re: Scala Vs Python

2016-08-31 Thread 时金魁
It must be Scala or Java. On 2016-09-01 10:02:54, "ayan guha" wrote: Hi Users Thought to ask (again and again) the question: While I am building any production application, should I use Scala or Python? I have read many if not most articles but they all seem pre-Spark 2.

Scala Vs Python

2016-08-31 Thread ayan guha
Hi Users, Thought to ask (again and again) the question: while I am building any production application, should I use Scala or Python? I have read many if not most articles, but they all seem pre-Spark 2. Has anything changed with Spark 2? Either the pro-Scala way or the pro-Python way? I am thinking

Re: AnalysisException exception while parsing XML

2016-08-31 Thread Peyman Mohajerian
here is an example: df1 = df0.select(explode("manager.subordinates.subordinate_clerk.duties").alias("duties-flat"), col("duties-flat.duty.name").alias("duty-name")) this is in pyspark, I may have some part of this wrong, didn't

RE: AnalysisException exception while parsing XML

2016-08-31 Thread srikanth.jella
How do we explode nested arrays? Thanks, Sreekanth Jella From: Peyman Mohajerian

Re: Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Don Drake
Yes, I just tested it against the nightly build from 8/31. Looking at the PR, I'm happy the test added verifies my issue. Thanks. -Don On Wed, Aug 31, 2016 at 6:49 PM, Hyukjin Kwon wrote: > Hi Don, I guess this should be fixed from 2.0.1. > > Please refer this PR.

Re: Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Hyukjin Kwon
Hi Don, I guess this should be fixed from 2.0.1. Please refer this PR. https://github.com/apache/spark/pull/14339 On 1 Sep 2016 2:48 a.m., "Don Drake" wrote: > I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark > 2.0 and have encountered some

Re: AnalysisException exception while parsing XML

2016-08-31 Thread Peyman Mohajerian
Once you get to the 'Array' type, you have to use explode; you cannot do the same traversing. On Wed, Aug 31, 2016 at 2:19 PM, wrote: > Hello Experts, > > > > I am using Spark XML package to parse the XML. Below exception is being > thrown when trying to *parse a tag
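
A hedged Scala sketch of the explode-per-array-level pattern described here; the column names are guesses based on the manager/subordinates structure mentioned in this thread, and the exact paths depend on the actual XML schema:

    import org.apache.spark.sql.functions.{col, explode}

    // Each Array level is exploded into rows before the fields inside it are selected.
    val subordinates = df.select(explode(col("manager.subordinates")).alias("sub"))
    val duties = subordinates.select(explode(col("sub.subordinate_clerk.duties")).alias("duty"))
    val dutyNames = duties.select(col("duty.name").alias("duty_name"))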

Re: Expected benefit of parquet filter pushdown?

2016-08-31 Thread Robert Kruszewski
Your statistics seem corrupted. The creator field doesn't match the version spec, and as such parquet is not using it to filter. I would check whether you have references to PARQUET-251 or PARQUET-297 in your executor logs. This bug existed between parquet 1.5.0 and 1.8.0. Checkout

Fwd: Pyspark Hbase Problem

2016-08-31 Thread md mehrab
I want to read and write data from HBase using pyspark. I am getting the below error, please help. My code: from pyspark import SparkContext, SQLContext sc = SparkContext() sqlContext = SQLContext(sc) sparkconf = { "hbase.zookeeper.quorum": "localhost", "hbase.mapreduce.inputtable": "test" }

Expected benefit of parquet filter pushdown?

2016-08-31 Thread Christon DeWan
I have a data set stored in parquet with several short key fields and one relatively large (several kb) blob field. The data set is sorted by key1, key2. message spark_schema { optional binary key1 (UTF8); optional binary key2; optional binary blob; } One use case of

AnalysisException exception while parsing XML

2016-08-31 Thread srikanth.jella
Hello Experts, I am using the Spark XML package to parse XML. The below exception is thrown when trying to parse a tag which exists at arrays-of-arrays depth, i.e. in this case subordinate_clerk. .duty.name The issue is reproducible with the below sample XML: 1 mgr1

Spark jobs failing by looking for TachyonFS

2016-08-31 Thread Venkatesh Rudraraju
My spark job fails by trying to initialize Tachyon FS though it is not configured to. With a Standalone setup on my laptop the job succeeds, but with a Mesos+docker setup it fails (once every few runs). Please reply if anyone has seen this or knows why it's looking for Tachyon at all. Below is the

Custom return code

2016-08-31 Thread Pierre Villard
Hi, I am using Spark 1.5.2 and I am submitting a job (jar file) using the spark-submit command in yarn cluster mode. I'd like the command to return a custom return code. In the run method, if I do sys.exit(myCode), the command always returns 0. The only way to have something not equal to 0 is
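
A hedged sketch of one commonly suggested workaround (the message above is cut off, so this may or may not be what the sender settled on): in yarn-cluster mode the driver's sys.exit code is not propagated to spark-submit, but an uncaught exception marks the application as FAILED, which does surface as a non-zero spark-submit exit code:

    object Job {
      def main(args: Array[String]): Unit = {
        val status = run(args) // hypothetical job logic returning a status code
        if (status != 0) {
          // Failing the driver (rather than calling sys.exit) makes YARN report FAILED,
          // so spark-submit exits non-zero.
          throw new RuntimeException(s"Job finished with non-zero status $status")
        }
      }

      def run(args: Array[String]): Int = 0 // placeholder
    }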

Re: Model abstract class in spark ml

2016-08-31 Thread Mohit Jaggi
Thanks Cody. That was a good explanation! Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com > On Aug 31, 2016, at 7:32 AM, Cody Koeninger wrote: > > http://blog.originate.com/blog/2014/02/27/types-inside-types-in-scala/ > > On Wed, Aug 31, 2016 at 2:19 AM,

Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Don Drake
I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark 2.0 and have encountered some interesting issues. First, it seems the SQL parsing is different, and I had to rewrite some SQL that was doing a mix of inner joins (using where syntax, not inner) and outer joins to get the SQL

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Luciano Resende
I believe the encryption section should be updated now that a bunch of related jiras were fixed yesterday, such as https://issues.apache.org/jira/browse/SPARK-5682 On Wed, Aug 31, 2016 at 9:46 AM, Cody Koeninger wrote: > http://spark.apache.org/docs/latest/security.html > >

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Cody Koeninger
http://spark.apache.org/docs/latest/security.html On Wed, Aug 31, 2016 at 11:15 AM, Mihai Iacob wrote: > Does Spark support encryption for inter node communication ? > > > Regards, > > *Mihai Iacob* > Next Generation Platform - Security > IBM Analytics > > >

Re: releasing memory without stopping the spark context ?

2016-08-31 Thread Mich Talebzadeh
Spark memory is the sum of execution memory and storage memory. unpersist only removes the storage memory; execution memory is still there, which is what Spark is all about. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Mihai Iacob
Does Spark support encryption for inter-node communication? Regards, Mihai Iacob, Next Generation Platform - Security, IBM

RE: releasing memory without stopping the spark context ?

2016-08-31 Thread Rajani, Arpan
Removing Data: Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method. (Copied from
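
A minimal Scala sketch of doing this programmatically, assuming rdd is a previously persisted RDD and sc/sqlContext are the live contexts; sc.getPersistentRDDs lets you drop everything currently cached without stopping the SparkContext:

    // Drop a single RDD from the cache.
    rdd.unpersist()

    // Or walk all currently persisted RDDs and unpersist each one.
    sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = true))

    // For tables/DataFrames cached through the SQL layer, clearCache()
    // empties the in-memory table cache as well.
    sqlContext.clearCache()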

releasing memory without stopping the spark context ?

2016-08-31 Thread Cesar
Is there a way to release all persisted RDDs/DataFrames in Spark without stopping the SparkContext? Thanks a lot -- Cesar Flores

Re: Does a driver jvm houses some rdd partitions?

2016-08-31 Thread Mich Talebzadeh
Hi, Are you caching the RDD into storage memory here? For example: s.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY) Do you have a snapshot of your storage tab? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Random Forest Classification

2016-08-31 Thread Bryan Cutler
I see. You might try this: create a pipeline of just your feature transformers, then call fit() on the complete dataset to get a model. Finally, make a second pipeline and add this model and the decision tree as stages. On Aug 30, 2016 8:19 PM, "Bahubali Jain" wrote: > Hi
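
A hedged Scala sketch of the two-pipeline approach described above; the feature transformers, column names, fullDF and trainingDF are placeholder assumptions, not from the thread:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    // Hypothetical feature transformers; the real ones come from the asker's job.
    val featureStages: Array[PipelineStage] = Array(
      new StringIndexer().setInputCol("label").setOutputCol("labelIdx"),
      new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features"))

    // 1) Fit the feature-only pipeline on the complete dataset.
    val featureModel = new Pipeline().setStages(featureStages).fit(fullDF)

    // 2) Second pipeline: the fitted feature model (a Transformer) plus the tree.
    val decisionTree = new DecisionTreeClassifier().setLabelCol("labelIdx").setFeaturesCol("features")
    val model = new Pipeline().setStages(Array[PipelineStage](featureModel, decisionTree)).fit(trainingDF)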

Iterative update for LocalLDAModel

2016-08-31 Thread jamborta
Hi, I am trying to take the OnlineLDAOptimizer and apply it iteratively to new data. My use case would be: train the model using the DistributedLDAModel; convert to LocalLDAModel; apply to new data as data comes in using the OnlineLDAOptimizer. I cannot see that this can be done without
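
For the first two steps, a hedged MLlib sketch (corpus is an assumed RDD[(Long, Vector)] of document id / term-count pairs; the third step is exactly what the sender says is not obviously supported):

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA, LocalLDAModel}

    // Train with the EM optimizer, which yields a DistributedLDAModel, then convert.
    val distModel = new LDA().setK(10).setOptimizer("em").run(corpus).asInstanceOf[DistributedLDAModel]
    val localModel: LocalLDAModel = distModel.toLocal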

Re: Spark to Kafka communication encrypted ?

2016-08-31 Thread Cody Koeninger
Encryption is only available in spark-streaming-kafka-0-10, not 0-8. You enable it the same way you enable it for the Kafka project's new consumer, by setting kafka configuration parameters appropriately. http://kafka.apache.org/documentation.html#security_ssl On Wed, Aug 31, 2016 at 2:03 AM,
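
A hedged Scala sketch of the consumer settings Cody refers to for spark-streaming-kafka-0-10; the broker address, store paths and passwords are placeholders (see the Kafka security_ssl documentation), and the map would typically be passed to KafkaUtils.createDirectStream via ConsumerStrategies.Subscribe:

    import org.apache.kafka.common.serialization.StringDeserializer

    // SSL-related Kafka consumer configuration, alongside the usual consumer settings.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"       -> "broker1:9093",
      "security.protocol"       -> "SSL",
      "ssl.truststore.location" -> "/path/to/client.truststore.jks",
      "ssl.truststore.password" -> "changeit",
      "ssl.keystore.location"   -> "/path/to/client.keystore.jks",
      "ssl.keystore.password"   -> "changeit",
      "key.deserializer"        -> classOf[StringDeserializer],
      "value.deserializer"      -> classOf[StringDeserializer],
      "group.id"                -> "example-group"
    )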

Re: Model abstract class in spark ml

2016-08-31 Thread Cody Koeninger
http://blog.originate.com/blog/2014/02/27/types-inside-types-in-scala/ On Wed, Aug 31, 2016 at 2:19 AM, Sean Owen wrote: > Weird, I recompiled Spark with a similar change to Model and it seemed > to work but maybe I missed a step in there. > > On Wed, Aug 31, 2016 at 6:33 AM,

Does a driver jvm houses some rdd partitions?

2016-08-31 Thread Jakub Dubovsky
Hey all, I have a conceptual question which I have a hard time finding an answer for. Is the jvm where the spark driver is running also used to run computations over rdd partitions and persist them? The answer is obvious for local mode (yes). But when it runs on yarn/mesos/standalone with many executors

Re: Spark build 1.6.2 error

2016-08-31 Thread Adam Roberts
Looks familiar. Got the zinc server running and using a shared dev box? ps -ef | grep "com.typesafe.zinc.Nailgun", look for the zinc server process, kill it and try again. Spark branch-1.6 builds great here from scratch; I had plenty of problems thanks to running the zinc server here (started

Grouping on bucketed and sorted columns

2016-08-31 Thread Fridtjof Sander
Hi Spark users, I'm currently investigating spark's bucketing and partitioning capabilities and I have some questions: Let T be a table that is bucketed and sorted by T.id and partitioned by T.date. Before persisting, T has been repartitioned by T.id to get only one file per

Re: Why does spark take so much time for simple task without calculation?

2016-08-31 Thread Bedrytski Aliaksandr
Hi xiefeng, Spark Context initialization takes some time and the tool does not really shine for small data computations: http://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html But, when working with terabytes (petabytes) of data, those 35 seconds of initialization

Re: SVD output within Spark

2016-08-31 Thread Yanbo Liang
The signs of the eigenvectors are essentially arbitrary, so both the Spark and Matlab results are right. Thanks On Thu, Jul 21, 2016 at 3:50 PM, Martin Somers wrote: > > just looking at a comparison between Matlab and Spark for svd with an > input matrix N > > > this is
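
A one-line justification (not from the thread itself): each term of the SVD is an outer product $\sigma_i u_i v_i^{T}$, and $\sigma_i (-u_i)(-v_i)^{T} = \sigma_i u_i v_i^{T}$, so flipping the sign of a left singular vector together with its matching right singular vector leaves $A = \sum_i \sigma_i u_i v_i^{T}$ unchanged.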

Problem with Graphx and number of partitions

2016-08-31 Thread alvarobrandon
Hello everyone: I have a problem when setting the number of partitions inside Graphx with the ConnectedComponents function. When I launch the application with the default number of partitions everything runs smoothly. However, when I increase the number of partitions to 150, for example (it happens

Re: Spark build 1.6.2 error

2016-08-31 Thread Nachiketa
Hi Diwakar, Could you please share the entire maven command that you are using to build? And also the JDK version you are using? Also, could you please confirm that you did execute the script to change the Scala version to 2.11 before starting the build? Thanks. Regards, Nachiketa On Wed, Aug

Why does spark take so much time for simple task without calculation?

2016-08-31 Thread xiefeng
I installed Spark standalone and run the cluster (one master and one worker) on a Windows 2008 server with 16 cores and 24GB memory. I have done a simple test: just create a string RDD and simply return it. I use JMeter to test throughput but the highest is around 35/sec. I think spark is

Re: Model abstract class in spark ml

2016-08-31 Thread Sean Owen
Weird, I recompiled Spark with a similar change to Model and it seemed to work but maybe I missed a step in there. On Wed, Aug 31, 2016 at 6:33 AM, Mohit Jaggi wrote: > I think I figured it out. There is indeed "something deeper in Scala” :-) > > abstract class A { > def
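
For context, a simplified sketch (not the code from the truncated message above) of the F-bounded type parameter that org.apache.spark.ml.Model uses (abstract class Model[M <: Model[M]]), which is the kind of "types inside types" subtlety this thread and the linked post are about:

    // Simplified stand-in for the spark.ml pattern: the subclass must name itself as M.
    abstract class MyModel[M <: MyModel[M]] {
      def copy(): M
    }

    class ConcreteModel extends MyModel[ConcreteModel] {
      override def copy(): ConcreteModel = new ConcreteModel
    }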

Re: Slow activation using Spark Streaming's new receiver scheduling mechanism

2016-08-31 Thread Renxia Wang
I also have this problem. The total time for launching receivers seems related to the total number of executors. In my case, when I run 400 executors with 200 receivers, it takes about a minute for all receivers to become active, but with 800 executors, it takes 3 minutes to activate all

Spark to Kafka communication encrypted ?

2016-08-31 Thread Eric Ho
I can't find in Spark 1.6.2's docs how to turn encryption on for Spark-to-Kafka communication... I think the Spark docs only tell you how to turn on encryption for inter-Spark-node communications... Am I wrong? Thanks. -- -eric ho