Re: SparkSQL performance

2015-04-20 Thread ayan guha
SparkSQL optimizes mainly through column pruning and predicate pushdown. Here you are not taking advantage of either. I am curious to know what goes into your filter function, as you are not using a filter on the SQL side. Best Ayan On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo

Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread ayan guha
I think the recommended approach is to create a dataframe using HBase as the source. Then you can run any SQL on that DF. In 1.2 you can create a base rdd and then apply the schema in the same manner On 21 Apr 2015 03:12, Jeetendra Gangele gangele...@gmail.com wrote: Thanks for reply. Does phoenix using

Re: Updating a Column in a DataFrame

2015-04-20 Thread ayan guha
You can always create another DF using a map. In reality operations are lazy, so only the final value will get computed. Can you provide the use case in a little more detail? On 21 Apr 2015 08:39, ARose ashley.r...@telarix.com wrote: In my Java application, I want to update the values of a Column in a
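
A minimal PySpark sketch of the map-then-recreate approach described above, assuming the pyspark shell (where sc and sqlContext are predefined) and hypothetical columns name and age; the "update" produces a second DataFrame and nothing runs until an action is called.

    from pyspark.sql import Row

    # Hypothetical input DataFrame with columns "name" and "age".
    df = sqlContext.createDataFrame([Row(name="a", age=1), Row(name="b", age=2)])

    # "Update" a column by mapping each Row to a new Row and building another DF.
    updated = sqlContext.createDataFrame(
        df.rdd.map(lambda r: Row(name=r.name, age=r.age + 10)))

    updated.show()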

Re: Map-Side Join in Spark

2015-04-21 Thread ayan guha
If you are using a pair RDD, then you can use the partitionBy method to provide your partitioner On 21 Apr 2015 15:04, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: What is re-partition? On Tue, Apr 21, 2015 at 10:23 AM, ayan guha guha.a...@gmail.com wrote: In my understanding you need to create

Re: Map-Side Join in Spark

2015-04-20 Thread ayan guha
In my understanding you need to create a key out of the data and repartition both datasets to achieve a map-side join. On 21 Apr 2015 14:10, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Can someone share their working code of Map Side join in Spark + Scala. (No Spark-SQL) The only resource i could
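
A rough PySpark sketch of the co-partitioning idea above, assuming the pyspark shell and two hypothetical datasets already keyed on the join column; partitioning both sides with the same partitioner lets the subsequent join avoid an extra shuffle (broadcasting the smaller side is the other common way to get a true map-side join).

    # Hypothetical (key, value) datasets.
    left = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
    right = sc.parallelize([(1, "x"), (3, "y")])

    num_parts = 8

    # Co-partition both sides with the same partitioner before joining.
    left_p = left.partitionBy(num_parts)
    right_p = right.partitionBy(num_parts)

    print(left_p.join(right_p).collect())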

Re: invalidate caching for hadoopFile input?

2015-04-20 Thread ayan guha
You can use rdd.unpersist. It's documented in the Spark programming guide under the Removing Data section. Ayan On 21 Apr 2015 13:16, Wei Wei vivie...@gmail.com wrote: Hey folks, I am trying to load a directory of avro files like this in spark-shell: val data =
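
A small sketch of cache-then-unpersist, assuming the pyspark shell and a hypothetical input path; rdd.unpersist() releases the cached blocks once the data is stale or no longer needed.

    # Hypothetical path; cache the loaded RDD and materialise it with an action.
    data = sc.textFile("hdfs:///tmp/input").cache()
    print(data.count())

    # Invalidate the cached copy.
    data.unpersist()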

Spark 1.3.1 - SQL Issues

2015-04-20 Thread ayan guha
-hadoop2.6\python\pyspark\mllib\recommendation.py, line 127, in _prepare assert isinstance(ratings, RDD), ratings should be RDD AssertionError: ratings should be RDD -- Best Regards, Ayan Guha

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread ayan guha
-- Best Regards, Ayan Guha

Re: Custom Partitioning Spark

2015-04-21 Thread ayan guha
-- Best Regards, Ayan Guha

Re: Column renaming after DataFrame.groupBy

2015-04-21 Thread ayan guha
-- Best Regards, Ayan Guha

Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread ayan guha
solution? I am thinking to map the training dataframe back to an RDD, but will lose the schema information. Best Ayan On Mon, Apr 20, 2015 at 10:23 PM, ayan guha guha.a...@gmail.com wrote: Hi Just upgraded to Spark 1.3.1. I am getting a warning Warning (from warnings module): File D

Re: what is the best way to transfer data from RDBMS to spark?

2015-04-24 Thread ayan guha
What is the specific use case? I can think of a couple of ways (write to hdfs and then read from spark, or stream data to spark). Also I have seen people using mysql jars to bring data in. Essentially you want to simulate creation of an rdd. On 24 Apr 2015 18:15, sequoiadb mailing-list-r...@sequoiadb.com

Re: directory loader in windows

2015-04-25 Thread ayan guha
) print newsY.count() On 25 April 2015 at 20:08, ayan guha guha.a...@gmail.com wrote: Hi I am facing this weird issue. I am on Windows, and I am trying to load all files within a folder. Here is my code - loc = D:\\Project\\Spark\\code\\news\\jsonfeeds newsY = sc.textFile(loc

Re: Querying Cluster State

2015-04-26 Thread ayan guha
that are currently available using API calls and then take some appropriate action based on the information I get back, like restart a dead Master or Worker. Is this possible? Does Spark provide such an API? -- Best Regards, Ayan Guha

Re: Querying Cluster State

2015-04-26 Thread ayan guha
On Sun, Apr 26, 2015 at 10:12 AM, ayan guha guha.a...@gmail.com wrote: In my limited understanding, there must be a single leader master in the cluster. If there are multiple leaders, it will lead to an unstable cluster as each master will keep scheduling independently. You should use zookeeper

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread ayan guha
#org.apache.spark.ml.recommendation.ALS In the examples/ directory for ml/, you can find a MovieLensALS example. Good luck! Joseph On Tue, Apr 21, 2015 at 4:58 AM, ayan guha guha.a...@gmail.com wrote: Hi I am getting an error Also, I am getting an error in mlib.ALS.train function when passing

Re: Customized Aggregation Query on Spark SQL

2015-04-24 Thread ayan guha
you! Best, Wenlei -- Best Regards, Ayan Guha

Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread ayan guha
I just tested your pr On 25 Apr 2015 10:18, Ali Bajwa ali.ba...@gmail.com wrote: Any ideas on this? Any sample code to join 2 data frames on two columns? Thanks Ali On Apr 23, 2015, at 1:05 PM, Ali Bajwa ali.ba...@gmail.com wrote: Hi experts, Sorry if this is a n00b question or has
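
For reference, a hedged PySpark sketch of joining two DataFrames on two columns, assuming the pyspark shell and hypothetical key columns k1 and k2; the two equality conditions are combined with &.

    # Hypothetical frames sharing key columns k1 and k2.
    a = sqlContext.createDataFrame([(1, "x", 10), (2, "y", 20)], ["k1", "k2", "va"])
    b = sqlContext.createDataFrame([(1, "x", 7.0), (2, "z", 9.0)], ["k1", "k2", "vb"])

    # Join on both columns by AND-ing the equality conditions.
    joined = a.join(b, (a.k1 == b.k1) & (a.k2 == b.k2), "inner")
    joined.show()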

Re: Customized Aggregation Query on Spark SQL

2015-04-24 Thread ayan guha
you so much for the help! On Sat, Apr 25, 2015 at 12:41 AM, ayan guha guha.a...@gmail.com wrote: can you give an example set of data and desired output On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie wenlei@gmail.com wrote: Hi, I would like to answer the following customized aggregation

Re: what is the best way to transfer data from RDBMS to spark?

2015-04-25 Thread ayan guha
that this is different than the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL). On Fri, Apr 24, 2015 at 6:27 PM, ayan guha guha.a...@gmail.com wrote: What is the specific usecase? I can think of couple of ways (write to hdfs and then read from spark or stream

Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread ayan guha
, Ayan Guha

directory loader in windows

2015-04-25 Thread ayan guha
:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Unknown Source) -- Best Regards, Ayan

Re: Pipeline in pyspark

2015-04-23 Thread ayan guha
I do not think you can share data across spark contexts. So as long as you can pass it around you should be good. On 23 Apr 2015 17:12, Suraj Shetiya surajshet...@gmail.com wrote: Hi, I have come across ways of building pipeline of input/transform and output pipelines with Java (Google

Re: Spark SQL performance issue.

2015-04-23 Thread ayan guha
Quick questions: why are you caching both the rdd and the table? Which stage of the job is slow? On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com wrote: Hi, I have Spark SQL performance issue. My code contains a simple JavaBean: public class Person implements Externalizable {

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread ayan guha
') But in Spark 1.4.0 this does not seem to make any difference anyway and the problem is the same with both versions. On 2015-04-21 17:04, ayan guha wrote: your code should be df_one = df.select('col1', 'col2') df_two = df.select('col1', 'col3') Your current code is generating a tuple

Re: Understanding Spark's caching

2015-04-28 Thread ayan guha
Hi I replied to you on SO. If option A had an action call then it should suffice too. On 28 Apr 2015 05:30, Eran Medan eran.me...@gmail.com wrote: Hi Everyone! I'm trying to understand how Spark's cache works. Here is my naive understanding, please let me know if I'm missing something: val

Re: 1.3.1: Persisting RDD in parquet - Conflicting partition column names

2015-04-28 Thread ayan guha
Can you show your code please? On 28 Apr 2015 13:20, sranga sra...@gmail.com wrote: Hi I am getting the following error when persisting an RDD in parquet format to an S3 location. This is code that was working in the 1.2 version. The version that it is failing to work is 1.3.1. Any help is

Re: How to add jars to standalone pyspark program

2015-04-28 Thread ayan guha
It's a Windows thing. Please escape the slashes in the string. Basically it is not able to find the file On 28 Apr 2015 22:09, Fabian Böhnlein fabian.boehnl...@gmail.com wrote: Can you specify 'running via PyCharm'. How are you executing the script, with spark-submit? In PySpark I guess you used

Re: Dataframe filter based on another Dataframe

2015-04-29 Thread ayan guha
Regards, Ayan Guha

Re: DataFrame filter referencing error

2015-04-29 Thread ayan guha
) at java.lang.Thread.run(Thread.java:745) Does filter work only on columns of the integer type? What is the exact behaviour of the filter function and what is the best way to handle the query I am trying to execute? Thank you, Francesco -- Best Regards, Ayan Guha

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-29 Thread ayan guha
I guess what you mean is not streaming. If you create a stream context at time t, you will receive data coming in after time t, not before it. Looks like you want a queue. Let Kafka write to a queue, consume messages from the queue and stop when the queue is empty. On 29 Apr 2015 14:35,

Re: Initial tasks in job take time

2015-04-28 Thread ayan guha
Is your driver running on the same machine as the master? On 29 Apr 2015 03:59, Anshul Singhle ans...@betaglide.com wrote: Hi, I'm running short spark jobs on rdds cached in memory. I'm also using a long running job context. I want to be able to complete my jobs (on the cached rdd) in under 1 sec.

RE: HOw can I merge multiple DataFrame and remove duplicated key

2015-04-30 Thread ayan guha
it using DataFrame? Can you give an example code snipet? Thanks Ningjun *From:* ayan guha [mailto:guha.a...@gmail.com] *Sent:* Wednesday, April 29, 2015 5:54 PM *To:* Wang, Ningjun (LNG-NPV) *Cc:* user@spark.apache.org *Subject:* Re: HOw can I merge multiple DataFrame and remove duplicated key

Re: HOw can I merge multiple DataFrame and remove duplicated key

2015-04-29 Thread ayan guha
It's no different; you would use group by and an aggregate function to do so. On 30 Apr 2015 02:15, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I have multiple DataFrame objects each stored in a parquet file. The DataFrame just contains 3 columns (id, value, timeStamp). I need to
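
A minimal sketch of the group-by-and-aggregate idea, assuming the pyspark shell and the (id, value, timeStamp) layout mentioned in the thread; it unions the frames and keeps the latest timestamp per id (joining back on the max timestamp would recover the full row).

    cols = ["id", "value", "timeStamp"]
    df1 = sqlContext.createDataFrame([(1, "a", 100), (2, "b", 100)], cols)
    df2 = sqlContext.createDataFrame([(1, "a2", 200)], cols)

    # Union everything, then keep one entry per key via an aggregate.
    merged = df1.unionAll(df2)
    latest = merged.groupBy("id").agg({"timeStamp": "max"})
    latest.show()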

Re: Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread ayan guha
The answer is it depends :) The fact that query runtime increases indicates more shuffle. You may want to construct rdds based on keys you use. You may want to specify what kind of node you are using and how many executors you are using. You may also want to play around with executor memory

Re: Automatic Cache in SparkSQL

2015-04-27 Thread ayan guha
Spark keeps the job in memory by default for the kind of performance gains you are seeing. Additionally, depending on your query, spark runs stages and at any point in time spark's code behind the scenes may issue an explicit cache. If you hit any such scenario you will find those cached objects in the UI under

Re: Scalability of group by

2015-04-27 Thread ayan guha
Hi Can you test on a smaller dataset to identify if it is cluster issue or scaling issue in spark On 28 Apr 2015 11:30, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I am running a group by on a dataset of 2B of RDD[Row [id, time, value]] in Spark 1.3 as follows: “select id,

Re: New JIRA - [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns

2015-04-28 Thread ayan guha
The alias function is not in python yet. I suggest writing SQL if your data suits it On 28 Apr 2015 14:42, Don Drake dondr...@gmail.com wrote: https://issues.apache.org/jira/browse/SPARK-7182 Can anyone suggest a workaround for the above issue? Thanks. -Don -- Donald Drake Drake Consulting

Re: Spark distributed SQL: JSON Data set on all worker node

2015-05-03 Thread ayan guha
Yes it is possible. You need to use the jsonFile method on the SQL context and then create a dataframe from the rdd. Then register it as a table. Should be 3 lines of code, thanks to spark. You may see a few YouTube videos, especially on unifying pipelines. On 3 May 2015 19:02, Jai jai4l...@gmail.com wrote: Hi,
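
The "3 lines of code" above might look roughly like this sketch, assuming the pyspark shell and a hypothetical JSON path (in Spark 1.3, sqlContext.jsonFile reads the files and infers the schema).

    df = sqlContext.jsonFile("hdfs:///data/records.json")   # schema is inferred
    df.registerTempTable("records")
    print(sqlContext.sql("SELECT count(*) FROM records").collect())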

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread ayan guha
You can use a custom partitioner to redistribute the data using partitionBy On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote: I'm currently trying to join two large tables (order 1B rows each) using Spark SQL (1.3.0) and am running into long GC pauses which bring the job to a halt. I'm

Re: Hardware requirements

2015-05-04 Thread ayan guha
Hi How do you figure out 500gig~3900 partitions? I am trying to do the math. If I assume 64mb block size then 1G~16 blocks and 500g~8000 blocks. If we assume split and block sizes are same, shouldn't we end up with 8k partitions? On 4 May 2015 17:49, Akhil Das ak...@sigmoidanalytics.com wrote:

Re: mapping JavaRDD to jdbc DataFrame

2015-05-04 Thread ayan guha
? Thanks, Lior -- Best Regards, Ayan Guha

Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread ayan guha
-- Best Regards, Ayan Guha

Python Custom Partitioner

2015-05-04 Thread ayan guha
path? b) How can I do partitionBy? Specifically, when I call DF.rdd.partitionBy, what gets passed to the custom function? A tuple? A row? How do I access, say, the 3rd column of a tuple inside the partitioner function? -- Best Regards, Ayan Guha
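
For what it's worth, a sketch under the standard RDD.partitionBy contract: it operates on (key, value) pairs and the partition function receives only the key, so the Row has to be keyed first (here on a hypothetical third column).

    df = sqlContext.createDataFrame(
        [(1, "a", "east"), (2, "b", "west")], ["id", "val", "region"])

    # Key each Row on the column of interest; partitionBy sees only the key.
    keyed = df.rdd.map(lambda row: (row[2], row))

    def by_region(key):
        # key is whatever sits in position 0 of the pair -- the region string here.
        return 0 if key == "east" else 1

    parts = keyed.partitionBy(2, by_region)
    print(parts.glom().map(len).collect())   # rows per partition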

Re: How to add a column to a spark RDD with many columns?

2015-05-01 Thread ayan guha
Do you have an rdd or a dataframe? Rdds are kind of tuples. You can add a new column to it with a map. Rdds are immutable, so you will get another rdd. On 1 May 2015 14:59, Carter gyz...@hotmail.com wrote: Hi all, I have a RDD with *MANY *columns (e.g., *hundreds*), how do I add one more column at the
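
A tiny sketch of the map-based approach, assuming the pyspark shell and hypothetical numeric rows; appending the derived column yields a new RDD while the original stays untouched.

    rows = sc.parallelize([(1, 2, 3), (4, 5, 6)])      # imagine hundreds of fields

    with_extra = rows.map(lambda t: t + (sum(t),))     # original rdd is unchanged
    print(with_extra.collect())                        # [(1, 2, 3, 6), (4, 5, 6, 15)]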

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread ayan guha
And if I may ask, how long does it take in the hbase CLI? I would not expect spark to improve performance of hbase. At best spark will push down the filter to hbase. So I would try to optimise any additional overhead like bringing data into spark. On 1 May 2015 00:56, Ted Yu yuzhih...@gmail.com wrote:

Re: DataFrame filter referencing error

2015-04-30 Thread ayan guha
PM ayan guha guha.a...@gmail.com wrote: Looks like your DF is based on a MySQL DB using jdbc, and the error is thrown from mySQL. Can you see what SQL is finally getting fired in MySQL? Spark is pushing down the predicate to mysql so it's not a spark problem per se On Wed, Apr 29, 2015 at 9:56 PM

Re: Compute pairwise distance

2015-04-29 Thread ayan guha
This is my first thought, please suggest any further improvement: 1. Create an rdd of your dataset 2. Do a cross join to generate pairs 3. Apply reduceByKey and compute the distance. You will get an rdd with key pairs and distances Best Ayan On 30 Apr 2015 06:11, Driesprong, Fokko fo...@driesprong.frl
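
A hedged sketch of steps 1-3 above, assuming the pyspark shell and hypothetical 2-D points keyed by an id; cartesian generates the pairs and a map computes the distance per pair.

    import math

    # Hypothetical points: (id, (x, y)).
    points = sc.parallelize([(1, (0.0, 0.0)), (2, (3.0, 4.0)), (3, (6.0, 8.0))])

    # Cross join, keeping each unordered pair once and dropping self-pairs.
    pairs = points.cartesian(points).filter(lambda p: p[0][0] < p[1][0])

    def pair_distance(p):
        (ida, (xa, ya)), (idb, (xb, yb)) = p
        return ((ida, idb), math.sqrt((xa - xb) ** 2 + (ya - yb) ** 2))

    print(pairs.map(pair_distance).collect())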

Re: How to group multiple row data ?

2015-04-29 Thread ayan guha
-- Best Regards, Ayan Guha

Re: directory loader in windows

2015-05-02 Thread ayan guha
(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Unknown Source) -- Best Regards, Ayan Guha

Re: JAVA for SPARK certification

2015-05-05 Thread ayan guha
And how important is it to have a production environment? On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages. 2015-05-05 3:49 GMT-07:00 Kartik Mehta kartik.meht...@gmail.com: I too have a similar question. My understanding is since Spark

Re: Unable to join table across data sources using sparkSQL

2015-05-05 Thread ayan guha
-- Best Regards, Ayan Guha

Re: Maximum Core Utilization

2015-05-05 Thread ayan guha
Also, if not already done, you may want to try repartitioning your data into 50 partitions On 6 May 2015 05:56, Manu Kaul manohar.k...@gmail.com wrote: Hi All, For a job I am running on Spark with a dataset of say 350,000 lines (not big), I am finding that even though my cluster has a large

Re: Partition Case Class RDD without ParRDDFunctions

2015-05-06 Thread ayan guha
it to a tuple2 seems like a waste of space/computation. It looks like PairRDDFunctions.partitionBy() uses a ShuffleRDD[K,V,C] which requires K,V,C? Could I create a new ShuffleRDD[MyClass,MyClass,MyClass](caseClassRdd, new HashParitioner)? Cheers, N -- Best Regards, Ayan Guha

Re: Creating topology in spark streaming

2015-05-06 Thread ayan guha
Every transformation on a dstream will create another dstream. You may want to take a look at foreachRDD. Also, kindly share your code so people can help better On 6 May 2015 17:54, anshu shukla anshushuk...@gmail.com wrote: Please help guys, Even After going through all the examples given i
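
A short PySpark Streaming sketch illustrating the two points above, assuming the pyspark shell and a hypothetical socket source: each transformation yields another DStream, and foreachRDD exposes the underlying RDD of every batch.

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 5)                       # 5-second batches
    lines = ssc.socketTextStream("localhost", 9999)     # hypothetical source

    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))    # each step is a new DStream

    def dump(rdd):
        print(rdd.take(5))                              # act on the RDD of each batch

    counts.foreachRDD(dump)
    ssc.start()
    ssc.awaitTermination()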

Re: JAVA for SPARK certification

2015-05-05 Thread ayan guha
for Spark certification, learning in group makes learning easy and fun. Kartik On May 5, 2015 7:31 AM, ayan guha guha.a...@gmail.com wrote: And how important is to have production environment? On 5 May 2015 20:51, Stephen Boesch java...@gmail.com wrote: There are questions in all three languages

Re: Receiver Fault Tolerance

2015-05-06 Thread ayan guha
. Is the above understanding correct? or is there more to it? -- Best Regards, Ayan Guha

Re: How can I force operations to complete and spool to disk

2015-05-07 Thread ayan guha
be forced. Any ideas? -- Best Regards, Ayan Guha

Re: saveAsTextFile() to save output of Spark program to HDFS

2015-05-05 Thread ayan guha
What happens when you try to put files into your hdfs from the local filesystem? Looks like it's an hdfs issue rather than a spark thing. On 6 May 2015 05:04, Sudarshan njmu...@gmail.com wrote: I have searched all replies to this question and not found an answer. I am running standalone Spark 1.3.1 and

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread ayan guha
From S3, as the dependency of the df will be on s3, and because rdds are not replicated. On 8 May 2015 23:02, Peter Rudenko petro.rude...@gmail.com wrote: Hi, i have a next question: val data = sc.textFile(s3:///)val df = data.toDF df.saveAsParquetFile(hdfs://) df.someAction(...) if during

Re: Python - SQL (geonames dataset)

2015-05-11 Thread ayan guha
Try this: res = ssc.sql(your SQL without limit); print res.first() Note: your SQL looks wrong as count will need a group by clause. Best Ayan On 11 May 2015 16:22, Tyler Mitchell tyler.mitch...@actian.com wrote: I'm using Python to setup a dataframe, but for some reason it is not being made

Re: Reading Nested Fields in DataFrames

2015-05-11 Thread ayan guha
Typically you would use dot notation to access it, the same way you would access a map. On 12 May 2015 00:06, Ashish Kumar Singh ashish23...@gmail.com wrote: Hi , I am trying to read Nested Avro data in Spark 1.3 using DataFrames. I need help to retrieve the Inner element data in the Structure below.
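
A small sketch of the dot notation, assuming the pyspark shell and a hypothetical nested JSON record with a struct column named address.

    nested = sc.parallelize(
        ['{"name": "a", "address": {"city": "Sydney", "zip": "2000"}}'])
    df = sqlContext.jsonRDD(nested)

    # Dot notation reaches into the struct, much like a map lookup.
    df.select("name", "address.city").show()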

Re: can we start a new thread in foreachRDD in spark streaming?

2015-05-11 Thread ayan guha
It depends on how you want to run your application. You can always save 100 batches as a data file and run another app to read those files. In that case you have separate contexts and you will find both applications running simultaneously in the cluster but on different JVMs. But if you do not want

Re: Python Custom Partitioner

2015-05-04 Thread ayan guha
Thanks, but is there a non-broadcast solution? On 5 May 2015 01:34, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have implemented map-side join with broadcast variables and the code is on the mailing list (scala). On Mon, May 4, 2015 at 8:38 PM, ayan guha guha.a...@gmail.com wrote: Hi Can

Re: custom join using complex keys

2015-05-10 Thread ayan guha
with a given predicate to implement this ? (I would probably also need to provide a partitioner, and some sorting predicate). Left and right RDD are 1-10 millions lines long. Any idea ? Thanks Mathieu -- Best Regards, Ayan Guha

Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread ayan guha
How did you end up with thousands of DFs? Are you using streaming? In that case you can do foreachRDD and keep merging the incoming rdds into a single rdd and then save it through your own checkpoint mechanism. If not, please share your use case. On 11 May 2015 00:38, Peter Aberline

Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread ayan guha
file. They have the same schema. There is also the option of appending each DF to the parquet file, but then I can't maintain them as separate DF when reading back in without filtering. I'll rethink maintaining each CSV file as a single DF. Thanks, Peter On 10 May 2015 at 15:51, ayan guha

Re: spark and binary files

2015-05-09 Thread ayan guha
-- Best Regards, Ayan Guha

Re: CREATE TABLE ignores database when using PARQUET option

2015-05-08 Thread ayan guha
I am just wondering if create table supports the syntax Create table db.tablename instead of the two-step process of use db and then create table tablename? On 9 May 2015 08:17, Michael Armbrust mich...@databricks.com wrote: Actually, I was talking about the support for inferring different but

Re: Map one RDD into two RDD

2015-05-08 Thread ayan guha
Do as Evo suggested. Rdd1=rdd.filter, rdd2=rdd.filter On 9 May 2015 05:19, anshu shukla anshushuk...@gmail.com wrote: Any update to above mail and Can anyone tell me logic - I have to filter tweets and submit tweets with particular #hashtag1 to SparkSQL databases and tweets with

Re: IF in SQL statement

2015-05-16 Thread ayan guha
() thx, Antony. -- Best Regards, Ayan Guha

Re: Spark SQL is not able to connect to hive metastore

2015-05-16 Thread ayan guha
-- Best Regards, Ayan Guha

Re: Spark SQL is not able to connect to hive metastore

2015-05-16 Thread ayan guha
Here it is from the documentation: Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs. Currently Spark SQL is based on Hive 0.12.0 and 0.13.1. On Sun, May 17, 2015 at 1:48 AM, ayan guha guha.a...@gmail.com wrote: Hi Try with Hive 0.13. If I am not wrong, Hive 0.14

Re: Custom Aggregate Function for DataFrame

2015-05-16 Thread ayan guha
the performance. Thanks. Justin On Fri, May 15, 2015 at 6:32 AM, ayan guha guha.a...@gmail.com wrote: can you kindly elaborate on this? it should be possible to write udafs in similar lines of sum/min etc. On Fri, May 15, 2015 at 5:49 AM, Justin Yip yipjus...@prediction.io wrote: Hello, May I

Re: reduceByKey

2015-05-14 Thread ayan guha
: *2553: 0,0,0,1,0,1,0,0* 46551: 0,1,0,0,0,0,0,0 266: 0,1,0,0,0,0,0,0 *225546: 0,0,0,0,0,2,0,0* Anyone can help me getting that? Thank you. Have a nice day. yasemin -- hiç ender hiç -- Best Regards, Ayan Guha

Re: Using sc.HadoopConfiguration in Python

2015-05-14 Thread ayan guha
) lines.count() On Thu, May 14, 2015 at 4:17 AM, ayan guha guha.a...@gmail.com wrote: Jo Thanks for the reply, but _jsc does not have anything to pass hadoop configs. can you illustrate your answer a bit more? TIA... On Wed, May 13, 2015 at 12:08 AM, Ram Sriharsha sriharsha@gmail.com wrote

Re: Spark performance in cluster mode using yarn

2015-05-14 Thread ayan guha
With this information it is hard to predict. What's the performance you are getting? What's your desired performance? Maybe you can post your code and experts can suggest improvements? On 14 May 2015 15:02, sachin Singh sachin.sha...@gmail.com wrote: Hi Friends, please someone can give the

Re: Custom Aggregate Function for DataFrame

2015-05-15 Thread ayan guha
-- Best Regards, Ayan Guha

Re: Worker Spark Port

2015-05-15 Thread ayan guha
...@gmail.com wrote: I understand that this port value is randomly selected. Is there a way to enforce which spark port a Worker should use? -- Best Regards, Ayan Guha

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread ayan guha
batches, I would need to handle update in case the hdfs directory already exists. Is this a common approach? Are there any other approaches that I can try? Thank you! Nisrina. -- Best Regards, Ayan Guha

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread ayan guha
-- Best Regards, Ayan Guha

Re: Using sc.HadoopConfiguration in Python

2015-05-14 Thread ayan guha
(jsc) https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext through which you can access the hadoop configuration On Tue, May 12, 2015 at 6:39 AM, ayan guha guha.a...@gmail.com wrote: Hi I found this method in scala API but not in python API (1.3.1). Basically, I

Re: Processing multiple columns in parallel

2015-05-18 Thread ayan guha
My first thought would be creating 10 rdds and running your word count on each of them. I think the spark scheduler is going to resolve the dependencies in parallel and launch 10 jobs. Best Ayan On 18 May 2015 23:41, Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote: Hi, Consider I have a tab delimited text

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-18 Thread ayan guha
Hi So to be clear, do you want to run one operation in multiple threads within a function, or do you want to run multiple jobs using multiple threads? I am wondering why the python thread module can't be used? Or have you already given it a try? On 18 May 2015 16:39, MEETHU MATHEW meethu2...@yahoo.co.in
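
One way the plain-threads approach could look, as a sketch assuming the pyspark shell: each Python thread submits its own job against the single shared SparkContext and Spark schedules them concurrently.

    import threading

    def run_job(n):
        total = sc.parallelize(range(n)).map(lambda x: x * x).sum()
        print("job %d -> %s" % (n, total))

    threads = [threading.Thread(target=run_job, args=(n,)) for n in (1000, 2000, 3000)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()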

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-13 Thread ayan guha
Your stack trace says it can't convert a date to an integer. Are you sure about the column positions? On 13 May 2015 21:32, Ishwardeep Singh ishwardeep.si...@impetus.co.in wrote: Hi , I am using Spark SQL 1.3.1. I have created a dataFrame using jdbc data source and am using saveAsTable() method but got

Re: how to set random seed

2015-05-14 Thread ayan guha
the seed (call random.seed()) once on each worker? -- *From:* ayan guha guha.a...@gmail.com *Sent:* Tuesday, May 12, 2015 11:17 PM *To:* Charles Hayden *Cc:* user *Subject:* Re: how to set random seed Easiest way is to broadcast it. On 13 May 2015 10:40, Charles

Re: Spark SQL on large number of columns

2015-05-19 Thread ayan guha
and create a logical plan. Even if i have just one row, it's taking more than 1 hour just to get pass the parsing. Any idea how to optimize in these kind of scenarios? Regards, Madhukara Phatak http://datamantra.io/ -- Best Regards, Ayan Guha

Re: how to set random seed

2015-05-13 Thread ayan guha
The easiest way is to broadcast it. On 13 May 2015 10:40, Charles Hayden charles.hay...@atigeo.com wrote: In pySpark, I am writing a map with a lambda that calls random.shuffle. For testing, I want to be able to give it a seed, so that successive runs will produce the same shuffle. I am looking
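
A hedged sketch of broadcasting the seed, assuming the pyspark shell; seeding one generator per partition (the broadcast seed plus the partition index) keeps successive runs reproducible without every partition shuffling identically.

    import random

    seed_bc = sc.broadcast(42)          # the seed every worker should agree on

    def shuffled(index, rows):
        rng = random.Random(seed_bc.value + index)
        rows = list(rows)
        rng.shuffle(rows)
        return iter(rows)

    data = sc.parallelize(range(20), 4)
    print(data.mapPartitionsWithIndex(shuffled).collect())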

Using sc.HadoopConfiguration in Python

2015-05-12 Thread ayan guha
, how? -- Best Regards, Ayan Guha

Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread ayan guha
the schema, I am specifying every field as nullable. So I believe, it should not throw this error. Can anyone help me fix this error. Thank you. Regards, Anand.C -- Best Regards, Ayan Guha

Re: Spark Job not using all nodes in cluster

2015-05-19 Thread ayan guha
What does your spark env file say? Are you setting the number of executors in the spark context? On 20 May 2015 13:16, Shailesh Birari sbirar...@gmail.com wrote: Hi, I have a 4 node Spark 1.3.1 cluster. All four nodes have 4 cores and 64 GB of RAM. I have around 600,000+ Json files on HDFS. Each file

Re: Hive on Spark VS Spark SQL

2015-05-20 Thread ayan guha
And if I am not wrong, the spark SQL api is intended to move closer to SQL standards. I feel it's a clever decision on spark's part to keep both APIs operational. These short-term confusions are worth the long-term benefits. On 20 May 2015 17:19, Sean Owen so...@cloudera.com wrote: I don't think that's

Re: Spark SQL on large number of columns

2015-05-19 Thread ayan guha
are you using Sent from my iPhone On 19 May 2015, at 18:29, ayan guha guha.a...@gmail.com wrote: can you kindly share your code? On Tue, May 19, 2015 at 8:04 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying run spark sql aggregation on a file with 26k columns. No of rows is very small. I am

Re: Spark 1.3.1 - SQL Issues

2015-05-20 Thread ayan guha
Thanks a bunch On 21 May 2015 07:11, Davies Liu dav...@databricks.com wrote: The docs had been updated. You should convert the DataFrame to RDD by `df.rdd` On Mon, Apr 20, 2015 at 5:23 AM, ayan guha guha.a...@gmail.com wrote: Hi Just upgraded to Spark 1.3.1. I am getting a warning
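
A sketch of the df.rdd fix for this thread, assuming the pyspark shell and a hypothetical (user, product, rating) DataFrame; mapping the Rows to Rating objects makes the layout explicit before calling ALS.train.

    from pyspark.mllib.recommendation import ALS, Rating

    ratings_df = sqlContext.createDataFrame(
        [(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0)], ["user", "product", "rating"])

    # ALS.train asserts it receives an RDD, so convert via df.rdd.
    ratings_rdd = ratings_df.rdd.map(lambda r: Rating(r.user, r.product, r.rating))

    model = ALS.train(ratings_rdd, rank=10, iterations=5)
    print(model.predict(1, 20))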

Re: Python Image Library and Spark

2015-06-03 Thread ayan guha
Try with a larger number of partitions in parallelize. On 4 Jun 2015 06:28, Justin Spargur jmspar...@gmail.com wrote: Hi all, I'm playing around with manipulating images via Python and want to utilize Spark for scalability. That said, I'm just learning Spark and my Python is a bit rusty

Re: Saving calculation to single local file

2015-06-05 Thread ayan guha
Another option is to merge the part-files after your app ends. On 5 Jun 2015 20:37, Akhil Das ak...@sigmoidanalytics.com wrote: you can simply do rdd.repartition(1).saveAsTextFile(...), it might not be efficient if your output data is huge since one task will be doing the whole writing. Thanks Best

Re: Saving calculation to single local file

2015-06-05 Thread ayan guha
is the better solution? Best Regards Marcos -- Best Regards, Ayan Guha

Re: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-09 Thread ayan guha
-- Best Regards, Ayan Guha

Re: Managing spark processes via supervisord

2015-06-05 Thread ayan guha
, etc.) in order to have the cluster up and running after boot-up; although I'd like to understand if it will cause more issues than it solves. Thanks, Mike. -- Best Regards, Ayan Guha

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread ayan guha
operations like join, groupBy, agg, unionAll etc which are all transformations in RDD? Are they lazily evaluated or immediately executed? -- Best Regards, Ayan Guha
