Re: DataFrames :: Corrupted Data

2018-03-28 Thread Sergey Zhemzhitsky
I suppose it's hardly possible that this issue is connected with the string encoding, because "pr^?files.10056.10040" should be "profiles.10056.10040", which is defined as a constant in the source code -

Re: DataFrames :: Corrupted Data

2018-03-28 Thread Jörn Franke
An encoding issue in the data? E.g. Spark uses UTF-8, but the source encoding is different? > On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky wrote: > > Hello guys, > > I'm using Spark 2.2.0 and from time to time my job fails printing into > the log the following errors > >
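
If the source data really were in a different encoding, one way to test that hypothesis is to read it with an explicit encoding option. A minimal sketch (not from the thread), assuming Spark 2.x and CSV input; the charset and path are placeholders:

    // Sketch: reading with an explicit source encoding (Spark 2.x CSV reader).
    val df = spark.read
      .option("encoding", "ISO-8859-1") // placeholder; match the real source charset
      .csv("/path/to/input")            // placeholder path
    df.show()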

Re: Dataframes na fill with empty list

2017-04-11 Thread Sumona Routh
For some reason my pasted screenshots were removed when I sent the email (at least that's how it appeared on my end). Repasting as text below. The sequence you are referring to represents the list of column names to fill. I am asking about filling a column which is of type list with an empty

Re: Dataframes na fill with empty list

2017-04-11 Thread Sumona Routh
The sequence you are referring to represents the list of column names to fill. I am asking about filling a column which is of type list with an empty list. Here is a quick example of what I am doing: The output of the show and printSchema for the collectList df: So, the last line which
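
na.fill does not support array-typed columns, so one workaround (not from this thread; assumes Spark 2.2+ for typedLit, and "tags" is a hypothetical array<string> column) is coalesce with an empty typed literal:

    import org.apache.spark.sql.functions.{coalesce, col, typedLit}

    // Replace nulls in an array<string> column with an empty list.
    // On earlier versions, array().cast("array<string>") works similarly.
    val filled = df.withColumn("tags", coalesce(col("tags"), typedLit(Seq.empty[String])))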

Re: Dataframes na fill with empty list

2017-04-11 Thread Didac Gil
It does support it, at least in 2.0.2 as I am running. Here is one example:

    val parsedLines = stream_of_logs
      .map(line => p.parseRecord_viaCSVParser(line))
      .join(appsCateg, $"Application" === $"name", "left_outer")
      .drop("id")
      .na.fill(0, Seq("numeric_field1", "numeric_field2"))
      .na.fill("",
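
A runnable variant of that pattern on an existing DataFrame df (the column names are placeholders; note that na.fill covers numeric and string columns, not array columns):

    // Sketch: fill nulls per column type.
    val cleaned = df
      .na.fill(0, Seq("numeric_field1", "numeric_field2")) // numeric columns -> 0
      .na.fill("", Seq("string_field1"))                   // string columns -> ""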

Re: [DataFrames] map function - 2.0

2016-12-15 Thread Michael Armbrust
Experimental in Spark really just means that we are not promising binary compatibility for those functions in the 2.x release line. For Datasets in particular, we want a few releases to make sure the APIs don't have any major gaps before removing the experimental tag. On Thu, Dec 15, 2016 at 1:17

RE: Dataframes

2016-02-11 Thread Prashant Verma
Hi Gaurav, You can try something like this:

    SparkConf conf = new SparkConf();
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    Class.forName("com.mysql.jdbc.Driver");
    String url = "url";
    Properties prop = new
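
For comparison, a minimal sketch of the same JDBC read in Scala (the URL, table name, and credentials are placeholders, not from the thread):

    import java.util.Properties
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("jdbc-example"))
    val sqlContext = new SQLContext(sc)

    val prop = new Properties()
    prop.setProperty("user", "username")     // placeholder
    prop.setProperty("password", "password") // placeholder

    // Loads the table as a DataFrame over JDBC.
    val df = sqlContext.read.jdbc("jdbc:mysql://host:3306/db", "my_table", prop)
    df.show()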

Re: Dataframes

2016-02-11 Thread Rishabh Wadhawan
Hi Gaurav, I am not sure what you are trying to do here, as you are naming two data frames with the same name, which would be a compilation error in Java. However, as far as I understand your question, you can do something like this; > SqlContext

Re: Dataframes

2016-02-11 Thread Ted Yu
bq. Whether sContext(SQlCOntext) will help to query in both the dataframes and will it decide on which dataframe to query for. Can you clarify what you were asking? The queries would be carried out on the respective DataFrames, as shown in your snippet. On Thu, Feb 11, 2016 at 8:47 AM, Gaurav
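
To illustrate Ted's point, a sketch (DataFrame and table names are hypothetical): each registered temp table maps to its own DataFrame, and the table name in the SQL statement decides which one a query runs against.

    df1.registerTempTable("table1")
    df2.registerTempTable("table2")
    val q1 = sqlContext.sql("SELECT * FROM table1") // carried out on df1
    val q2 = sqlContext.sql("SELECT * FROM table2") // carried out on df2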

Re: DataFrames initial jdbc loading - will it be utilizing a filter predicate?

2015-11-18 Thread Zhan Zhang
When you have the following query, 'account === "acct1" will be pushed down to generate a new query with "where account = acct1". Thanks. Zhan Zhang On Nov 18, 2015, at 11:36 AM, Eran Medan wrote: I understand that the following are equivalent
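
One way to confirm the pushdown (a sketch, not from the thread; the table and column names are hypothetical) is to inspect the physical plan:

    import org.apache.spark.sql.functions.col

    val df = sqlContext.read.jdbc("jdbc:mysql://host:3306/db", "accounts", new java.util.Properties())
    val filtered = df.filter(col("account") === "acct1")
    // explain(true) should report the pushed filter, e.g. PushedFilters: [EqualTo(account,acct1)]
    filtered.explain(true)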

Re: dataframes and numPartitions

2015-10-18 Thread Jorge Sánchez
Alex, if not, you can try using the functions coalesce(n) or repartition(n). As per the API, coalesce will not trigger a shuffle, but repartition will. Regards. 2015-10-16 0:52 GMT+01:00 Mohammed Guller : > You may find the spark.sql.shuffle.partitions property useful. The
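
A minimal sketch of the two calls on an existing DataFrame df:

    val narrowed = df.coalesce(10)      // merges partitions without a full shuffle
    val reshuffled = df.repartition(10) // full shuffle; can grow or shrink the partition count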

RE: dataframes and numPartitions

2015-10-15 Thread Mohammed Guller
You may find the spark.sql.shuffle.partitions property useful. The default value is 200. Mohammed From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com] Sent: Wednesday, October 14, 2015 8:14 PM To: user Subject: dataframes and numPartitions A lot of RDD methods take a numPartitions
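
For reference, a sketch of setting that property on a 1.x SQLContext (the value 50 is arbitrary):

    sqlContext.setConf("spark.sql.shuffle.partitions", "50")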

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Jerry Lam
I just read the article by ogirardot, but I don't agree. It is like saying the pandas DataFrame is the sole data structure for analyzing data in Python. Can a pandas DataFrame replace a NumPy array? The answer is simply no, from an efficiency perspective, for some computations. Unless there is a computer

Re: Dataframes - sole data structure for parallel computations?

2015-10-08 Thread Michael Armbrust
Don't worry, the ability to work with domain objects and lambda functions is not going to go away. However, we are looking at ways to leverage Tungsten's improved performance when processing structured data. More details can be found here: https://issues.apache.org/jira/browse/SPARK- On

Re: dataframes sql order by not total ordering

2015-07-21 Thread Carol McDonald
Thanks, that works a lot better ;)

    scala> val results = sqlContext.sql(
      select movies.title, movierates.maxr, movierates.minr, movierates.cntu
      from (SELECT ratings.product, max(ratings.rating) as maxr,
            min(ratings.rating) as minr, count(distinct user) as cntu
            FROM ratings group by ratings.product)

Re: dataframes sql order by not total ordering

2015-07-20 Thread Michael Armbrust
An ORDER BY needs to be on the outermost query; otherwise subsequent operations (such as the join) could reorder the tuples. On Mon, Jul 20, 2015 at 9:25 AM, Carol McDonald cmcdon...@maprtech.com wrote: the following query on the MovieLens dataset is sorting by the count of ratings for a
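
A minimal illustration of that rule (the schema and the join key are assumed from the thread's MovieLens example):

    // ORDER BY sits on the outermost query, after the join, so nothing can reorder it.
    val results = sqlContext.sql("""
      SELECT movies.title, movierates.maxr, movierates.minr, movierates.cntu
      FROM (SELECT ratings.product, max(ratings.rating) AS maxr,
                   min(ratings.rating) AS minr, count(distinct user) AS cntu
            FROM ratings GROUP BY ratings.product) movierates
      JOIN movies ON movierates.product = movies.movieId  -- join key name assumed
      ORDER BY movierates.cntu DESC""")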

Re: DataFrames for non-SQL computation?

2015-06-11 Thread Michael Armbrust
Yes, DataFrames are for much more than SQL and I would recommend using them wherever possible. It is much easier for us to do optimizations when we have more information about the schema of your data, and as such, most of our ongoing optimization effort will focus on making DataFrames faster.

Re: DataFrames coming in SparkR in Apache Spark 1.4.0

2015-06-03 Thread Emaasit
You can build Spark from the 1.4 release branch yourself: https://github.com/apache/spark/tree/branch-1.4 - Daniel Emaasit, Ph.D. Research Assistant Transportation Research Center (TRC) University of Nevada, Las Vegas Las Vegas, NV 89154-4015 Cell: 615-649-2489 www.danielemaasit.com --

Re: Dataframes Question

2015-04-19 Thread Ted Yu
That's right. On Sun, Apr 19, 2015 at 8:59 AM, Arun Patel arunp.bigd...@gmail.com wrote: Thanks Ted. So, whatever the operations I am performing now are DataFrames and not SchemaRDD? Is that right? Regards, Venkat On Sun, Apr 19, 2015 at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote: bq.

Re: Dataframes Question

2015-04-19 Thread Ted Yu
bq. SchemaRDD is not existing in 1.3? That's right. See this thread for more background: http://search-hadoop.com/m/JW1q5zQ1Xw/spark+DataFrame+schemarddsubj=renaming+SchemaRDD+gt+DataFrame On Sat, Apr 18, 2015 at 5:43 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote: I am no

Re: Dataframes Question

2015-04-19 Thread Arun Patel
Thanks Ted. So, whatever the operations I am performing now are DataFrames and not SchemaRDD? Is that right? Regards, Venkat On Sun, Apr 19, 2015 at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote: bq. SchemaRDD is not existing in 1.3? That's right. See this thread for more background:

Re: Dataframes Question

2015-04-18 Thread Abhishek R. Singh
I am no expert myself, but from what I understand, DataFrame is grandfathering SchemaRDD. This was done for API stability as Spark SQL matured out of alpha as part of the 1.3.0 release. It is forward-looking and brings (DataFrame-like) syntax that was not available with the older SchemaRDD. On
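
For illustration (not from the thread; the table name is hypothetical): in 1.3+, the same call that used to return a SchemaRDD returns a DataFrame.

    // Spark 1.3+: sql() returns a DataFrame, the successor of SchemaRDD.
    val df: org.apache.spark.sql.DataFrame = sqlContext.sql("SELECT name FROM people")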