Re: Is Spark 2.0 master node compatible with Spark 1.5 worker node?

2016-09-26 Thread Rex X
What type of cluster are you running on? YARN? And what distribution? On Sun, Sep 4, 2016 at 8:48 PM -0700, "Holden Karau" <hol...@pigscanfly.ca> wrote: You really

Is Spark 2.0 master node compatible with Spark 1.5 worker node?

2016-09-04 Thread Rex X
I wish to use the pivot-table feature of DataFrames, which has been available since Spark 1.6, but the current cluster runs version 1.5. Can we install Spark 2.0 on the master node to work around this? Thanks!
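
A minimal sketch of the pivot feature in question, using the Spark 2.0 DataFrame reader and the ID/City/Price columns from the data.csv threads below; the file path is hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pivot-sketch").getOrCreate()

    # data.csv with columns ID,City,Zip,Price,Rating (assumed layout)
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # groupBy(...).pivot(...) is the DataFrame feature added in Spark 1.6
    df.groupBy("ID").pivot("City").sum("Price").show()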

Re: How to make new composite columns by combining rows in the same group?

2016-08-26 Thread Rex X
The data.csv needs to be corrected: 1. Given the following CSV file

$cat data.csv
ID,City,Zip,Price,Rating
1,A,95123,100,1
1,B,95124,102,2
1,A,95126,100,2
2,B,95123,200,1
2,B,95124,201,2
2,C,95124,203,1
3,A,95126,300,2
3,C,95124,280,1
4,C,95124,400,2

On Fri, Aug 26, 2016 at 4:54 AM, Rex X <d

How to make new composite columns by combining rows in the same group?

2016-08-26 Thread Rex X
1. Given the following CSV file

$cat data.csv
ID,City,Zip,Price,Rating
1,A,95123,100,0
1,B,95124,102,1
1,A,95126,100,1
2,B,95123,200,0
2,B,95124,201,1
2,C,95124,203,0
3,A,95126,300,1
3,C,95124,280,0
4,C,95124,400,1

We want to group by ID, and make new composite columns of Price and Rating based on the
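
The preview cuts off before the desired output is described; one plausible reading, sketched below, pivots on City so each ID gains per-city Price and Rating columns (the aggregation choice is an assumption):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # One row per ID, with per-city average Price and Rating columns,
    # e.g. A_avg(Price), A_avg(Rating), B_avg(Price), ...
    composite = df.groupBy("ID").pivot("City").agg(F.avg("Price"), F.avg("Rating"))
    composite.show()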

Re: How to do this pairing in Spark?

2016-08-26 Thread Rex X
:46 AM, ayan guha <guha.a...@gmail.com> wrote: > Why should 3 and 9 be deleted? 3 can be paired with 1 and 9 can be paired with 8. > On 26 Aug 2016 11:00, "Rex X" <dnsr...@gmail.com> wrote: >> 1. Given the following CSV file >> $cat data.

How to do this pairing in Spark?

2016-08-25 Thread Rex X
1. Given the following CSV file

> $cat data.csv
>
> ID,City,Zip,Flag
> 1,A,95126,0
> 2,A,95126,1
> 3,A,95126,1
> 4,B,95124,0
> 5,B,95124,1
> 6,C,95124,0
> 7,C,95127,1
> 8,C,95127,0
> 9,C,95127,1

(a) where "ID" above is a primary key (unique), (b) for
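
Read together with the reply above, one hedged interpretation pairs each Flag=0 row with a Flag=1 row in the same City and Zip via a self-join; a sketch under that assumption:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    zeros = df.filter("Flag = 0").selectExpr("ID as ID_0", "City", "Zip")
    ones = df.filter("Flag = 1").selectExpr("ID as ID_1", "City", "Zip")

    # Candidate pairs: one Flag=0 row and one Flag=1 row sharing City and Zip.
    # IDs that never appear in the output are the deletion candidates the
    # reply argues about.
    pairs = zeros.join(ones, ["City", "Zip"]).select("ID_0", "ID_1", "City", "Zip")
    pairs.show()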

What's the best way to find the Nearest Neighbor row of a matrix with 10 billion rows x 300 columns?

2016-05-17 Thread Rex X
Each row of the given matrix is a Vector[Double]. I want to find the nearest-neighbor row of each row using cosine similarity. The problem here is the complexity: O(10^20) pairwise comparisons. We need to do *blocking*, and do the row-wise comparison within each block. Any tips for best practice? In Spark, we have
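
The thread is truncated before any answer; one common blocking scheme (an LSH-style approximation, not something confirmed by the thread) hashes each row against random hyperplanes and compares rows only within a bucket:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="blocked-cosine-sketch")

    dim, n_planes = 300, 16
    planes = sc.broadcast(np.random.RandomState(42).randn(n_planes, dim))

    def bucket(v):
        # Sign pattern against random hyperplanes: similar rows tend to
        # share a signature, so they land in the same block.
        bits = (planes.value.dot(v) > 0).astype(int)
        return int("".join(map(str, bits)), 2)

    def nearest_within(vectors):
        # Brute-force cosine similarity, but only inside one block.
        vecs = list(vectors)
        norms = [np.linalg.norm(v) for v in vecs]
        results = []
        for i, a in enumerate(vecs):
            best_j, best_sim = -1, -1.0
            for j, b in enumerate(vecs):
                if i != j:
                    sim = float(a.dot(b)) / (norms[i] * norms[j])
                    if sim > best_sim:
                        best_j, best_sim = j, sim
            results.append((best_j, best_sim))
        return results

    rows = sc.parallelize([np.random.randn(dim) for _ in range(1000)])  # stand-in rows
    nearest = rows.map(lambda v: (bucket(v), v)).groupByKey().flatMapValues(nearest_within)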

How to select from table name using IF(condition, tableA, tableB)?

2016-03-15 Thread Rex X
I want to run a query that chooses between two tables based on a logical condition: select * from if(A>B, tableA, tableB) But the "if" function in Hive cannot be used inside FROM as above. Any idea how?
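
Since a FROM clause cannot be computed inside a single HiveQL/Spark SQL statement, one workaround is to evaluate the condition first and interpolate the table name; a sketch, with some_table and its A/B columns as hypothetical names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Evaluate the condition up front, then pick the table name in the driver.
    row = spark.sql("SELECT A, B FROM some_table LIMIT 1").first()
    chosen = "tableA" if row.A > row.B else "tableB"
    result = spark.sql("SELECT * FROM {}".format(chosen))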

What is the best way to JOIN two 10TB csv files and three 100kb files on Spark?

2016-02-05 Thread Rex X
Dear all, The new DataFrame of Spark is extremely fast. But our cluster has limited RAM (~500 GB). What is the best way to do such a big table join? Any sample code is greatly welcome! Best, Rex
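
A hedged sketch of the usual approach: broadcast the 100 KB tables so only the two big tables meet in a shuffle-based sort-merge join (the paths and the join key "id" are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    big_a = spark.read.csv("/data/big_a/*.csv", header=True)
    big_b = spark.read.csv("/data/big_b/*.csv", header=True)
    small = spark.read.csv("/data/small.csv", header=True)

    # The 100 KB tables fit in memory on every executor, so broadcast them;
    # the two 10 TB tables then meet in a sort-merge join, which shuffles
    # and spills to disk rather than needing the data to fit in RAM.
    joined = big_a.join(big_b, "id").join(broadcast(small), "id")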

Can we do dataframe.query like Pandas dataframe in Spark?

2015-09-17 Thread Rex X
With a Pandas dataframe, we can do a query:

>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.query('a > b')

This SQL-select-like query
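
For comparison, a sketch of Spark's closest equivalents (the reply below shows the same where() form):

    import numpy as np
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    pdf = pd.DataFrame(np.random.randn(10, 2), columns=list("ab"))
    df = spark.createDataFrame(pdf)

    # Spark's equivalents of pandas df.query('a > b'):
    df.filter("a > b").show()     # SQL-expression string, closest to query()
    df.where(df.a > df.b).show()  # Column form; where() is an alias of filter()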

Re: Can we do dataframe.query like Pandas dataframe in Spark?

2015-09-17 Thread Rex X
> df.where("a > b").show()
>
> +------------------+-------------------+
> |                 a|                  b|
> +------------------+-------------------+
> |0.6697439215581628|0.23420961030968923|
> |0.9248996796756386| 0.4146647917936366|
> +------------------+-------------------+

Re: What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Rex X
ornfra...@gmail.com> wrote: >> I fear you have to do the plumbing all yourself. This is the same for all commercial and non-commercial libraries/analytics packages. It often also depends on the functional requirements and on how you distribute. >> Le sam. 1

What is the best way to migrate existing scikit-learn code to PySpark?

2015-09-12 Thread Rex X
Hi everyone, What is the best way to migrate existing scikit-learn code to a PySpark cluster? Then we can bring together the full power of both scikit-learn and Spark to do scalable machine learning. (I know we have MLlib. But the existing code base is big, and some functions are not fully
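
One common migration pattern, sketched here as an assumption rather than the thread's answer, keeps the scikit-learn code unchanged and parallelizes only the scoring step via a broadcast model (scikit-learn must be installed on the workers):

    import numpy as np
    from pyspark.sql import SparkSession
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Fit with unmodified scikit-learn on the driver ...
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=50).fit(X, y)
    model_bc = sc.broadcast(model)   # ... then ship the fitted model to workers

    def score(rows):
        # Score one partition at a time with the original sklearn model.
        batch = np.array(list(rows))
        return iter(model_bc.value.predict(batch)) if len(batch) else iter([])

    preds = sc.parallelize(X.tolist(), 4).mapPartitions(score).collect()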

How to concatenate two csv files into one RDD?

2015-06-26 Thread Rex X
With Python Pandas, it is easy to concatenate dataframes by combining pandas.concat (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) and pandas.read_csv: pd.concat([pd.read_csv(os.path.join(Path_to_csv_files, f)) for f in csvfiles]) where csvfiles is the list of
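
In Spark the pandas.concat pattern collapses to a single call, since textFile() accepts a comma-separated list of paths (or a glob); the file paths below are hypothetical:

    from pyspark import SparkContext

    sc = SparkContext(appName="concat-csv")

    csvfiles = ["/data/a.csv", "/data/b.csv"]   # hypothetical paths
    rdd = sc.textFile(",".join(csvfiles))       # one RDD over all files

    # Equivalent, file by file:
    combined = sc.union([sc.textFile(f) for f in csvfiles])

    # If every file repeats a header row, drop those lines:
    header = rdd.first()
    data = rdd.filter(lambda line: line != header)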

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Rex X
representation. -sujit On Tue, Jun 16, 2015 at 1:17 PM, Rex X dnsr...@gmail.com wrote: Is it necessary to convert categorical data into integers? Any tips would be greatly appreciated! -Rex On Sun, Jun 14, 2015 at 10:05 AM, Rex X dnsr...@gmail.com wrote: For clustering analysis, we need

Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-16 Thread Rex X
Is it necessary to convert categorical data into integers? Any tips would be greatly appreciated! -Rex On Sun, Jun 14, 2015 at 10:05 AM, Rex X dnsr...@gmail.com wrote: For clustering analysis, we need a way to measure distances. When the data contains different levels of measurement

What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

2015-06-14 Thread Rex X
For clustering analysis, we need a way to measure distances, but the data contains different levels of measurement: *binary / categorical (nominal), counts (ordinal), and ratio (scale)*. To be concrete, for example, we are working with attributes of *city, zip, satisfaction_level, price*. In the
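
The thread is truncated, but a pragmatic sketch (not a true mixed-measurement metric such as Gower distance) one-hot encodes the nominal attributes and rescales the ratio ones before KMeans; the sample rows are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical rows with the attributes named in the question.
    df = spark.createDataFrame(
        [("LA", "95123", 3, 100.0), ("SF", "95126", 4, 300.0), ("LA", "95124", 2, 201.0)],
        ["city", "zip", "satisfaction_level", "price"])

    stages = []
    for c in ["city", "zip"]:    # nominal -> index -> one-hot
        stages += [StringIndexer(inputCol=c, outputCol=c + "_idx"),
                   OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_vec")]
    stages += [VectorAssembler(inputCols=["city_vec", "zip_vec",
                                          "satisfaction_level", "price"],
                               outputCol="raw"),
               StandardScaler(inputCol="raw", outputCol="features"),
               KMeans(k=2, featuresCol="features")]

    clustered = Pipeline(stages=stages).fit(df).transform(df)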

Re: [Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-13 Thread Rex X
) and once you have that as a DataFrame, SQL can do the rest. https://spark.apache.org/docs/latest/sql-programming-guide.html -Don On Fri, Jun 12, 2015 at 8:46 PM, Rex X dnsr...@gmail.com wrote: Hi, I want to use Spark to select N columns, top M rows of all csv files under a folder
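
Following the reply's pointer, a sketch of that DataFrame-plus-SQL route; the folder path, column names, and LIMIT value are assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.option("sep", "\t").option("header", "true").csv("/path/to/folder")
    df.createOrReplaceTempView("records")

    # Once the folder is one DataFrame, SQL handles the column
    # selection and the row limit:
    top = spark.sql("SELECT id, name, city FROM records ORDER BY id LIMIT 100")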

How to use Spark for a map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-12 Thread Rex X
To be concrete, say we have a folder with thousands of tab-delimited csv files with the following attribute format (each csv file is about 10GB):

id    name    address    city    ...
1     Matt    add1       LA      ...
2     Will    add2       LA      ...
3     Lucy    add3       SF      ...
...

And we have a
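
If "top M rows" is meant per source file, a window keyed on input_file_name() does it; the column list, M = 100, and ordering by id are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name, row_number
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read.option("sep", "\t").option("header", "true")
              .csv("/path/to/folder")
              .withColumn("src", input_file_name()))   # remember the source file

    # Top M = 100 rows per file, ordered by id, keeping N = 3 columns.
    w = Window.partitionBy("src").orderBy("id")
    top = (df.withColumn("rn", row_number().over(w))
             .filter("rn <= 100")
             .select("id", "name", "city"))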

[Spark] What is the most efficient way to do such a join and column manipulation?

2015-06-12 Thread Rex X
Hi, I want to use Spark to select N columns, top M rows of all csv files under a folder. To be concrete, say we have a folder with thousands of tab-delimited csv files with the following attribute format (each csv file is about 10GB):

id    name    address    city    ...
1     Matt    add1