Re: making dataframe for different types using spark-csv

2015-07-02 Thread Kohler, Curt E (ELS-STL)
You should be able to do something like this (assuming an input file formatted 
as:  String, IntVal, LongVal)


import org.apache.spark.sql.types._

val recSchema = StructType(List(StructField("strVal", StringType, false),
                                StructField("intVal", IntegerType, false),
                                StructField("longVal", LongType, false)))

val filePath = "some path to your dataset"

val df1 = sqlContext.load("com.databricks.spark.csv", recSchema,
  Map("path" -> filePath, "header" -> "false", "delimiter" -> ",", "mode" -> "FAILFAST"))
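
As a quick sanity check (not part of the original reply), printing the schema and a few
rows should confirm the declared types took effect; printSchema and show are standard
DataFrame methods in Spark 1.3+:

df1.printSchema()   // expect strVal: string, intVal: integer, longVal: long
df1.show(5)         // peek at the first few parsed rows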

From: Hafiz Mujadid hafizmujadi...@gmail.com
Date: Wednesday, July 1, 2015 at 10:59 PM
To: Mohammed Guller moham...@glassbeam.com
Cc: Krishna Sankar ksanka...@gmail.com, user@spark.apache.org
Subject: Re: making dataframe for different types using spark-csv

hi Mohammed Guller!

How can I specify schema in load method?



On Thu, Jul 2, 2015 at 6:43 AM, Mohammed Guller moham...@glassbeam.com wrote:
Another option is to provide the schema to the load method. One variant of the 
sqlContext.load takes a schema as a input parameter. You can define the schema 
programmatically as shown here:

https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
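
For reference, the programmatic pattern from that guide looks roughly like this
(Spark 1.x; the column names here are purely illustrative):

import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Build a schema from a space-separated list of column names; swap StringType
// for IntegerType, DoubleType, etc. on a per-column basis as needed.
val schemaString = "name age city"
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))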

Mohammed

From: Krishna Sankar [mailto:ksanka...@gmail.com]
Sent: Wednesday, July 1, 2015 3:09 PM
To: Hafiz Mujadid
Cc: user@spark.apache.org
Subject: Re: making dataframe for different types using spark-csv

·  Use .cast(...).alias(...) after the DataFrame is read (rough sketch below).
·  Use sql.functions.udf for any domain-specific conversions.
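
For example, a minimal sketch of that approach (not from the original message; the file,
column, and UDF names are made up, and a Spark 1.4-era API is assumed):

import org.apache.spark.sql.functions._

// Load everything as strings, then cast the columns that should be numeric.
val raw = sqlContext.load("com.databricks.spark.csv",
  Map("path" -> "data.csv", "header" -> "true"))

val typed = raw.select(
  col("strVal"),
  col("intVal").cast("int").alias("intVal"),
  col("longVal").cast("long").alias("longVal"))

// A UDF for a domain-specific conversion, e.g. a hypothetical Y/N column to boolean.
val toBool = udf((s: String) => s == "Y")
val withFlag = typed.withColumn("flag", toBool(col("flagVal")))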
Cheers
k/

On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid hafizmujadi...@gmail.com wrote:
Hi experts!


I am using spark-csv to load CSV data into a DataFrame. By default it makes
the type of every column a string. Is there some way to get a DataFrame with
the actual types, like int, double, etc.?


Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/making-dataframe-for-different-types-using-spark-csv-tp23570.html





--
Regards: HAFIZ MUJADID


Re: Percentile example

2015-02-17 Thread Kohler, Curt E (ELS-STL)


The best approach I've found to calculate percentiles in Spark is to leverage
Spark SQL. If you use the Hive Query Language support, you can use the UDAFs
for percentiles (available as of Spark 1.2).

Something like this (Note: syntax not guaranteed to run but should give you the 
gist of what you need to do):


import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.sql.hive.api.java.JavaHiveContext;

JavaSparkContext sc = new JavaSparkContext(sparkConf);

JavaHiveContext hsc = new JavaHiveContext(sc);

// Get your data into a SchemaRDD and register the table

// Query it
String hql = "SELECT FIELD1, FIELD2, percentile(FIELD3, 0.05) AS ptile5 "
           + "FROM TABLE-NAME GROUP BY FIELD1, FIELD2";

JavaSchemaRDD result = hsc.hql(hql);

List<Row> grp = result.collect();

for (Row row : grp) {
    for (int z = 2; z < row.length(); z++) {
        // Do something with the results (columns 0 and 1 are the group-by keys)
    }
}

Curt


From: SiMaYunRui myl...@hotmail.com
Date: Sunday, February 15, 2015 at 10:37 AM
To: user@spark.apache.org
Subject: Percentile example

hello,

I am a newbie to Spark and am trying to figure out how to compute percentiles over a
big data set. I googled this topic but did not find any very useful code examples or
explanations. It seems that I can use the sortByKey transformation to get my data set
in order, but I am not sure how to then get the value of, for example, the 66th
percentile.

Should I use take() to pick out the value at the 66th percentile? I don't believe any
single machine can load my data set in memory, so there must be more efficient
approaches.

Can anyone shed some light on this problem?

Regards



Re: Percentile Calculation

2015-01-28 Thread Kohler, Curt E (ELS-STL)
When I looked at this last fall, the only way that seemed to be available was
to transform my data into SchemaRDDs, register them as tables, and then use the
Hive processor to calculate them with its built-in percentile UDFs that were
added in 1.2.
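
A rough Scala sketch of that approach (not from the original message; the table and
column names are made up, Spark 1.2-era API assumed):

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)

// Assume `events` is a SchemaRDD with `category` and `latency` columns.
events.registerTempTable("events")

// percentile() is a Hive UDAF made available through Spark SQL's Hive support.
val ptiles = hiveCtx.sql(
  "SELECT category, percentile(latency, array(0.5, 0.66, 0.95)) AS ptiles " +
  "FROM events GROUP BY category")

ptiles.collect().foreach(println)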


Curt


From: kundan kumar iitr.kun...@gmail.com
Sent: Wednesday, January 28, 2015 8:13 AM
To: user@spark.apache.org
Subject: Percentile Calculation

Is there any inbuilt function for calculating percentile over a dataset ?

I want to calculate the percentiles for each column in my data.


Regards,
Kundan


Re: Spark and S3 server side encryption

2015-01-28 Thread Kohler, Curt E (ELS-STL)
So, following up on your suggestion, I'm still having some problems getting the
configuration changes recognized when my job runs.


I've added jets3t.properties to the root of my application jar file that I
submit to Spark (via spark-submit).

I've verified that my jets3t.properties is at the root of my application
jar by executing jar tf app.jar.

I submit my job to the cluster with the following command.

nohup ./bin/spark-submit --verbose --jars lib/app.jar --master \
spark://master-amazonaws.com:7077 --class com.elsevier.spark.SparkSync \
lib/app.jar > out.log &



In my mainline of app.jar, I also added the following code:


log.info(System.getProperty("java.class.path"));
InputStream in =
    SparkSync.class.getClassLoader().getResourceAsStream("jets3t.properties");
log.info(getStringFromInputStream(in));

And I can see that the jets3t.properties I provided is found because it
outputs:

s3service.server-side-encryption=AES256

It's almost as if the hadoop/jets3t piece has already been initialized and
is ignoring my jets3t.properties.

I can get this all working inside of Eclipse by including the folder
containing my jets3t.properties.  But, I can't get things working when
trying to submit this to a spark stand-alone cluster.

Any insights would be appreciated.


From: Thomas Demoor thomas.dem...@amplidata.com
Sent: Tuesday, January 27, 2015 4:41 AM
To: Kohler, Curt E (ELS-STL)
Cc: user@spark.apache.org
Subject: Re: Spark and S3 server side encryption

Spark uses the Hadoop filesystems.

I assume you are trying to use s3n://, which, under the hood, uses the 3rd-party
jets3t library. It is configured through the jets3t.properties file (google
"hadoop s3n jets3t"), which you should put on Spark's classpath. The setting you
are looking for is s3service.server-side-encryption.

The latest version of Hadoop (2.6) introduces a new and improved s3a://
filesystem, which has the official SDK from Amazon under the hood.
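
For illustration, a minimal jets3t.properties along these lines (the property name and
value come from this thread; putting it on the classpath via the driver/executor
extra-classpath settings is one common approach, not a guarantee for every deployment):

# jets3t.properties -- must be visible on the classpath of the JVMs that talk to S3,
# e.g. in a directory referenced by --driver-class-path and spark.executor.extraClassPath,
# rather than packed inside the application jar.
s3service.server-side-encryption=AES256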


On Mon, Jan 26, 2015 at 10:01 PM, curtkohler c.koh...@elsevier.com wrote:
We are trying to create a Spark job that writes out a file to S3 that
leverages S3's server-side encryption for sensitive data. Typically this is
accomplished by setting the appropriate header on the put request, but it
isn't clear whether this capability is exposed in the Spark/Hadoop APIs.
Does anyone have any suggestions?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-S3-server-side-encryption-tp21377.html
