RE: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Sun, Rui
Spark does not support computing a covariance matrix now, but there is a PR for it.
Maybe you can try it: https://issues.apache.org/jira/browse/SPARK-11057
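
For what it's worth, the original question also mentioned MLlib, and MLlib's RowMatrix exposes a computeCovariance() method. A minimal Scala sketch (the HDFS path and the assumption of a headerless, all-numeric CSV are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object CovarianceExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CovarianceExample"))

    // Hypothetical headerless, comma-separated file of purely numeric columns on HDFS.
    val rows = sc.textFile("hdfs:///data/features.csv")
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))

    // computeCovariance() returns a local nCols x nCols matrix on the driver,
    // so with 5k-20k columns the result itself is already large.
    val cov = new RowMatrix(rows).computeCovariance()
    println(s"covariance matrix: ${cov.numRows} x ${cov.numCols}")

    sc.stop()
  }
}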


From: zhangjp [mailto:592426...@qq.com]
Sent: Tuesday, December 29, 2015 3:21 PM
To: Felix Cheung; Andy Davidson; Yanbo Liang
Cc: user
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance


Now I have a huge number of columns, about 5k-20k. If I want to calculate a covariance
matrix, which is the best or most common method?

-- Original Message --
From: "Felix Cheung" <felixcheun...@hotmail.com>;
Sent: Tuesday, December 29, 2015, 12:45 PM
To: "Andy Davidson" <a...@santacruzintegration.com>; "zhangjp" <592426...@qq.com>; "Yanbo Liang" <yblia...@gmail.com>;
Cc: "user" <user@spark.apache.org>;
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

Make sure you add the spark-csv package, as in the example here, so that the source
parameter in R's read.df will work:


https://spark.apache.org/docs/latest/sparkr.html#from-data-sources

_
From: Andy Davidson 
mailto:a...@santacruzintegration.com>>
Sent: Monday, December 28, 2015 10:24 AM
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then 
calculate covariance
To: zhangjp <592426...@qq.com>, Yanbo Liang 
mailto:yblia...@gmail.com>>
Cc: user mailto:user@spark.apache.org>>

Hi Yanbo

I use spark-csv to load my data set. I work with both Java and Python. I would
recommend you print the first couple of rows and also print the schema to make
sure your data is loaded as you expect. You might find the following code
example helpful. You may need to programmatically set the schema depending on
what your data looks like.



import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LoadTidyDataFrame {

    static DataFrame fromCSV(SQLContext sqlContext, String file) {
        // Relies on the spark-csv package; infer the schema and treat the first row as a header.
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load(file);
        return df;
    }
}



From: Yanbo Liang < yblia...@gmail.com>
Date: Monday, December 28, 2015 at 2:30 AM
To: zhangjp < 592426...@qq.com>
Cc: "user @spark" < user@spark.apache.org>
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then 
calculate covariance

Load csv file:
df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv", 
header = "true")
Calculate covariance:
cov <- cov(df, "col1", "col2")

Cheers
Yanbo


2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>:
hi all,
I want to use sparkR or spark MLlib to load a csv file on hdfs and then calculate
covariance. How do I do it?
thks.




Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread zhangjp
Now I have a huge number of columns, about 5k-20k. If I want to calculate a covariance
matrix, which is the best or most common method?

 


Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-28 Thread Hyukjin Kwon
Hi Divya,

Are you using or have you tried Spark CSV datasource
https://github.com/databricks/spark-csv ?

Thanks!


2015-12-28 18:42 GMT+09:00 Divya Gehlot :

> Hi,
> I have an input data set which is a CSV file with date columns.
> My output will also be a CSV file, and I will be using this output CSV file
> for Hive table creation.
> I have a few queries:
> 1. I tried using a custom schema with Timestamp, but it is returning an empty
> result set when querying the dataframes.
> 2. Can I use the String datatype in Spark for the date column and, while creating the
> table, define it as a date type? Partitioning of my Hive table will be on the
> date column.
>
> Would really appreciate it if you could share some sample code for timestamps in a
> Dataframe where the same can be used while creating the Hive table.
>
>
>
> Thanks,
> Divya
>


Re: what is the difference between coalesce() and repartition()? Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Hyukjin Kwon
Hi Andy,

This link explains the difference well.

https://bzhangusc.wordpress.com/2015/08/11/repartition-vs-coalesce/

Simply, the difference is whether it shuffles ("repartitions") the data or not.

Actually, coalesce() with shuffling performs exactly like repartition().
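
A minimal spark-shell sketch of the difference (assuming an existing SparkContext sc; the numbers are arbitrary):

// coalesce() without shuffle only merges existing partitions (narrow dependency),
// so it can only reduce the partition count.
val rdd = sc.parallelize(1 to 1000, 8)
val merged = rdd.coalesce(2)

// repartition(n) is implemented as coalesce(n, shuffle = true): it performs a full
// shuffle and can either increase or decrease the number of partitions.
val reshuffled = rdd.repartition(16)
val sameThing = rdd.coalesce(16, shuffle = true)

println(merged.partitions.length)      // 2
println(reshuffled.partitions.length)  // 16
println(sameThing.partitions.length)   // 16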
On 29 Dec 2015 08:10, "Andy Davidson"  wrote:

> Hi Michael
>
> I’ll try 1.6 and report back.
>
> The java doc does not say much about coalesce() or repartition(). When I
> use reparation() just before I save my output everything runs as expected
>
> I though coalesce() is an optimized version of reparation() and should be
> used when ever we know we are reducing the number of partitions.
>
> Kind regards
>
> Andy
>
> From: Michael Armbrust 
> Date: Monday, December 28, 2015 at 2:41 PM
> To: Andrew Davidson 
> Cc: "user @spark" 
> Subject: Re: trouble understanding data frame memory usage
> "java.io.IOException: Unable to acquire memory"
>
> Unfortunately in 1.5 we didn't force operators to spill when ran out of
> memory so there is not a lot you can do.  It would be awesome if you could
> test with 1.6 and see if things are any better?
>
> On Mon, Dec 28, 2015 at 2:25 PM, Andy Davidson <
> a...@santacruzintegration.com> wrote:
>
>> I am using spark 1.5.1. I am running into some memory problems with a
>> java unit test. Yes I could fix it by setting –Xmx (its set to 1024M) how
>> ever I want to better understand what is going on so I can write better
>> code in the future. The test runs on a Mac, master="Local[2]"
>>
>> I have a java unit test that starts by reading a 672K ascii file. I my
>> output data file is 152K. Its seems strange that such a small amount of
>> data would cause an out of memory exception. I am running a pretty standard
>> machine learning process
>>
>>
>>1. Load data
>>2. create a ML pipeline
>>3. transform the data
>>4. Train a model
>>5. Make predictions
>>6. Join the predictions back to my original data set
>>7. Coalesce(1), I only have a small amount of data and want to save
>>it in a single file
>>8. Save final results back to disk
>>
>>
>> Step 7: I am unable to call Coalesce() “java.io.IOException: Unable to
>> acquire memory”
>>
>> To try and figure out what is going I put log messages in to count the
>> number of partitions
>>
>> Turns out I have 20 input files, each one winds up in a separate
>> partition. Okay so after loading I call coalesce(1) and check to make sure
>> I only have a single partition.
>>
>> The total number of observations is 1998.
>>
>> After calling step 7 I count the number of partitions and discovered I
>> have 224 partitions!. Surprising given I called Coalesce(1) before I
>> did anything with the data. My data set should easily fit in memory. When I
>> save them to disk I get 202 files created with 162 of them being empty!
>>
>> In general I am not explicitly using cache.
>>
>> Some of the data frames get registered as tables. I find it easier to use
>> sql.
>>
>> Some of the data frames get converted back to RDDs. I find it easier to
>> create RDD this way
>>
>> I put calls to unpersist(true). In several places
>>
>>private void memoryCheck(String name) {
>>
>> Runtime rt = Runtime.getRuntime();
>>
>> logger.warn("name: {} \t\ttotalMemory: {} \tfreeMemory: {}
>> df.size: {}",
>>
>> name,
>>
>> String.format("%,d", rt.totalMemory()),
>>
>> String.format("%,d", rt.freeMemory()));
>>
>> }
>>
>> Any idea how I can get a better understanding of what is going on? My
>> goal is to learn to write better spark code.
>>
>> Kind regards
>>
>> Andy
>>
>> Memory usages at various points in my unit test
>>
>> name: rawInput totalMemory:   447,741,952 freeMemory:   233,203,184
>>
>> name: naiveBayesModel totalMemory:   509,083,648 freeMemory:
>> 403,504,128
>>
>> name: lpRDD totalMemory:   509,083,648 freeMemory:   402,288,104
>>
>> name: results totalMemory:   509,083,648 freeMemory:   368,011,008
>>
>>
>>DataFrame exploreDF = results.select(results.col("id"),
>>
>> results.col("label"),
>>
>> results.col("binomialLabel"),
>>
>>
>> results.col("labelIndex"),
>>
>>
>> results.col("prediction"),
>>
>>
>> results.col("words"));
>>
>> exploreDF.show(10);
>>
>>
>>
>> Yes I realize its strange to switch styles how ever this should not cause
>> memory problems
>>
>>
>> final String exploreTable = "exploreTable";
>>
>> exploreDF.registerTempTable(exploreTable);
>>
>> String fmt = "SELECT * FROM %s where binomialLabel = 'signal'";
>>
>> String stmt = String.format(fmt, exploreTable);
>>
>>
>> DataFrame subsetToSave = sqlContext.sql(stmt);// .show(100);
>>
>>
>> name: subsetToSave totalMemory: 1,747,451,904 freeMemory: 1,049,447,144
>>
>>
>> exploreDF.unpersist(true

Re: map spark.driver.appUIAddress IP to different IP

2015-12-28 Thread SparkUser

Wouldn't Amazon Elastic IP do this for you?

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-eip.html

On 12/28/2015 10:58 PM, Divya Gehlot wrote:


Hi,

I have HDP2.3.2 cluster installed in Amazon EC2.

I want to update the IP adress of spark.driver.appUIAddress,which is 
currently mapped to private IP of EC2.


Searched in spark config in ambari,could find 
spark.driver.appUIAddress property.


Because of this private IP mapping,the spark webUI page is not getting 
displayed


Would really appreciate the help.

Thanks,

Divya




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



map spark.driver.appUIAddress IP to different IP

2015-12-28 Thread Divya Gehlot
Hi,

I have HDP2.3.2 cluster installed in Amazon EC2.

I want to update the IP address of spark.driver.appUIAddress, which is
currently mapped to the private IP of EC2.

I searched the Spark config in Ambari and could find the
spark.driver.appUIAddress property.

Because of this private IP mapping, the Spark web UI page is not getting
displayed.

Would really appreciate the help.

Thanks,

Divya


Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Felix Cheung
Make sure you add the spark-csv package, as in the example here, so that the source
parameter in R's read.df will work:

https://spark.apache.org/docs/latest/sparkr.html#from-data-sources



Re: Spark submit does automatically upload the jar to cluster?

2015-12-28 Thread jiml
That's funny, I didn't delete that answer!

I think I have two accounts crossing; here was the answer:

I don't know if this is going to help, but I agree that some of the docs
would lead one to believe that the Spark driver or master is going to
spread your jars around for you. But there are other docs that seem to
contradict this, especially related to EC2 clusters.

I wrote a Stack Overflow answer dealing with a similar situation, see if it
helps:

http://stackoverflow.com/questions/23687081/spark-workers-unable-to-find-jar-on-ec2-cluster/34502774#34502774

Pay attention to this section about the spark-submit docs:

I must admit, as a limitation on this, it confuses me in the Spark docs that
for spark.executor.extraClassPath it says:

Users typically should not need to set this option

I assume they mean most people will get the classpath out through a driver
config option. I know most of the docs for spark-submit make it sound like
the script handles moving your code around the cluster, but I think it only
moves the classpath around for you. For example, this line from Launching
Applications with spark-submit explicitly says you have to move the jars
yourself or make them "globally available":

application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your cluster,
for instance, an hdfs:// path or a file:// path that is present on all
nodes.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-does-automatically-upload-the-jar-to-cluster-tp25762p25826.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-28 Thread jiml
Hi, a couple-three things. First, is this a Gradle project? SBT? Regardless
of the answer, convince yourself that you are getting this error from the
command line before doing anything else. Eclipse is awesome and it's also
really glitchy, I have seen too many times recently where something funky is
happening in Eclipse but I can go to the shell and "gradle build" and
"gradle run" things just fine.

Getting that out of the way, and I don't know yet how generally applicable
this idea is, get rid of ALL hostnames and try with just IP addresses. I
posted the results of some research I did this morning on SO:

http://stackoverflow.com/questions/28453835/apache-sparck-error-could-not-connect-to-akka-tcp-sparkmaster/34499020#34499020

Note that what I focus on is getting all spurious config out of the way.
Comment out all configs in spark-defaults.conf and spark-env.sh that refer
to IP or Master config, and do only this: on the master, in spark-env.sh,
set SPARK_MASTER_IP to the IP address, not the hostname. Then use IP
addresses in your call to SparkContext. See what happens.

I know what you are seeing is two different bits of code working differently
but I would bet it's an underlying Spark config issue. The important part is
the master log which clearly identifies a network problem. As noted in my SO
post, there's a bug out there that leads me to always use IP addresses but I
am not sure how widely applicable that answer is :)

If that doesn't work, please post what the difference is between "WordCount
MapReduce job" and "Spark WordCount" -- that's not clear to me. Post your
SparkConf and SparkContext calls.

JimL


I'm new to Spark. Before I describe the problem, I'd like to let you know
the role of the machines that make up the cluster and the purpose of my
work. By reading and following the instructions and tutorials, I successfully
built up a cluster with 7 CentOS-6.5 machines. I installed Hadoop 2.7.1,
Spark 1.5.1, Scala 2.10.4 and ZooKeeper 3.4.5 on them. The details are
listed below:


As all the other guys in our group are in the habit of using Eclipse on Windows,
I'm trying to work this way. I have successfully submitted the WordCount
MapReduce job to YARN and it ran smoothly through Eclipse on Windows. But
when I tried to run the Spark WordCount, it gave me the following error in
the Eclipse console:

...

15/12/23 11:15:33 ERROR ErrorMonitor: dropping message [class
akka.actor.ActorSelectionMessage] for non-local recipient
[Actor[akka.tcp://sparkMaster@10.20.17.70:7077/]] arriving at
[akka.tcp://sparkMaster@10.20.17.70:7077] inbound addresses are
[akka.tcp://sparkMaster@hadoop00:7077]
akka.event.Logging$Error$NoCause$
15/12/23 11:15:53 INFO Master: 10.20.6.23:56374 got disassociated, removing
it.
15/12/23 11:15:53 INFO Master: 10.20.6.23:56374 got disassociated, removing
it.
15/12/23 11:15:53 WARN ReliableDeliverySupervisor: Association with remote
system [akka.tcp://sparkDriver@10.20.6.23:56374] has failed, address is now
gated for [5000] ms. Reason: [Disassociated] 
...

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Scala WordCount")
      .setMaster("spark://10.20.17.70:7077")
      .setJars(List("C:\\Temp\\test.jar"))
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("hdfs://10.20.17.70:9000/wc/indata/wht.txt")
    textFile.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
  }
}

 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-of-submitting-Spark-task-to-cluster-from-eclipse-IDE-on-Windows-tp25778p25825.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[Spark 1.4.1] StructField for date column in CSV file while creating custom schema

2015-12-28 Thread Divya Gehlot
Hi,
I am a newbie to Spark.
My apologies for such a naive question.
I am using Spark 1.4.1 and writing code in Scala. I have input data as a
CSV file which I am parsing using the spark-csv package. I am creating a custom
schema to process the CSV file.
Now my query is: which datatype, or rather which StructField, should I use for
the date column of my CSV file?
I am using HiveContext and have a requirement to create a Hive table after
processing the CSV file.
For example, my date column in the CSV file looks like

25/11/2014 20/9/2015 25/10/2015 31/10/2012 25/9/2013 25/11/2012 20/10/2013
25/10/2011
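
One approach that works on Spark 1.4.x is to read the date column as a string in the custom schema and convert it to DateType with a UDF before writing the Hive table. The sketch below is only an illustration; the path, column names, and the dd/MM/yyyy pattern are assumptions, and sqlContext is an existing HiveContext/SQLContext:

import java.sql.Date
import java.text.SimpleDateFormat

import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._

// Keep the raw date column as a string in the custom schema.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("event_date", StringType, nullable = true)))

val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(schema)
  .load("hdfs:///data/input.csv")

// Convert "25/11/2014"-style strings into java.sql.Date, which maps to DateType.
val toSqlDate = udf { (s: String) =>
  if (s == null) null
  else new Date(new SimpleDateFormat("dd/MM/yyyy").parse(s).getTime)
}

val withDate = raw.withColumn("event_dt", toSqlDate(raw("event_date")))
withDate.printSchema()  // event_dt should show up as a date column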


Re: Can't submit job to stand alone cluster

2015-12-28 Thread vivek.meghanathan
Also, if it exists, check whether it has read permission for the user who tries to run the
job.

Regards
Vivek

On Tue, Dec 29, 2015 at 6:56 am, Ted Yu 
mailto:yuzhih...@gmail.com>> wrote:

Have you verified that the following file does exist ?

/home/hadoop/git/scalaspark/./target/scala-2.10/cluster-incidents_2.10-1.0.jar

Thanks

On Mon, Dec 28, 2015 at 3:16 PM, Daniel Valdivia 
mailto:h...@danielvaldivia.com>> wrote:
Hi,

I'm trying to submit a job to a small spark cluster running in stand alone 
mode, however it seems like the jar file I'm submitting to the cluster is "not 
found" by the workers nodes.

I might have understood wrong, but I though the Driver node would send this jar 
file to the worker nodes, or should I manually send this file to each worker 
node before I submit the job?

what I'm doing:

 $SPARK_HOME/bin/spark-submit --master spark://sslabnode01:6066 --deploy-mode 
cluster  --class ClusterIncidents 
./target/scala-2.10/cluster-incidents_2.10-1.0.jar

The error I'm getting:

Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/28 15:13:58 INFO RestSubmissionClient: Submitting a request to launch an 
application in spark://sslabnode01:6066.
15/12/28 15:13:59 INFO RestSubmissionClient: Submission successfully created as 
driver-20151228151359-0003. Polling submission state...
15/12/28 15:13:59 INFO RestSubmissionClient: Submitting a request for the 
status of submission driver-20151228151359-0003 in spark://sslabnode01:6066.
15/12/28 15:13:59 INFO RestSubmissionClient: State of driver 
driver-20151228151359-0003 is now ERROR.
15/12/28 15:13:59 INFO RestSubmissionClient: Driver is running on worker 
worker-20151218150246-10.15.235.241-52077 at 
10.15.235.241:52077.
15/12/28 15:13:59 ERROR RestSubmissionClient: Exception from the cluster:
java.io.FileNotFoundException: 
/home/hadoop/git/scalaspark/./target/scala-2.10/cluster-incidents_2.10-1.0.jar 
(No such file or directory)
java.io.FileInputStream.open(Native Method)
java.io.FileInputStream.(FileInputStream.java:146)

org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)

org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
org.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:202)
org.spark-project.guava.io.Files.copy(Files.java:436)

org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:514)
org.apache.spark.util.Utils$.copyFile(Utils.scala:485)
org.apache.spark.util.Utils$.doFetchFile(Utils.scala:562)
org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)

org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)

org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
15/12/28 15:13:59 INFO RestSubmissionClient: Server responded with 
CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20151228151359-0003",
  "serverSparkVersion" : "1.5.2",
  "submissionId" : "driver-20151228151359-0003",
  "success" : true
}

Thanks in advance



-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
For additional commands, e-mail: 
user-h...@spark.apache.org


The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments. WARNING: Computer viruses can be transmitted via email. The 
recipient should check this email and any attachments for the presence of 
viruses. The company accepts no liability for any damage caused by any virus 
transmitted by this email. www.wipro.com


Re: Can't submit job to stand alone cluster

2015-12-28 Thread Ted Yu
Have you verified that the following file does exist ?

/home/hadoop/git/scalaspark/./target/scala-2.10/cluster-incidents_2.10-1.0.jar

Thanks

On Mon, Dec 28, 2015 at 3:16 PM, Daniel Valdivia 
wrote:

> Hi,
>
> I'm trying to submit a job to a small spark cluster running in stand alone
> mode, however it seems like the jar file I'm submitting to the cluster is
> "not found" by the workers nodes.
>
> I might have understood wrong, but I though the Driver node would send
> this jar file to the worker nodes, or should I manually send this file to
> each worker node before I submit the job?
>
> what I'm doing:
>
>  $SPARK_HOME/bin/spark-submit --master spark://sslabnode01:6066
> --deploy-mode cluster  --class ClusterIncidents
> ./target/scala-2.10/cluster-incidents_2.10-1.0.jar
>
> The error I'm getting:
>
> Running Spark using the REST application submission protocol.
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 15/12/28 15:13:58 INFO RestSubmissionClient: Submitting a request to
> launch an application in spark://sslabnode01:6066.
> 15/12/28 15:13:59 INFO RestSubmissionClient: Submission successfully
> created as driver-20151228151359-0003. Polling submission state...
> 15/12/28 15:13:59 INFO RestSubmissionClient: Submitting a request for the
> status of submission driver-20151228151359-0003 in spark://sslabnode01:6066.
> 15/12/28 15:13:59 INFO RestSubmissionClient: State of driver
> driver-20151228151359-0003 is now ERROR.
> 15/12/28 15:13:59 INFO RestSubmissionClient: Driver is running on worker
> worker-20151218150246-10.15.235.241-52077 at 10.15.235.241:52077.
> 15/12/28 15:13:59 ERROR RestSubmissionClient: Exception from the cluster:
> java.io.FileNotFoundException:
> /home/hadoop/git/scalaspark/./target/scala-2.10/cluster-incidents_2.10-1.0.jar
> (No such file or directory)
> java.io.FileInputStream.open(Native Method)
> java.io.FileInputStream.(FileInputStream.java:146)
>
> org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
>
> org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
> org.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:202)
> org.spark-project.guava.io.Files.copy(Files.java:436)
>
> org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:514)
> org.apache.spark.util.Utils$.copyFile(Utils.scala:485)
> org.apache.spark.util.Utils$.doFetchFile(Utils.scala:562)
> org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
> org.apache.spark.deploy.worker.DriverRunner.org
> $apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
>
> org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
> 15/12/28 15:13:59 INFO RestSubmissionClient: Server responded with
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "message" : "Driver successfully submitted as
> driver-20151228151359-0003",
>   "serverSparkVersion" : "1.5.2",
>   "submissionId" : "driver-20151228151359-0003",
>   "success" : true
> }
>
> Thanks in advance
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SPARK_CLASSPATH out, spark.executor.extraClassPath in?

2015-12-28 Thread jiml
I looked into this a lot more and posted an answer to a similar question on
SO, but it's EC2 specific. Still might be some useful info in there and any
comments/corrections/improvements would be greatly appreciated!

http://stackoverflow.com/questions/23687081/spark-workers-unable-to-find-jar-on-ec2-cluster/34502774#34502774
answer from today by me



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-CLASSPATH-out-spark-executor-extraClassPath-in-tp25812p25823.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Can't submit job to stand alone cluster

2015-12-28 Thread Daniel Valdivia
Hi,

I'm trying to submit a job to a small Spark cluster running in stand-alone
mode; however, it seems like the jar file I'm submitting to the cluster is "not
found" by the worker nodes.

I might have understood wrong, but I thought the Driver node would send this jar
file to the worker nodes, or should I manually send this file to each worker
node before I submit the job?

what I'm doing:

 $SPARK_HOME/bin/spark-submit --master spark://sslabnode01:6066 --deploy-mode 
cluster  --class ClusterIncidents 
./target/scala-2.10/cluster-incidents_2.10-1.0.jar 

The error I'm getting:

Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/28 15:13:58 INFO RestSubmissionClient: Submitting a request to launch an 
application in spark://sslabnode01:6066.
15/12/28 15:13:59 INFO RestSubmissionClient: Submission successfully created as 
driver-20151228151359-0003. Polling submission state...
15/12/28 15:13:59 INFO RestSubmissionClient: Submitting a request for the 
status of submission driver-20151228151359-0003 in spark://sslabnode01:6066.
15/12/28 15:13:59 INFO RestSubmissionClient: State of driver 
driver-20151228151359-0003 is now ERROR.
15/12/28 15:13:59 INFO RestSubmissionClient: Driver is running on worker 
worker-20151218150246-10.15.235.241-52077 at 10.15.235.241:52077.
15/12/28 15:13:59 ERROR RestSubmissionClient: Exception from the cluster:
java.io.FileNotFoundException: 
/home/hadoop/git/scalaspark/./target/scala-2.10/cluster-incidents_2.10-1.0.jar 
(No such file or directory)
java.io.FileInputStream.open(Native Method)
java.io.FileInputStream.(FileInputStream.java:146)

org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)

org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
org.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:202)
org.spark-project.guava.io.Files.copy(Files.java:436)

org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:514)
org.apache.spark.util.Utils$.copyFile(Utils.scala:485)
org.apache.spark.util.Utils$.doFetchFile(Utils.scala:562)
org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)

org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)

org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)
15/12/28 15:13:59 INFO RestSubmissionClient: Server responded with 
CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20151228151359-0003",
  "serverSparkVersion" : "1.5.2",
  "submissionId" : "driver-20151228151359-0003",
  "success" : true
}

Thanks in advance



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



what is the difference between coalesce() and repartition()? Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Andy Davidson
Hi Michael

I’ll try 1.6 and report back.

The Javadoc does not say much about coalesce() or repartition(). When I use
repartition() just before I save my output, everything runs as expected.

I thought coalesce() was an optimized version of repartition() and should be
used whenever we know we are reducing the number of partitions.

Kind regards

Andy

From:  Michael Armbrust 
Date:  Monday, December 28, 2015 at 2:41 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: trouble understanding data frame memory usage
"java.io.IOException: Unable to acquire memory"

> Unfortunately in 1.5 we didn't force operators to spill when ran out of memory
> so there is not a lot you can do.  It would be awesome if you could test with
> 1.6 and see if things are any better?
> 
> On Mon, Dec 28, 2015 at 2:25 PM, Andy Davidson 
> wrote:
>> I am using spark 1.5.1. I am running into some memory problems with a java
>> unit test. Yes I could fix it by setting -Xmx (its set to 1024M) how ever I
>> want to better understand what is going on so I can write better code in the
>> future. The test runs on a Mac, master="Local[2]"
>> 
>> I have a java unit test that starts by reading a 672K ascii file. I my output
>> data file is 152K. Its seems strange that such a small amount of data would
>> cause an out of memory exception. I am running a pretty standard machine
>> learning process
>> 
>> 1. Load data
>> 2. create a ML pipeline
>> 3. transform the data
>> 4. Train a model
>> 5. Make predictions
>> 6. Join the predictions back to my original data set
>> 7. Coalesce(1), I only have a small amount of data and want to save it in a
>> single file
>> 8. Save final results back to disk
>> 
>> Step 7: I am unable to call Coalesce() “java.io.IOException: Unable to
>> acquire memory”
>> 
>> To try and figure out what is going I put log messages in to count the number
>> of partitions
>> 
>> Turns out I have 20 input files, each one winds up in a separate partition.
>> Okay so after loading I call coalesce(1) and check to make sure I only have a
>> single partition.
>> 
>> The total number of observations is 1998.
>> 
>> After calling step 7 I count the number of partitions and discovered I have
>> 224 partitions!. Surprising given I called Coalesce(1) before I did anything
>> with the data. My data set should easily fit in memory. When I save them to
>> disk I get 202 files created with 162 of them being empty!
>> 
>> In general I am not explicitly using cache.
>> 
>> Some of the data frames get registered as tables. I find it easier to use
>> sql.
>> 
>> Some of the data frames get converted back to RDDs. I find it easier to
>> create RDD this way
>> 
>> I put calls to unpersist(true). In several places
>> 
>>private void memoryCheck(String name) {
>> 
>> Runtime rt = Runtime.getRuntime();
>> 
>> logger.warn("name: {} \t\ttotalMemory: {} \tfreeMemory: {} df.size:
>> {}", 
>> 
>> name,
>> 
>> String.format("%,d", rt.totalMemory()),
>> 
>> String.format("%,d", rt.freeMemory()));
>> 
>> }
>> 
>> 
>> Any idea how I can get a better understanding of what is going on? My goal is
>> to learn to write better spark code.
>> 
>> Kind regards
>> 
>> Andy
>> 
>> Memory usages at various points in my unit test
>> name: rawInput totalMemory:   447,741,952 freeMemory:   233,203,184
>> 
>> name: naiveBayesModel totalMemory:   509,083,648 freeMemory:   403,504,128
>> 
>> name: lpRDD totalMemory:   509,083,648 freeMemory:   402,288,104
>> 
>> name: results totalMemory:   509,083,648 freeMemory:   368,011,008
>> 
>> 
>> 
>>DataFrame exploreDF = results.select(results.col("id"),
>> 
>> results.col("label"),
>> 
>> results.col("binomialLabel"),
>> 
>> results.col("labelIndex"),
>> 
>> results.col("prediction"),
>> 
>> results.col("words"));
>> 
>> exploreDF.show(10);
>> 
>> 
>> 
>> Yes I realize its strange to switch styles how ever this should not cause
>> memory problems
>> 
>> 
>> 
>> final String exploreTable = "exploreTable";
>> 
>> exploreDF.registerTempTable(exploreTable);
>> 
>> String fmt = "SELECT * FROM %s where binomialLabel = 'signal'";
>> 
>> String stmt = String.format(fmt, exploreTable);
>> 
>> 
>> 
>> DataFrame subsetToSave = sqlContext.sql(stmt);// .show(100);
>> 
>> 
>> 
>> name: subsetToSave totalMemory: 1,747,451,904 freeMemory: 1,049,447,144
>> 
>> 
>> 
>> exploreDF.unpersist(true); does not resolve memory issue
>> 
>> 
>> 
> 




Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Chris Fregly
the 200 number looks strangely similar to the following default number of
post-shuffle partitions which is often left untuned:

  spark.sql.shuffle.partitions

here's the property defined in the Spark source:

https://github.com/apache/spark/blob/834e71489bf560302f9d743dff669df1134e9b74/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L232

note that Spark 1.6+ will make this config param obsolete in favor of
adaptive execution which uses the following as a low watermark for # of
post-shuffle partitions:

spark.sql.adaptive.minNumPostShufflePartitions

here's the property defined in the Spark source:

https://github.com/apache/spark/blob/834e71489bf560302f9d743dff669df1134e9b74/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L245
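
If that default is the culprit, a quick sketch of tuning it from the 1.5-era API (the value 8 is arbitrary; sqlContext is an existing SQLContext):

// The default of 200 post-shuffle partitions lines up with the ~200 output
// files reported earlier in this thread; lower it for a small data set before
// running the joins/aggregations.
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
println(sqlContext.getConf("spark.sql.shuffle.partitions"))  // "8"

// Alternatively, set it once at submit time:
//   spark-submit --conf spark.sql.shuffle.partitions=8 ...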

On Mon, Dec 28, 2015 at 5:41 PM, Michael Armbrust 
wrote:

> Unfortunately in 1.5 we didn't force operators to spill when ran out of
> memory so there is not a lot you can do.  It would be awesome if you could
> test with 1.6 and see if things are any better?
>
> On Mon, Dec 28, 2015 at 2:25 PM, Andy Davidson <
> a...@santacruzintegration.com> wrote:
>
>> I am using spark 1.5.1. I am running into some memory problems with a
>> java unit test. Yes I could fix it by setting –Xmx (its set to 1024M) how
>> ever I want to better understand what is going on so I can write better
>> code in the future. The test runs on a Mac, master="Local[2]"
>>
>> I have a java unit test that starts by reading a 672K ascii file. I my
>> output data file is 152K. Its seems strange that such a small amount of
>> data would cause an out of memory exception. I am running a pretty standard
>> machine learning process
>>
>>
>>1. Load data
>>2. create a ML pipeline
>>3. transform the data
>>4. Train a model
>>5. Make predictions
>>6. Join the predictions back to my original data set
>>7. Coalesce(1), I only have a small amount of data and want to save
>>it in a single file
>>8. Save final results back to disk
>>
>>
>> Step 7: I am unable to call Coalesce() “java.io.IOException: Unable to
>> acquire memory”
>>
>> To try and figure out what is going I put log messages in to count the
>> number of partitions
>>
>> Turns out I have 20 input files, each one winds up in a separate
>> partition. Okay so after loading I call coalesce(1) and check to make sure
>> I only have a single partition.
>>
>> The total number of observations is 1998.
>>
>> After calling step 7 I count the number of partitions and discovered I
>> have 224 partitions!. Surprising given I called Coalesce(1) before I
>> did anything with the data. My data set should easily fit in memory. When I
>> save them to disk I get 202 files created with 162 of them being empty!
>>
>> In general I am not explicitly using cache.
>>
>> Some of the data frames get registered as tables. I find it easier to use
>> sql.
>>
>> Some of the data frames get converted back to RDDs. I find it easier to
>> create RDD this way
>>
>> I put calls to unpersist(true). In several places
>>
>>private void memoryCheck(String name) {
>>
>> Runtime rt = Runtime.getRuntime();
>>
>> logger.warn("name: {} \t\ttotalMemory: {} \tfreeMemory: {}
>> df.size: {}",
>>
>> name,
>>
>> String.format("%,d", rt.totalMemory()),
>>
>> String.format("%,d", rt.freeMemory()));
>>
>> }
>>
>> Any idea how I can get a better understanding of what is going on? My
>> goal is to learn to write better spark code.
>>
>> Kind regards
>>
>> Andy
>>
>> Memory usages at various points in my unit test
>>
>> name: rawInput totalMemory:   447,741,952 freeMemory:   233,203,184
>>
>> name: naiveBayesModel totalMemory:   509,083,648 freeMemory:
>> 403,504,128
>>
>> name: lpRDD totalMemory:   509,083,648 freeMemory:   402,288,104
>>
>> name: results totalMemory:   509,083,648 freeMemory:   368,011,008
>>
>>
>>DataFrame exploreDF = results.select(results.col("id"),
>>
>> results.col("label"),
>>
>> results.col("binomialLabel"),
>>
>>
>> results.col("labelIndex"),
>>
>>
>> results.col("prediction"),
>>
>>
>> results.col("words"));
>>
>> exploreDF.show(10);
>>
>>
>>
>> Yes I realize its strange to switch styles how ever this should not cause
>> memory problems
>>
>>
>> final String exploreTable = "exploreTable";
>>
>> exploreDF.registerTempTable(exploreTable);
>>
>> String fmt = "SELECT * FROM %s where binomialLabel = 'signal'";
>>
>> String stmt = String.format(fmt, exploreTable);
>>
>>
>> DataFrame subsetToSave = sqlContext.sql(stmt);// .show(100);
>>
>>
>> name: subsetToSave totalMemory: 1,747,451,904 freeMemory: 1,049,447,144
>>
>>
>> exploreDF.unpersist(true); does not resolve memory issue
>>
>>
>>
>


-- 

*Chris Fregly*
Principal Data Solutions

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Annabel Melongo
Additionally, if you already have valid SQL statements to process said
data, instead of reinventing the wheel with RDD functions, you can speed up
implementation by using DataFrames along with those existing SQL statements.
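
A minimal sketch of that pattern on the 1.5-era API (the table, column names, and threshold are made up; sqlContext and df are assumed to exist):

// Expose the DataFrame to SQL and reuse an existing statement as-is,
// instead of re-expressing it with RDD transformations.
df.registerTempTable("events")

val bigSpenders = sqlContext.sql(
  "SELECT customer_id, SUM(amount) AS total " +
  "FROM events GROUP BY customer_id HAVING SUM(amount) > 1000")

bigSpenders.show(10)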

On Monday, December 28, 2015 5:37 PM, Darren Govoni  
wrote:
 

  I'll throw a thought in here.
Dataframes are nice if your data is uniform and clean with consistent schema.
However in many big data problems this is seldom the case. 


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Chris Fregly  
Date: 12/28/2015 5:22 PM (GMT-05:00) 
To: Richard Eggert  
Cc: Daniel Siegmann , Divya Gehlot 
, "user @spark"  
Subject: Re: DataFrame Vs RDDs ... Which one to use When ? 

here's a good article that sums it up, in my opinion: 
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
basically, building apps with RDDs is like building with apps with primitive 
JVM bytecode.  haha.
@richard:  remember that even if you're currently writing RDDs in Java/Scala, 
you're not gaining the code gen/rewrite performance benefits of the Catalyst 
optimizer.
i agree with @daniel who suggested that you start with DataFrames and revert to 
RDDs only when DataFrames don't give you what you need.
the only time i use RDDs directly these days is when i'm dealing with a Spark 
library that has not yet moved to DataFrames - ie. GraphX - and it's kind of 
annoying switching back and forth.
almost everything you need should be in the DataFrame API.
Datasets are similar to RDDs, but give you strong compile-time typing, tabular 
structure, and Catalyst optimizations.
hopefully Datasets is the last API we see from Spark SQL...  i'm getting tired 
of re-writing slides and book chapters!  :)
On Mon, Dec 28, 2015 at 4:55 PM, Richard Eggert  
wrote:

One advantage of RDD's over DataFrames is that RDD's allow you to use your own 
data types, whereas DataFrames are backed by RDD's of Record objects, which are 
pretty flexible but don't give you much in the way of compile-time type 
checking. If you have an RDD of case class elements or JSON, then Spark SQL can 
automatically figure out how to convert it into an RDD of Record objects (and 
therefore a DataFrame), but there's no way to automatically go the other way 
(from DataFrame/Record back to custom types).
In general, you can ultimately do more with RDDs than DataFrames, but 
DataFrames give you a lot of niceties (automatic query optimization, table 
joins, SQL-like syntax, etc.) for free, and can avoid some of the runtime 
overhead associated with writing RDD code in a non-JVM language (such as Python 
or R), since the query optimizer is effectively creating the required JVM code 
under the hood. There's little to no performance benefit if you're already 
writing Java or Scala code, however (and RDD-based code may actually perform 
better in some cases, if you're willing to carefully tune your code).
On Mon, Dec 28, 2015 at 3:05 PM, Daniel Siegmann  
wrote:

DataFrames are a higher level API for working with tabular data - RDDs are used 
underneath. You can use either and easily convert between them in your code as 
necessary.

DataFrames provide a nice abstraction for many cases, so it may be easier to 
code against them. Though if you're used to thinking in terms of collections 
rather than tables, you may find RDDs more natural. Data frames can also be 
faster, since Spark will do some optimizations under the hood - if you are 
using PySpark, this will avoid the overhead. Data frames may also perform 
better if you're reading structured data, such as a Hive table or Parquet files.

I recommend you prefer data frames, switching over to RDDs as necessary (when 
you need to perform an operation not supported by data frames / Spark SQL).

HOWEVER (and this is a big one), Spark 1.6 will have yet another API - 
datasets. The release of Spark 1.6 is currently being finalized and I would 
expect it in the next few days. You will probably want to use the new API once 
it's available.


On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot  wrote:

Hi,
I am new bee to spark and a bit confused about RDDs and DataFames in Spark.
Can somebody explain me with the use cases which one to use when ?

Would really appreciate the clarification .

Thanks,
Divya 






-- 
Rich



-- 

Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com

  

Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Michael Armbrust
Unfortunately in 1.5 we didn't force operators to spill when ran out of
memory so there is not a lot you can do.  It would be awesome if you could
test with 1.6 and see if things are any better?

On Mon, Dec 28, 2015 at 2:25 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> I am using spark 1.5.1. I am running into some memory problems with a java
> unit test. Yes I could fix it by setting –Xmx (its set to 1024M) how ever I
> want to better understand what is going on so I can write better code
> in the future. The test runs on a Mac, master="Local[2]"
>
> I have a java unit test that starts by reading a 672K ascii file. I my
> output data file is 152K. Its seems strange that such a small amount of
> data would cause an out of memory exception. I am running a pretty standard
> machine learning process
>
>
>1. Load data
>2. create a ML pipeline
>3. transform the data
>4. Train a model
>5. Make predictions
>6. Join the predictions back to my original data set
>7. Coalesce(1), I only have a small amount of data and want to save it
>in a single file
>8. Save final results back to disk
>
>
> Step 7: I am unable to call Coalesce() “java.io.IOException: Unable to
> acquire memory”
>
> To try and figure out what is going I put log messages in to count the
> number of partitions
>
> Turns out I have 20 input files, each one winds up in a separate
> partition. Okay so after loading I call coalesce(1) and check to make sure
> I only have a single partition.
>
> The total number of observations is 1998.
>
> After calling step 7 I count the number of partitions and discovered I
> have 224 partitions!. Surprising given I called Coalesce(1) before I
> did anything with the data. My data set should easily fit in memory. When I
> save them to disk I get 202 files created with 162 of them being empty!
>
> In general I am not explicitly using cache.
>
> Some of the data frames get registered as tables. I find it easier to use
> sql.
>
> Some of the data frames get converted back to RDDs. I find it easier to
> create RDD this way
>
> I put calls to unpersist(true). In several places
>
>private void memoryCheck(String name) {
>
> Runtime rt = Runtime.getRuntime();
>
> logger.warn("name: {} \t\ttotalMemory: {} \tfreeMemory: {}
> df.size: {}",
>
> name,
>
> String.format("%,d", rt.totalMemory()),
>
> String.format("%,d", rt.freeMemory()));
>
> }
>
> Any idea how I can get a better understanding of what is going on? My goal
> is to learn to write better spark code.
>
> Kind regards
>
> Andy
>
> Memory usages at various points in my unit test
>
> name: rawInput totalMemory:   447,741,952 freeMemory:   233,203,184
>
> name: naiveBayesModel totalMemory:   509,083,648 freeMemory:
> 403,504,128
>
> name: lpRDD totalMemory:   509,083,648 freeMemory:   402,288,104
>
> name: results totalMemory:   509,083,648 freeMemory:   368,011,008
>
>
>DataFrame exploreDF = results.select(results.col("id"),
>
> results.col("label"),
>
> results.col("binomialLabel"),
>
>
> results.col("labelIndex"),
>
>
> results.col("prediction"),
>
>
> results.col("words"));
>
> exploreDF.show(10);
>
>
>
> Yes I realize its strange to switch styles how ever this should not cause
> memory problems
>
>
> final String exploreTable = "exploreTable";
>
> exploreDF.registerTempTable(exploreTable);
>
> String fmt = "SELECT * FROM %s where binomialLabel = 'signal'";
>
> String stmt = String.format(fmt, exploreTable);
>
>
> DataFrame subsetToSave = sqlContext.sql(stmt);// .show(100);
>
>
> name: subsetToSave totalMemory: 1,747,451,904 freeMemory: 1,049,447,144
>
>
> exploreDF.unpersist(true); does not resolve memory issue
>
>
>


Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Darren Govoni


I'll throw a thought in here.
Dataframes are nice if your data is uniform and clean with consistent schema.
However in many big data problems this is seldom the case. 


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Chris Fregly  
Date: 12/28/2015  5:22 PM  (GMT-05:00) 
To: Richard Eggert  
Cc: Daniel Siegmann , Divya Gehlot 
, "user @spark"  
Subject: Re: DataFrame Vs RDDs ... Which one to use When ? 

here's a good article that sums it up, in my opinion: 
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
basically, building apps with RDDs is like building with apps with primitive 
JVM bytecode.  haha.
@richard:  remember that even if you're currently writing RDDs in Java/Scala, 
you're not gaining the code gen/rewrite performance benefits of the Catalyst 
optimizer.
i agree with @daniel who suggested that you start with DataFrames and revert to 
RDDs only when DataFrames don't give you what you need.
the only time i use RDDs directly these days is when i'm dealing with a Spark 
library that has not yet moved to DataFrames - ie. GraphX - and it's kind of 
annoying switching back and forth.
almost everything you need should be in the DataFrame API.
Datasets are similar to RDDs, but give you strong compile-time typing, tabular 
structure, and Catalyst optimizations.
hopefully Datasets is the last API we see from Spark SQL...  i'm getting tired 
of re-writing slides and book chapters!  :)
On Mon, Dec 28, 2015 at 4:55 PM, Richard Eggert  
wrote:
One advantage of RDD's over DataFrames is that RDD's allow you to use your own 
data types, whereas DataFrames are backed by RDD's of Record objects, which are 
pretty flexible but don't give you much in the way of compile-time type 
checking. If you have an RDD of case class elements or JSON, then Spark SQL can 
automatically figure out how to convert it into an RDD of Record objects (and 
therefore a DataFrame), but there's no way to automatically go the other way 
(from DataFrame/Record back to custom types).
In general, you can ultimately do more with RDDs than DataFrames, but 
DataFrames give you a lot of niceties (automatic query optimization, table 
joins, SQL-like syntax, etc.) for free, and can avoid some of the runtime 
overhead associated with writing RDD code in a non-JVM language (such as Python 
or R), since the query optimizer is effectively creating the required JVM code 
under the hood. There's little to no performance benefit if you're already 
writing Java or Scala code, however (and RDD-based code may actually perform 
better in some cases, if you're willing to carefully tune your code).
On Mon, Dec 28, 2015 at 3:05 PM, Daniel Siegmann  
wrote:
DataFrames are a higher level API for working with tabular data - RDDs are used 
underneath. You can use either and easily convert between them in your code as 
necessary.

DataFrames provide a nice abstraction for many cases, so it may be easier to 
code against them. Though if you're used to thinking in terms of collections 
rather than tables, you may find RDDs more natural. Data frames can also be 
faster, since Spark will do some optimizations under the hood - if you are 
using PySpark, this will avoid the overhead. Data frames may also perform 
better if you're reading structured data, such as a Hive table or Parquet files.

I recommend you prefer data frames, switching over to RDDs as necessary (when 
you need to perform an operation not supported by data frames / Spark SQL).

HOWEVER (and this is a big one), Spark 1.6 will have yet another API - 
datasets. The release of Spark 1.6 is currently being finalized and I would 
expect it in the next few days. You will probably want to use the new API once 
it's available.


On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot  wrote:
Hi,
I am new bee to spark and a bit confused about RDDs and DataFames in Spark.
Can somebody explain me with the use cases which one to use when ?

Would really appreciate the clarification .

Thanks,
Divya 






-- 
Rich




-- 

Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com



trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Andy Davidson
I am using spark 1.5.1. I am running into some memory problems with a java
unit test. Yes I could fix it by setting -Xmx (it's set to 1024M), but I
want to better understand what is going on so I can write better code in the
future. The test runs on a Mac, master="Local[2]"

I have a java unit test that starts by reading a 672K ascii file. My
output data file is 152K. It seems strange that such a small amount of data
would cause an out of memory exception. I am running a pretty standard
machine learning process

1. Load data
2. create a ML pipeline
3. transform the data
4. Train a model
5. Make predictions
6. Join the predictions back to my original data set
7. Coalesce(1), I only have a small amount of data and want to save it in a
single file
8. Save final results back to disk

Step 7: I am unable to call Coalesce() "java.io.IOException: Unable to
acquire memory"

To try and figure out what is going on, I put log messages in to count the
number of partitions.

Turns out I have 20 input files, each one winds up in a separate partition.
Okay so after loading I call coalesce(1) and check to make sure I only have
a single partition.

The total number of observations is 1998.

After calling step 7 I count the number of partitions and discover I have
224 partitions! Surprising, given I called Coalesce(1) before I did anything
with the data. My data set should easily fit in memory. When I save it to
disk I get 202 files created, with 162 of them being empty!

In general I am not explicitly using cache.

Some of the data frames get registered as tables. I find it easier to use
sql.

Some of the data frames get converted back to RDDs. I find it easier to
create RDDs this way.

I put calls to unpersist(true) in several places:

   private void memoryCheck(String name) {

Runtime rt = Runtime.getRuntime();

logger.warn("name: {} \t\ttotalMemory: {} \tfreeMemory: {} df.size:
{}", 

name,

String.format("%,d", rt.totalMemory()),

String.format("%,d", rt.freeMemory()));

}
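
(A quick spark-shell check of partition counts can also make the coalesce behaviour visible; a sketch only, where df stands for whichever DataFrame is being inspected:)

// How many partitions does the DataFrame really have at this step?
println(df.rdd.partitions.length)
println(df.coalesce(1).rdd.partitions.length)  // should print 1

// Note that coalesce() on a DataFrame returns a new DataFrame: anything derived
// from the original (un-coalesced) DataFrame afterwards is not affected by it.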


Any idea how I can get a better understanding of what is going on? My goal
is to learn to write better spark code.

Kind regards

Andy

Memory usages at various points in my unit test
name: rawInput totalMemory:   447,741,952 freeMemory:   233,203,184

name: naiveBayesModel totalMemory:   509,083,648 freeMemory:   403,504,128

name: lpRDD totalMemory:   509,083,648 freeMemory:   402,288,104

name: results totalMemory:   509,083,648 freeMemory:   368,011,008



   DataFrame exploreDF = results.select(results.col("id"),

results.col("label"),

results.col("binomialLabel"),

results.col("labelIndex"),

results.col("prediction"),

results.col("words"));

exploreDF.show(10);



Yes, I realize it's strange to switch styles; however, this should not cause
memory problems.



final String exploreTable = "exploreTable";

exploreDF.registerTempTable(exploreTable);

String fmt = "SELECT * FROM %s where binomialLabel = 'signal'";

String stmt = String.format(fmt, exploreTable);



DataFrame subsetToSave = sqlContext.sql(stmt);// .show(100);



name: subsetToSave totalMemory: 1,747,451,904 freeMemory: 1,049,447,144



exploreDF.unpersist(true); does not resolve memory issue







Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Chris Fregly
here's a good article that sums it up, in my opinion:

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

basically, building apps with RDDs is like building apps with primitive JVM
bytecode.  haha.

@richard:  remember that even if you're currently writing RDDs in
Java/Scala, you're not gaining the code gen/rewrite performance benefits of
the Catalyst optimizer.

i agree with @daniel who suggested that you start with DataFrames and
revert to RDDs only when DataFrames don't give you what you need.

the only time i use RDDs directly these days is when i'm dealing with a
Spark library that has not yet moved to DataFrames - ie. GraphX - and it's
kind of annoying switching back and forth.

almost everything you need should be in the DataFrame API.

Datasets are similar to RDDs, but give you strong compile-time typing,
tabular structure, and Catalyst optimizations.

hopefully Datasets is the last API we see from Spark SQL...  i'm getting
tired of re-writing slides and book chapters!  :)

On Mon, Dec 28, 2015 at 4:55 PM, Richard Eggert 
wrote:

> One advantage of RDD's over DataFrames is that RDD's allow you to use your
> own data types, whereas DataFrames are backed by RDD's of Record objects,
> which are pretty flexible but don't give you much in the way of
> compile-time type checking. If you have an RDD of case class elements or
> JSON, then Spark SQL can automatically figure out how to convert it into an
> RDD of Record objects (and therefore a DataFrame), but there's no way to
> automatically go the other way (from DataFrame/Record back to custom types).
>
> In general, you can ultimately do more with RDDs than DataFrames, but
> DataFrames give you a lot of niceties (automatic query optimization, table
> joins, SQL-like syntax, etc.) for free, and can avoid some of the runtime
> overhead associated with writing RDD code in a non-JVM language (such as
> Python or R), since the query optimizer is effectively creating the
> required JVM code under the hood. There's little to no performance benefit
> if you're already writing Java or Scala code, however (and RDD-based code
> may actually perform better in some cases, if you're willing to carefully
> tune your code).
>
> On Mon, Dec 28, 2015 at 3:05 PM, Daniel Siegmann <
> daniel.siegm...@teamaol.com> wrote:
>
>> DataFrames are a higher level API for working with tabular data - RDDs
>> are used underneath. You can use either and easily convert between them in
>> your code as necessary.
>>
>> DataFrames provide a nice abstraction for many cases, so it may be easier
>> to code against them. Though if you're used to thinking in terms of
>> collections rather than tables, you may find RDDs more natural. Data frames
>> can also be faster, since Spark will do some optimizations under the hood -
>> if you are using PySpark, this will avoid the overhead. Data frames may
>> also perform better if you're reading structured data, such as a Hive table
>> or Parquet files.
>>
>> I recommend you prefer data frames, switching over to RDDs as necessary
>> (when you need to perform an operation not supported by data frames / Spark
>> SQL).
>>
>> HOWEVER (and this is a big one), Spark 1.6 will have yet another API -
>> datasets. The release of Spark 1.6 is currently being finalized and I would
>> expect it in the next few days. You will probably want to use the new API
>> once it's available.
>>
>>
>> On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot 
>> wrote:
>>
>>> Hi,
>>> I am new bee to spark and a bit confused about RDDs and DataFames in
>>> Spark.
>>> Can somebody explain me with the use cases which one to use when ?
>>>
>>> Would really appreciate the clarification .
>>>
>>> Thanks,
>>> Divya
>>>
>>
>>
>
>
> --
> Rich
>



-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com


Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Richard Eggert
One advantage of RDD's over DataFrames is that RDD's allow you to use your
own data types, whereas DataFrames are backed by RDD's of Record objects,
which are pretty flexible but don't give you much in the way of
compile-time type checking. If you have an RDD of case class elements or
JSON, then Spark SQL can automatically figure out how to convert it into an
RDD of Record objects (and therefore a DataFrame), but there's no way to
automatically go the other way (from DataFrame/Record back to custom types).

In general, you can ultimately do more with RDDs than DataFrames, but
DataFrames give you a lot of niceties (automatic query optimization, table
joins, SQL-like syntax, etc.) for free, and can avoid some of the runtime
overhead associated with writing RDD code in a non-JVM language (such as
Python or R), since the query optimizer is effectively creating the
required JVM code under the hood. There's little to no performance benefit
if you're already writing Java or Scala code, however (and RDD-based code
may actually perform better in some cases, if you're willing to carefully
tune your code).
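
For illustration, a minimal Java sketch of the asymmetry described above, assuming
the Spark 1.5.x API and a hypothetical Person bean: the typed RDD converts to a
DataFrame automatically, but converting back only yields generic Row objects.

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class RddDataFrameRoundTrip {
    public static class Person implements Serializable {
        private String name;
        private int age;
        public Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    static void demo(JavaSparkContext sc, SQLContext sqlContext) {
        JavaRDD<Person> people = sc.parallelize(Arrays.asList(new Person("a", 30)));
        // RDD of beans -> DataFrame: the schema is inferred from the bean's getters.
        DataFrame df = sqlContext.createDataFrame(people, Person.class);
        // DataFrame -> RDD: you only get Row back, not Person; fields are read by name.
        JavaRDD<Row> rows = df.javaRDD();
        String name = rows.first().getAs("name");
    }
}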

On Mon, Dec 28, 2015 at 3:05 PM, Daniel Siegmann <
daniel.siegm...@teamaol.com> wrote:

> DataFrames are a higher level API for working with tabular data - RDDs are
> used underneath. You can use either and easily convert between them in your
> code as necessary.
>
> DataFrames provide a nice abstraction for many cases, so it may be easier
> to code against them. Though if you're used to thinking in terms of
> collections rather than tables, you may find RDDs more natural. Data frames
> can also be faster, since Spark will do some optimizations under the hood -
> if you are using PySpark, this will avoid the overhead. Data frames may
> also perform better if you're reading structured data, such as a Hive table
> or Parquet files.
>
> I recommend you prefer data frames, switching over to RDDs as necessary
> (when you need to perform an operation not supported by data frames / Spark
> SQL).
>
> HOWEVER (and this is a big one), Spark 1.6 will have yet another API -
> datasets. The release of Spark 1.6 is currently being finalized and I would
> expect it in the next few days. You will probably want to use the new API
> once it's available.
>
>
> On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot 
> wrote:
>
>> Hi,
>> I am new bee to spark and a bit confused about RDDs and DataFames in
>> Spark.
>> Can somebody explain me with the use cases which one to use when ?
>>
>> Would really appreciate the clarification .
>>
>> Thanks,
>> Divya
>>
>
>


-- 
Rich


Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Daniel Siegmann
DataFrames are a higher level API for working with tabular data - RDDs are
used underneath. You can use either and easily convert between them in your
code as necessary.

DataFrames provide a nice abstraction for many cases, so it may be easier
to code against them. Though if you're used to thinking in terms of
collections rather than tables, you may find RDDs more natural. Data frames
can also be faster, since Spark will do some optimizations under the hood -
if you are using PySpark, this will avoid the overhead. Data frames may
also perform better if you're reading structured data, such as a Hive table
or Parquet files.

I recommend you prefer data frames, switching over to RDDs as necessary
(when you need to perform an operation not supported by data frames / Spark
SQL).

HOWEVER (and this is a big one), Spark 1.6 will have yet another API -
datasets. The release of Spark 1.6 is currently being finalized and I would
expect it in the next few days. You will probably want to use the new API
once it's available.


On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot 
wrote:

> Hi,
> I am new bee to spark and a bit confused about RDDs and DataFames in Spark.
> Can somebody explain me with the use cases which one to use when ?
>
> Would really appreciate the clarification .
>
> Thanks,
> Divya
>


Re: Passing parameters to spark SQL

2015-12-28 Thread Aaron Jackson
Yeah, that's what I thought.

In this specific case, I'm porting over some scripts from an existing RDBMS
platform.  I had been porting them (slowly) to in-code notation with python
or scala.  However, to expedite my efforts (and presumably theirs since I'm
not doing this forever), I went down the SQL path.  The problem is the loss
of type and the possibility for SQL injection. No biggie, just means that
where parameterized queries are in-play, we'll have to write it out in-code
rather than in SQL.
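
For reference, a hedged sketch of that in-code form with the Spark 1.5.x Java
DataFrame API (the column name, value, and helper are hypothetical): the value
travels as a typed Column literal, so there is no string building and no
injection surface.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.DataFrame;

public class ParamFilter {
    // Equivalent of "WHERE columnName = ?" expressed through Column expressions.
    static DataFrame whereEquals(DataFrame df, String columnName, Object value) {
        return df.filter(col(columnName).equalTo(lit(value)));
    }
}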

Thanks,

Aaron

On Sun, Dec 27, 2015 at 8:06 PM, Michael Armbrust 
wrote:

> The only way to do this for SQL is though the JDBC driver.
>
> However, you can use literal values without lossy/unsafe string
> conversions by using the DataFrame API.  For example, to filter:
>
> import org.apache.spark.sql.functions._
> df.filter($"columnName" === lit(value))
>
> On Sun, Dec 27, 2015 at 1:11 PM, Ajaxx  wrote:
>
>> Given a SQLContext (or HiveContext) is it possible to pass in parameters
>> to a
>> query.  There are several reasons why this makes sense, including loss of
>> data type during conversion to string, SQL injection, etc.
>>
>> But currently, it appears that SQLContext.sql() only takes a single
>> parameter which is a string.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Passing-parameters-to-spark-SQL-tp25806.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: partitioning json data in spark

2015-12-28 Thread Michael Armbrust
I don't think that's true (though if the docs are wrong we should fix
that).  In Spark 1.5 we converted JSON to go through the same code path as
parquet.
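
A minimal sketch of what that enables, assuming Spark 1.5+ and a hypothetical
partition column named "country"; the write goes through the generic data source
path, so partitionBy is no longer Parquet-only:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;

public class PartitionedJsonWrite {
    // Writes one directory per distinct value of the partition column, as JSON files.
    static void writePartitioned(DataFrame df, String outputPath) {
        df.write()
          .format("json")
          .partitionBy("country")
          .mode(SaveMode.Overwrite)
          .save(outputPath);
    }
}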

On Mon, Dec 28, 2015 at 12:20 AM, Նարեկ Գալստեան 
wrote:

> Well, I could try to do that,
> but *partitionBy *method is anyway only supported for Parquet format even
> in Spark 1.5.1
>
> Narek
>
> Narek Galstyan
>
> Նարեկ Գալստյան
>
> On 27 December 2015 at 21:50, Ted Yu  wrote:
>
>> Is upgrading to 1.5.x a possibility for you ?
>>
>> Cheers
>>
>> On Sun, Dec 27, 2015 at 9:28 AM, Նարեկ Գալստեան 
>> wrote:
>>
>>>
>>> http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>>>  I did try but it all was in vain.
>>> It is also explicitly written in api docs that it only supports Parquet.
>>>
>>> ​
>>>
>>> Narek Galstyan
>>>
>>> Նարեկ Գալստյան
>>>
>>> On 27 December 2015 at 17:52, Igor Berman  wrote:
>>>
 have you tried to specify format of your output, might be parquet is
 default format?
 df.write().format("json").mode(SaveMode.Overwrite).save("/tmp/path");

 On 27 December 2015 at 15:18, Նարեկ Գալստեան 
 wrote:

> Hey all!
> I am willing to partition *json *data by a column name and store the
> result as a collection of json files to be loaded to another database.
>
> I could use spark's built in *partitonBy *function but it only
> outputs in parquet format which is not desirable for me.
>
> Could you suggest me a way to deal with this problem?
> Narek Galstyan
>
> Նարեկ Գալստյան
>


>>>
>>
>


Re: ERROR server.TThreadPoolServer: Error occurred during processing of message

2015-12-28 Thread Dasun Hegoda
Anyone?

On Sun, Dec 27, 2015 at 11:30 AM, Dasun Hegoda 
wrote:

> I was able to figure out where the problem is exactly. It's spark. because
> when I start the hiveserver2 manually and run query it work fine. but when
> I try to access the hive through spark's thrift port it does not work.
> throws the above mentioned error.
>
> Please help me to fix this.
>
> On Sun, Dec 27, 2015 at 11:15 AM, Dasun Hegoda 
> wrote:
>
>> Yes, didn't work for me
>>
>>
>> On Sun, Dec 27, 2015 at 10:56 AM, Ted Yu  wrote:
>>
>>> Have you seen this ?
>>>
>>>
>>> http://stackoverflow.com/questions/30705576/python-cannot-connect-hiveserver2
>>>
>>> On Sat, Dec 26, 2015 at 9:09 PM, Dasun Hegoda 
>>> wrote:
>>>
 I'm running apache-hive-1.2.1-bin and spark-1.5.1-bin-hadoop2.6. spark
 as the hive engine. When I try to connect through JasperStudio using thrift
 port I get below error. I'm running ubuntu 14.04.

 15/12/26 23:36:20 ERROR server.TThreadPoolServer: Error occurred
 during processing of message.
 java.lang.RuntimeException:
 org.apache.thrift.transport.TSaslTransportException: No data or no sasl
 data in the stream
 at
 org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
 at
 org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:268)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.thrift.transport.TSaslTransportException: No
 data or no sasl data in the stream
 at
 org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:328)
 at
 org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
 at
 org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
 ... 4 more
 15/12/26 23:36:20 INFO thrift.ThriftCLIService: Client protocol
 version: HIVE_CLI_SERVICE_PROTOCOL_V5
 15/12/26 23:36:20 INFO session.SessionState: Created local
 directory: /tmp/c670ff55-01bb-4f6f-a375-d22a13c44eaf_resources
 15/12/26 23:36:20 INFO session.SessionState: Created HDFS
 directory: /tmp/hive/anonymous/c670ff55-01bb-4f6f-a375-d22a13c44eaf
 15/12/26 23:36:20 INFO session.SessionState: Created local
 directory: /tmp/hduser/c670ff55-01bb-4f6f-a375-d22a13c44eaf
 15/12/26 23:36:20 INFO session.SessionState: Created HDFS
 directory:
 /tmp/hive/anonymous/c670ff55-01bb-4f6f-a375-d22a13c44eaf/_tmp_space.db
 15/12/26 23:36:20 INFO thriftserver.SparkExecuteStatementOperation:
 Running query 'use default' with d842cd88-2fda-42b2-b943-468017e95f37
 15/12/26 23:36:20 INFO parse.ParseDriver: Parsing command: use
 default
 15/12/26 23:36:20 INFO parse.ParseDriver: Parse Completed
 15/12/26 23:36:20 INFO log.PerfLogger: >>> from=org.apache.hadoop.hive.ql.Driver>
 15/12/26 23:36:20 INFO log.PerfLogger: >>> from=org.apache.hadoop.hive.ql.Driver>
 15/12/26 23:36:20 INFO log.PerfLogger: >>> from=org.apache.hadoop.hive.ql.Driver>
 15/12/26 23:36:20 INFO log.PerfLogger: >>> from=org.apache.hadoop.hive.ql.Driver>
 15/12/26 23:36:20 INFO parse.ParseDriver: Parsing command: use
 default
 15/12/26 23:36:20 INFO parse.ParseDriver: Parse Completed
 15/12/26 23:36:20 INFO log.PerfLogger: >>> start=1451190980590 end=1451190980591 duration=1
 from=org.apache.hadoop.hive.ql.Driver>
 15/12/26 23:36:20 INFO log.PerfLogger: >>> method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver>
 15/12/26 23:36:20 INFO metastore.HiveMetaStore: 2: get_database:
 default
 15/12/26 23:36:20 INFO HiveMetaStore.audit: ugi=hduser
 ip=unknown-ip-addr cmd=get_database: default
 15/12/26 23:36:20 INFO metastore.HiveMetaStore: 2: Opening raw
 store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/12/26 23:36:20 INFO metastore.ObjectStore: ObjectStore,
 initialize called
 15/12/26 23:36:20 INFO DataNucleus.Query: Reading in results for
 query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the
 connection used is closing
 15/12/26 23:36:20 INFO metastore.MetaStoreDirectSql: Using direct
 SQL, underlying DB is DERBY
 15/12/26 23:36:20 INFO metastore.ObjectStore: Initialized
 ObjectStore
 15/12/26 23:36:20 INFO ql.Driver: Semantic Analysis Completed
 15/12/26 23:36:20 INFO log.PerfLogger: >>> method=semanticAnalyze start=1451190980592 end=1451190980620 duration=28
 from=org.apache.hadoop.hive.ql.Driver>
 15/12/26 23:36:20 INFO ql.Driver: Returning Hive schema:
 Schema(fieldSchemas:null, properties:

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Andy Davidson
Hi Yanbo

I use spark.csv to load my data set. I work with both Java and Python. I
would recommend you print the first couple of rows and also print the schema
to make sure your data is loaded as you expect. You might find the following
code example helpful. You may need to programmatically set the schema
depending on what your data looks like.


public class LoadTidyDataFrame {

static  DataFrame fromCSV(SQLContext sqlContext, String file) {

DataFrame df = sqlContext.read()

.format("com.databricks.spark.csv")

.option("inferSchema", "true")

.option("header", "true")

.load(file);



return df;

}

}




From:  Yanbo Liang 
Date:  Monday, December 28, 2015 at 2:30 AM
To:  zhangjp <592426...@qq.com>
Cc:  "user @spark" 
Subject:  Re: how to use sparkR or spark MLlib load csv file on hdfs then
calculate covariance

> Load csv file:
> df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv",
> header = "true")
> Calculate covariance:
> cov <- cov(df, "col1", "col2")
> 
> Cheers
> Yanbo
> 
> 
> 2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>:
>> hi  all,
>> I want  to use sparkR or spark MLlib  load csv file on hdfs then
>> calculate  covariance, how to do it .
>> thks.
> 




Re: Stuck with DataFrame df.select("select * from table");

2015-12-28 Thread Annabel Melongo
Jean,
Try this:

df.select("""select * from tmptable where x1 = '3.0'""").show();

Note: you have to use 3 double quotes as marked.

On Friday, December 25, 2015 11:30 AM, Eugene Morozov 
 wrote:
 

Thanks for the comments, although the issue is not in the limit() predicate. It's
something with Spark being unable to resolve the expression.

I can do something like this, and it works as it's supposed to:

df.select(df.col("*")).where(df.col("x1").equalTo(3.0)).show(5);

But I think the old fashioned SQL style has to work also. I have
df.registerTempTable("tmptable") and then

df.select("select * from tmptable where x1 = '3.0'").show();

org.apache.spark.sql.AnalysisException: cannot resolve 'select * from tmp where x1 = '1.0'' given input columns x1, x4, x5, x3, x2;
 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
 at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.sca

From the first statement I conclude that my custom datasource is perfectly fine.
Just wondering how to fix / work around that.
--
Be well!
Jean Morozov
On Fri, Dec 25, 2015 at 6:13 PM, Igor Berman  wrote:

sqlContext.sql("select * from table limit 5").show() (not sure if limit 5 
supported)

or use Dmitriy's solution. select() defines your projection when you've 
specified entire query
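
For reference, a small sketch of that distinction against the Spark 1.5.x Java
API (column names and the value 3.0 are taken from the thread): select() takes
column expressions, while full SQL text has to go through SQLContext.sql() on a
registered temp table.

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SelectVsSql {
    static void demo(SQLContext sqlContext, DataFrame df) {
        df.registerTempTable("tmptable");

        // Projection + predicate through the DataFrame API:
        df.select(df.col("x1"), df.col("x2"))
          .where(df.col("x1").equalTo(3.0))
          .show(5);

        // The equivalent as a SQL statement (passed to sql(), not to select()):
        sqlContext.sql("SELECT x1, x2 FROM tmptable WHERE x1 = 3.0 LIMIT 5").show();
    }
}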
On 25 December 2015 at 15:42, Василец Дмитрий  wrote:

hello
you can try to use df.limit(5).show()
just trick :)

On Fri, Dec 25, 2015 at 2:34 PM, Eugene Morozov  
wrote:

Hello, I'm basically stuck as I have no idea where to look.
The following simple code, given that my datasource is working, gives me an exception.

DataFrame df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource");
df.cache();
df.printSchema();   <-- prints the schema perfectly fine!

df.show();          <-- works perfectly fine (shows table with 20 lines)!
df.registerTempTable("table");
df.select("select * from table limit 5").show(); <-- gives weird exception

Exception is:
AnalysisException: cannot resolve 'select * from table limit 5' given input columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS

I can do a collect on a dataframe, but cannot select any specific columns
either "select * from table" or "select VER, CREATED from table".

I use Spark 1.5.2. The same code perfectly works through Zeppelin 0.5.5.

Thanks.
--
Be well!
Jean Morozov







  

Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Umesh Kacha
Thanks, but I have tried everything. To confirm, I am writing the code below; if
you can compile the following in Java with Spark 1.5.2 then great, otherwise
nothing here is helpful, as I have been stumbling over this for the last few days.

import static org.apache.spark.sql.functions.callUdf;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PercentileHiveApproxTestMain {

    public static void main(String[] args) {
        SparkConf sparkconf = new SparkConf()
                .setAppName("PercentileHiveApproxTestMain").setMaster("local[*]");
        SparkContext sc = new SparkContext(sparkconf);
        SQLContext sqlContext = new SQLContext(sc);
        // load two-column data from csv and create a dataframe with columns
        // C1(int), C0(string)
        DataFrame df =
                sqlContext.read().format("com.databricks.spark.csv").load("/tmp/df.csv");
        df.select(callUdf("percentile_approx", col("C1"), lit(0.25))).show(); // does not compile
    }
}
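
For comparison, a hedged sketch of a variant that should compile against the
1.5.2 Java API, assuming the spark-hive module is on the classpath: it uses the
non-deprecated varargs callUDF(String, Column...) overload and a HiveContext so
that the Hive percentile_approx UDAF can be resolved by name. This is a sketch
only, not a verified fix.

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class PercentileApproxSketch {
    public static void main(String[] args) {
        SparkConf sparkconf = new SparkConf()
                .setAppName("PercentileApproxSketch").setMaster("local[*]");
        SparkContext sc = new SparkContext(sparkconf);
        // HiveContext rather than plain SQLContext, so the Hive UDAF
        // percentile_approx can be resolved at analysis time.
        HiveContext sqlContext = new HiveContext(sc);
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .load("/tmp/df.csv");
        // callUDF(String, Column...) is the varargs overload added in 1.5.
        df.select(callUDF("percentile_approx", col("C1"), lit(0.25))).show();
    }
}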

On Mon, Dec 28, 2015 at 9:56 PM, Hamel Kothari 
wrote:

> If you scroll further down in the documentation, you will see that callUDF
> does have a version which takes (String, Column...) as arguments: *callUDF
> *
> (java.lang.String udfName, Column
> 
> ... cols)
>
> Unfortunately the link I posted above doesn't seem to work because of the
> punctuation in the URL but it is there. If you use "callUdf" from Java with
> a string argument, which is what you seem to be doing, it expects a
> Seq because of the way it is defined in scala. That's also a
> deprecated method anyways.
>
> The reason you're getting the exception is not because that's the wrong
> method to call. It's because the percentile_approx UDF is never registered.
> If you're passing in a UDF by name, you must register it with your SQL
> context as follows (example taken from the documentation of the above
> referenced method):
>
>   import org.apache.spark.sql._
>
>   val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
>   val sqlContext = df.sqlContext
>   sqlContext.udf.register("simpleUDF", (v: Int) => v * v)
>   df.select($"id", callUDF("simpleUDF", $"value"))
>
>
>
>
> On Mon, Dec 28, 2015 at 11:08 AM Umesh Kacha 
> wrote:
>
>> Hi thanks you understood question incorrectly. First of all I am passing
>> UDF name as String and if you see callUDF arguments then it does not take
>> string as first argument and if I use callUDF it will throw me exception
>> saying percentile_approx function not found. And another thing I mentioned
>> is that it works in Spark scala console so it does not have any problem of
>> calling it in not expected way. Hope now question is clear.
>>
>> On Mon, Dec 28, 2015 at 9:21 PM, Hamel Kothari 
>> wrote:
>>
>>> Also, if I'm reading correctly, it looks like you're calling "callUdf"
>>> when what you probably want is "callUDF" (notice the subtle capitalization
>>> difference). Docs:
>>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#callUDF(java.lang.String,%20org.apache.spark.sql.Column..
>>> .)
>>>
>>> On Mon, Dec 28, 2015 at 10:48 AM Hamel Kothari 
>>> wrote:
>>>
 Would you mind sharing more of your code? I can't really see the code
 that well from the attached screenshot but it appears that "Lit" is
 capitalized. Not sure what this method actually refers to but the
 definition in functions.scala is lowercased.

 Even if that's not it, some more code would be helpful to solving this.
 Also, since it's a compilation error, if you could share the compilation
 error that would be very useful.

 -Hamel

 On Mon, Dec 28, 2015 at 10:26 AM unk1102  wrote:

> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25821/Screen_Shot_2015-12-28_at_8.jpg
> >
>
> Hi I am trying to invoke Hive UDF using
> dataframe.select(callUdf("percentile_approx",col("C1"),lit(0.25))) but
> it
> does not compile however same call works in Spark scala console I dont
> understand why. I am using Spark 1.5.2 maven source in my Java code. I
> have
> also explicitly added maven dependency hive-exec-1.2.1.spark.jar where
> percentile_approx is located but still does not compile code please
> check
> attached code image. Please guide. Thanks in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrame-callUdf-does-not-compile-tp25821.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>>


Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Hamel Kothari
If you scroll further down in the documentation, you will see that callUDF
does have a version which takes (String, Column...) as arguments: *callUDF
*
(java.lang.String udfName, Column

... cols)

Unfortunately the link I posted above doesn't seem to work because of the
punctuation in the URL but it is there. If you use "callUdf" from Java with
a string argument, which is what you seem to be doing, it expects a
Seq because of the way it is defined in scala. That's also a
deprecated method anyways.

The reason you're getting the exception is not because that's the wrong
method to call. It's because the percentile_approx UDF is never registered.
If you're passing in a UDF by name, you must register it with your SQL
context as follows (example taken from the documentation of the above
referenced method):

  import org.apache.spark.sql._

  val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
  val sqlContext = df.sqlContext
  sqlContext.udf.register("simpleUDF", (v: Int) => v * v)
  df.select($"id", callUDF("simpleUDF", $"value"))




On Mon, Dec 28, 2015 at 11:08 AM Umesh Kacha  wrote:

> Hi thanks you understood question incorrectly. First of all I am passing
> UDF name as String and if you see callUDF arguments then it does not take
> string as first argument and if I use callUDF it will throw me exception
> saying percentile_approx function not found. And another thing I mentioned
> is that it works in Spark scala console so it does not have any problem of
> calling it in not expected way. Hope now question is clear.
>
> On Mon, Dec 28, 2015 at 9:21 PM, Hamel Kothari 
> wrote:
>
>> Also, if I'm reading correctly, it looks like you're calling "callUdf"
>> when what you probably want is "callUDF" (notice the subtle capitalization
>> difference). Docs:
>> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#callUDF(java.lang.String,%20org.apache.spark.sql.Column..
>> .)
>>
>> On Mon, Dec 28, 2015 at 10:48 AM Hamel Kothari 
>> wrote:
>>
>>> Would you mind sharing more of your code? I can't really see the code
>>> that well from the attached screenshot but it appears that "Lit" is
>>> capitalized. Not sure what this method actually refers to but the
>>> definition in functions.scala is lowercased.
>>>
>>> Even if that's not it, some more code would be helpful to solving this.
>>> Also, since it's a compilation error, if you could share the compilation
>>> error that would be very useful.
>>>
>>> -Hamel
>>>
>>> On Mon, Dec 28, 2015 at 10:26 AM unk1102  wrote:
>>>
 <
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n25821/Screen_Shot_2015-12-28_at_8.jpg
 >

 Hi I am trying to invoke Hive UDF using
 dataframe.select(callUdf("percentile_approx",col("C1"),lit(0.25))) but
 it
 does not compile however same call works in Spark scala console I dont
 understand why. I am using Spark 1.5.2 maven source in my Java code. I
 have
 also explicitly added maven dependency hive-exec-1.2.1.spark.jar where
 percentile_approx is located but still does not compile code please
 check
 attached code image. Please guide. Thanks in advance.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrame-callUdf-does-not-compile-tp25821.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>


Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Umesh Kacha
Hi, thanks, but you understood the question incorrectly. First of all, I am
passing the UDF name as a String, and if you look at the callUDF arguments it
does not take a string as the first argument; and if I use callUDF it throws an
exception saying the percentile_approx function was not found. Another thing I
mentioned is that it works in the Spark Scala console, so there is no problem
with calling it in an unexpected way. I hope the question is clear now.

On Mon, Dec 28, 2015 at 9:21 PM, Hamel Kothari 
wrote:

> Also, if I'm reading correctly, it looks like you're calling "callUdf"
> when what you probably want is "callUDF" (notice the subtle capitalization
> difference). Docs:
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#callUDF(java.lang.String,%20org.apache.spark.sql.Column..
> .)
>
> On Mon, Dec 28, 2015 at 10:48 AM Hamel Kothari 
> wrote:
>
>> Would you mind sharing more of your code? I can't really see the code
>> that well from the attached screenshot but it appears that "Lit" is
>> capitalized. Not sure what this method actually refers to but the
>> definition in functions.scala is lowercased.
>>
>> Even if that's not it, some more code would be helpful to solving this.
>> Also, since it's a compilation error, if you could share the compilation
>> error that would be very useful.
>>
>> -Hamel
>>
>> On Mon, Dec 28, 2015 at 10:26 AM unk1102  wrote:
>>
>>> <
>>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25821/Screen_Shot_2015-12-28_at_8.jpg
>>> >
>>>
>>> Hi I am trying to invoke Hive UDF using
>>> dataframe.select(callUdf("percentile_approx",col("C1"),lit(0.25))) but it
>>> does not compile however same call works in Spark scala console I dont
>>> understand why. I am using Spark 1.5.2 maven source in my Java code. I
>>> have
>>> also explicitly added maven dependency hive-exec-1.2.1.spark.jar where
>>> percentile_approx is located but still does not compile code please check
>>> attached code image. Please guide. Thanks in advance.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrame-callUdf-does-not-compile-tp25821.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>


Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Hamel Kothari
Also, if I'm reading correctly, it looks like you're calling "callUdf" when
what you probably want is "callUDF" (notice the subtle capitalization
difference). Docs:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#callUDF(java.lang.String,%20org.apache.spark.sql.Column..
.)

On Mon, Dec 28, 2015 at 10:48 AM Hamel Kothari 
wrote:

> Would you mind sharing more of your code? I can't really see the code that
> well from the attached screenshot but it appears that "Lit" is capitalized.
> Not sure what this method actually refers to but the definition in
> functions.scala is lowercased.
>
> Even if that's not it, some more code would be helpful to solving this.
> Also, since it's a compilation error, if you could share the compilation
> error that would be very useful.
>
> -Hamel
>
> On Mon, Dec 28, 2015 at 10:26 AM unk1102  wrote:
>
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25821/Screen_Shot_2015-12-28_at_8.jpg
>> >
>>
>> Hi I am trying to invoke Hive UDF using
>> dataframe.select(callUdf("percentile_approx",col("C1"),lit(0.25))) but it
>> does not compile however same call works in Spark scala console I dont
>> understand why. I am using Spark 1.5.2 maven source in my Java code. I
>> have
>> also explicitly added maven dependency hive-exec-1.2.1.spark.jar where
>> percentile_approx is located but still does not compile code please check
>> attached code image. Please guide. Thanks in advance.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrame-callUdf-does-not-compile-tp25821.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Spark DataFrame callUdf does not compile?

2015-12-28 Thread Hamel Kothari
Would you mind sharing more of your code? I can't really see the code that
well from the attached screenshot but it appears that "Lit" is capitalized.
Not sure what this method actually refers to but the definition in
functions.scala is lowercased.

Even if that's not it, some more code would be helpful to solving this.
Also, since it's a compilation error, if you could share the compilation
error that would be very useful.

-Hamel

On Mon, Dec 28, 2015 at 10:26 AM unk1102  wrote:

> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25821/Screen_Shot_2015-12-28_at_8.jpg
> >
>
> Hi I am trying to invoke Hive UDF using
> dataframe.select(callUdf("percentile_approx",col("C1"),lit(0.25))) but it
> does not compile however same call works in Spark scala console I dont
> understand why. I am using Spark 1.5.2 maven source in my Java code. I have
> also explicitly added maven dependency hive-exec-1.2.1.spark.jar where
> percentile_approx is located but still does not compile code please check
> attached code image. Please guide. Thanks in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrame-callUdf-does-not-compile-tp25821.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark DataFrame callUdf does not compile?

2015-12-28 Thread unk1102

 

Hi, I am trying to invoke a Hive UDF using
dataframe.select(callUdf("percentile_approx",col("C1"),lit(0.25))), but it
does not compile; however, the same call works in the Spark Scala console and I
don't understand why. I am using the Spark 1.5.2 Maven source in my Java code. I
have also explicitly added the Maven dependency hive-exec-1.2.1.spark.jar, where
percentile_approx is located, but the code still does not compile. Please check
the attached code image. Please guide. Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrame-callUdf-does-not-compile-tp25821.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Help: Driver OOM when shuffle large amount of data

2015-12-28 Thread Chris Fregly
which version of spark is this?

is there any chance that a single key - or set of keys - has a large number
of values relative to the other keys (aka skew)?

if so, spark 1.5 *should* fix this issue with the new tungsten stuff, although 
I had some issues still with 1.5.1 in a similar situation.

I'm waiting to test with 1.6.0 before I start asking/creating jiras.

> On Dec 28, 2015, at 5:23 AM, Eugene Morozov  
> wrote:
> 
> Kendal, 
> 
> have you tried to reduce number of partitions?
> 
> --
> Be well!
> Jean Morozov
> 
>> On Mon, Dec 28, 2015 at 9:02 AM, kendal  wrote:
>> My driver is running OOM with my 4T data set... I don't collect any data to
>> driver. All what the program done is map - reduce - saveAsTextFile. But the
>> partitions to be shuffled is quite large - 20K+.
>> 
>> The symptom what I'm seeing the timeout when GetMapOutputStatuses from
>> Driver.
>> 15/12/24 02:04:21 INFO spark.MapOutputTrackerWorker: Don't have map outputs
>> for shuffle 0, fetching them
>> 15/12/24 02:04:21 INFO spark.MapOutputTrackerWorker: Doing the fetch;
>> tracker endpoint =
>> AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.115.58.55:52077/user/MapOutputTracker#-1937024516])
>> 15/12/24 02:06:21 WARN akka.AkkaRpcEndpointRef: Error sending message
>> [message = GetMapOutputStatuses(0)] in 1 attempts
>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
>> seconds]. This timeout is controlled by spark.rpc.askTimeout
>> at
>> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
>> at
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
>> at
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
>> at
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
>> 
>> But the root cause is OOM:
>> 15/12/24 02:05:36 ERROR actor.ActorSystemImpl: Uncaught fatal error from
>> thread [sparkDriver-akka.remote.default-remote-dispatcher-24] shutting down
>> ActorSystem [sparkDriver]
>> java.lang.OutOfMemoryError: Java heap space
>> at java.util.Arrays.copyOf(Arrays.java:2271)
>> at
>> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
>> at akka.serialization.JavaSerializer.toBinary(Serializer.scala:131)
>> at
>> akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
>> at
>> akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
>> at
>> akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>> at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:842)
>> at akka.remote.EndpointWriter.writeSend(Endpoint.scala:743)
>> at
>> akka.remote.EndpointWriter$$anonfun$4.applyOrElse(Endpoint.scala:718)
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>> at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>> 
>> I've already allocated 16G memory for my driver - which is the hard limit
>> MAX of my Yarn cluster. And I also applied Kryo serialization... Any idea to
>> reduce memory foot point?
>> And what confuses me is that, even I have 20K+ partition to shuffle, why I
>> need so much memory?!
>> 
>> Thank you so much for any help!
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Help-Driver-OOM-when-shuffle-large-amount-of-data-tp25818.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 


FW: Problem About Worker System.out

2015-12-28 Thread David John




Thanks.

Can we use a slf4j/log4j logger to transfer our message from a worker to a
driver? I saw some discussions say that we can use this code to transfer the
message:

object Holder extends Serializable {
   @transient lazy val log = Logger.getLogger(getClass.getName)
}

val someRdd = spark.parallelize(List(1, 2, 3)).foreach { element =>
   Holder.log.info(element)
}

ref:
http://stackoverflow.com/questions/29208844/apache-spark-logging-within-scala

Is this a traditional way? Or does Spark have a SocketAppender for developers?

Date: Mon, 28 Dec 2015 17:52:17 +0800
Subject: Re: Problem About Worker System.out
From: sai.sai.s...@gmail.com
To: david_john_2...@outlook.com
CC: user@spark.apache.org
CC: user@spark.apache.org

Stdout will not be sent back to driver, no matter you use Scala or Java. You 
must do something wrongly that makes you think it is an expected behavior.
On Mon, Dec 28, 2015 at 5:33 PM, David John  wrote:



I have used  Spark 1.4  for 6 months.  Thanks  all the members of this 
community for your great work.I have a question  about the logging issue. I 
hope this question can be solved.
The program is running under this configurations: YARN Cluster, YARN-client 
mode.
In Scala,writing a code like:rdd.map( a => println(a) );   will get the output 
about the value of a in our console.
However,in Java (1.7),writing rdd.map( new Function(){ 
@Override public  Integer   call(Integer a) throws Exception {  
System.out.println(a); }});won't get the output in our console.
The configuration is the same.
I have try this code but not work either: rdd.map( new 
Function(){ @Override public  Integer   call(Integer 
a) throws Exception {org.apache.log4j.Logger log = 
Logger.getLogger(this.getClass()); log.info(a); 
log.warn(a); log.error(a); log.fatal(a); }});
No output either:final   org.apache.log4j.Logger log = 
Logger.getLogger(this.getClass()); rdd.map( new Function(){
 @Override public  Integer   call(Integer a) throws Exception { 
log.info(a); log.warn(a); log.error(a); 
log.fatal(a); }});
It seems that the output of stdout in worker doesn't send the output back to 
our driver.I am wonder why it works in scala but not in java.Is there a  simple 
way to make java work like scala?
Thanks. 
  


  

Re: Is there anyway to log properties from a Spark application

2015-12-28 Thread Jeff Zhang
If you run it in yarn-client mode, it will be in the client-side log. If it is
yarn-cluster mode, it will be logged in the AM container (the first
container).


On Mon, Dec 28, 2015 at 8:30 PM, Alvaro Brandon 
wrote:

> Thanks for the swift response.
>
> I'm launching my applications through YARN. Where will these properties be
> logged?. I guess they wont be part of YARN logs
>
> 2015-12-28 13:22 GMT+01:00 Jeff Zhang :
>
>> set spark.logConf as true in spark-default.conf will log the property in
>> driver side. But it would only log the property you set, not including the
>> properties with default value.
>>
>>
>> On Mon, Dec 28, 2015 at 8:18 PM, alvarobrandon 
>> wrote:
>>
>>> Hello:
>>>
>>> I was wondering if its possible to log properties from Spark Applications
>>> like spark.yarn.am.memory, spark.driver.cores,
>>> spark.reducer.maxSizeInFlight
>>> without having to access the SparkConf object programmatically. I'm
>>> trying
>>> to find some kind of log file that has traces of the execution of Spark
>>> apps
>>> and its parameters.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-anyway-to-log-properties-from-a-Spark-application-tp25820.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: Is there anyway to log properties from a Spark application

2015-12-28 Thread Alvaro Brandon
Thanks for the swift response.

I'm launching my applications through YARN. Where will these properties be
logged? I guess they won't be part of the YARN logs.

2015-12-28 13:22 GMT+01:00 Jeff Zhang :

> set spark.logConf as true in spark-default.conf will log the property in
> driver side. But it would only log the property you set, not including the
> properties with default value.
>
>
> On Mon, Dec 28, 2015 at 8:18 PM, alvarobrandon 
> wrote:
>
>> Hello:
>>
>> I was wondering if its possible to log properties from Spark Applications
>> like spark.yarn.am.memory, spark.driver.cores,
>> spark.reducer.maxSizeInFlight
>> without having to access the SparkConf object programmatically. I'm trying
>> to find some kind of log file that has traces of the execution of Spark
>> apps
>> and its parameters.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-anyway-to-log-properties-from-a-Spark-application-tp25820.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Is there anyway to log properties from a Spark application

2015-12-28 Thread Jeff Zhang
Setting spark.logConf to true in spark-defaults.conf will log the properties on
the driver side. But it would only log the properties you set, not including
the properties with default values.
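
The same property can also be set programmatically; a minimal sketch (the
application name is hypothetical):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LogConfExample {
    public static void main(String[] args) {
        // Equivalent to spark.logConf=true in spark-defaults.conf: the explicitly
        // set properties are printed in the driver log when the context starts.
        SparkConf conf = new SparkConf()
                .setAppName("LogConfExample")
                .set("spark.logConf", "true");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.stop();
    }
}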


On Mon, Dec 28, 2015 at 8:18 PM, alvarobrandon 
wrote:

> Hello:
>
> I was wondering if its possible to log properties from Spark Applications
> like spark.yarn.am.memory, spark.driver.cores,
> spark.reducer.maxSizeInFlight
> without having to access the SparkConf object programmatically. I'm trying
> to find some kind of log file that has traces of the execution of Spark
> apps
> and its parameters.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-anyway-to-log-properties-from-a-Spark-application-tp25820.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Best Regards

Jeff Zhang


Is there anyway to log properties from a Spark application

2015-12-28 Thread alvarobrandon
Hello:

I was wondering if it's possible to log properties from Spark applications
like spark.yarn.am.memory, spark.driver.cores, spark.reducer.maxSizeInFlight
without having to access the SparkConf object programmatically. I'm trying
to find some kind of log file that has traces of the execution of Spark apps
and their parameters.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-anyway-to-log-properties-from-a-Spark-application-tp25820.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Inconsistent behavior of randomSplit in YARN mode

2015-12-28 Thread Gaurav Kumar
Hi Ted,

I am using Spark 1.5.2

Without repartition in the picture, it works exactly as it's supposed to.
With repartition, I am guessing that when we call takeOrdered on train, it goes
ahead and computes the rdd, which has repartitioning on it, and prints out
the numbers. With the next call to takeOrdered on test, it again computes
the rdd and again repartitions the data. Since repartitioning is not
guaranteed to produce the same result again, we see different numbers
because the rdd is effectively different now.

Moreover, if we cache the rdd after repartitioning it, both train and test
produce consistent results.
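
For reference, a sketch of that workaround in the Java API (assuming Spark
1.5.x): caching the repartitioned RDD pins its contents, so both splits are
computed from the same data every time.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class StableSplit {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("StableSplit").setMaster("local[2]"));
        List<Integer> data = new ArrayList<Integer>();
        for (int i = 1; i <= 100; i++) data.add(i);

        // cache() makes the non-deterministic repartition result stable across actions.
        JavaRDD<Integer> rdd2 = sc.parallelize(data).repartition(64).cache();
        JavaRDD<Integer>[] splits = rdd2.randomSplit(new double[]{70, 30}, 1L);
        System.out.println(splits[0].takeOrdered(10));
        System.out.println(splits[1].takeOrdered(10));
        sc.stop();
    }
}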



Best Regards,
Gaurav Kumar
Big Data • Data Science • Photography • Music
+91 9953294125

On Mon, Dec 28, 2015 at 3:04 PM, Ted Yu  wrote:

> bq. the train and test have overlap in the numbers being outputted
>
> Can the call to repartition explain the above ?
>
> Which release of Spark are you using ?
>
> Thanks
>
> On Sun, Dec 27, 2015 at 9:56 PM, Gaurav Kumar 
> wrote:
>
>> Hi,
>>
>> I noticed an inconsistent behavior when using rdd.randomSplit when the
>> source rdd is repartitioned, but only in YARN mode. It works fine in local
>> mode though.
>>
>> *Code:*
>> val rdd = sc.parallelize(1 to 100)
>> val rdd2 = rdd.repartition(64)
>> rdd.partitions.size
>> rdd2.partitions.size
>> val Array(train, test) = *rdd2*.randomSplit(Array(70, 30), 1)
>> train.takeOrdered(10)
>> test.takeOrdered(10)
>>
>> *Master: local*
>> Both the take statements produce consistent results and have no overlap
>> in numbers being outputted.
>>
>>
>> *Master: YARN*However, when these are run on YARN mode, these produce
>> random results every time and also the train and test have overlap in the
>> numbers being outputted.
>> If I use *rdd*.randomSplit, then it works fine even on YARN.
>>
>> So, it concludes that the repartition is being evaluated every time the
>> splitting occurs.
>>
>> Interestingly, if I cache the rdd2 before splitting it, then we can
>> expect consistent behavior since repartition is not evaluated again and
>> again.
>>
>> Best Regards,
>> Gaurav Kumar
>> Big Data • Data Science • Photography • Music
>> +91 9953294125
>>
>
>


Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Yanbo Liang
Load csv file:
df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv",
header = "true")
Calculate covariance:
cov <- cov(df, "col1", "col2")

Cheers
Yanbo


2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>:

> hi  all,
> I want  to use sparkR or spark MLlib  load csv file on hdfs then
> calculate  covariance, how to do it .
> thks.
>


Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-28 Thread Priya Ch
Chris, we are using Spark version 1.3.0. We have not set the
spark.streaming.concurrentJobs parameter; it takes the default value.

Vijay,

From the stack trace it is evident that
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
is throwing the exception. I opened the Spark source code and visited the
line which is throwing this exception, i.e.:

[image: Inline image 1]

The line which is marked in red is throwing the exception. The file is
ExternalSorter.scala in the org.apache.spark.util.collection package.

I went through the following blog
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
and understood that there is a merge factor which decides the number of
on-disk files that can be merged. Is it somehow related to this?

Regards,
Padma CH

On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly  wrote:

> and which version of Spark/Spark Streaming are you using?
>
> are you explicitly setting the spark.streaming.concurrentJobs to
> something larger than the default of 1?
>
> if so, please try setting that back to 1 and see if the problem still
> exists.
>
> this is a dangerous parameter to modify from the default - which is why
> it's not well-documented.
>
>
> On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge 
> wrote:
>
>> Few indicators -
>>
>> 1) during execution time - check total number of open files using lsof
>> command. Need root permissions. If it is cluster not sure much !
>> 2) which exact line in the code is triggering this error ? Can you paste
>> that snippet ?
>>
>>
>> On Wednesday 23 December 2015, Priya Ch 
>> wrote:
>>
>>> ulimit -n 65000
>>>
>>> fs.file-max = 65000 ( in etc/sysctl.conf file)
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>> On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma  wrote:
>>>
 Could you share the ulimit for your setup please ?

 - Thanks, via mobile,  excuse brevity.
 On Dec 22, 2015 6:39 PM, "Priya Ch" 
 wrote:

> Jakob,
>
>Increased the settings like fs.file-max in /etc/sysctl.conf and
> also increased user limit in /etc/security/limits.conf. But still see
> the same issue.
>
> On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky 
> wrote:
>
>> It might be a good idea to see how many files are open and try
>> increasing the open file limit (this is done on an os level). In some
>> application use-cases it is actually a legitimate need.
>>
>> If that doesn't help, make sure you close any unused files and
>> streams in your code. It will also be easier to help diagnose the issue 
>> if
>> you send an error-reproducing snippet.
>>
>
>
>>>
>>
>> --
>> Regards,
>> Vijay Gharge
>>
>>
>>
>>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>


Re: Help: Driver OOM when shuffle large amount of data

2015-12-28 Thread Eugene Morozov
Kendal,

have you tried to reduce the number of partitions?
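
For reference, a hedged sketch of what that could look like for the
map -> reduce -> saveAsTextFile pipeline described below (key/value types and
the partition count are hypothetical): passing an explicit, smaller numPartitions
to the shuffle keeps the map output status table the driver has to serve much smaller.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function2;

public class FewerShufflePartitions {
    static void save(JavaPairRDD<String, Long> mapped, String outputPath) {
        mapped.reduceByKey(new Function2<Long, Long, Long>() {
            @Override
            public Long call(Long a, Long b) {
                return a + b;
            }
        }, 2000)  // explicit reduce-side partition count, far below 20K+
        .saveAsTextFile(outputPath);
    }
}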

--
Be well!
Jean Morozov

On Mon, Dec 28, 2015 at 9:02 AM, kendal  wrote:

> My driver is running OOM with my 4T data set... I don't collect any data to
> driver. All what the program done is map - reduce - saveAsTextFile. But the
> partitions to be shuffled is quite large - 20K+.
>
> The symptom what I'm seeing the timeout when GetMapOutputStatuses from
> Driver.
> 15/12/24 02:04:21 INFO spark.MapOutputTrackerWorker: Don't have map outputs
> for shuffle 0, fetching them
> 15/12/24 02:04:21 INFO spark.MapOutputTrackerWorker: Doing the fetch;
> tracker endpoint =
> AkkaRpcEndpointRef(Actor[akka.tcp://
> sparkDriver@10.115.58.55:52077/user/MapOutputTracker#-1937024516])
> 15/12/24 02:06:21 WARN akka.AkkaRpcEndpointRef: Error sending message
> [message = GetMapOutputStatuses(0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
> at
>
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
> at
>
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
> at
>
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
>
> But the root cause is OOM:
> 15/12/24 02:05:36 ERROR actor.ActorSystemImpl: Uncaught fatal error from
> thread [sparkDriver-akka.remote.default-remote-dispatcher-24] shutting down
> ActorSystem [sparkDriver]
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at
> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
> at akka.serialization.JavaSerializer.toBinary(Serializer.scala:131)
> at
> akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
> at
>
> akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
> at
>
> akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:842)
> at akka.remote.EndpointWriter.writeSend(Endpoint.scala:743)
> at
> akka.remote.EndpointWriter$$anonfun$4.applyOrElse(Endpoint.scala:718)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
> at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>
> I've already allocated 16G memory for my driver - which is the hard limit
> MAX of my Yarn cluster. And I also applied Kryo serialization... Any idea
> to
> reduce memory foot point?
> And what confuses me is that, even I have 20K+ partition to shuffle, why I
> need so much memory?!
>
> Thank you so much for any help!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Help-Driver-OOM-when-shuffle-large-amount-of-data-tp25818.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Problem About Worker System.out

2015-12-28 Thread Saisai Shao
Stdout will not be sent back to the driver, no matter whether you use Scala or
Java. You must be doing something wrong that makes you think it is expected
behavior.
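
For illustration, a hedged Java sketch of the two usual alternatives (assuming
log4j is on the executor classpath and the data is small enough to collect):
executor-side log output lands in the executor/container logs, and only
collected values can be printed on the driver.

import java.util.List;

import org.apache.log4j.Logger;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction;

public class WorkerLogging {
    // Static logger: it is initialized separately in each executor JVM when the
    // class is loaded there, so it is never serialized with the closure.
    private static final Logger LOG = Logger.getLogger(WorkerLogging.class);

    static void demo(JavaRDD<Integer> rdd) {
        rdd.foreach(new VoidFunction<Integer>() {
            @Override
            public void call(Integer a) {
                LOG.warn(a);  // shows up in the executor (YARN container) logs
            }
        });
        List<Integer> values = rdd.collect();  // only safe for small RDDs
        for (Integer a : values) {
            System.out.println(a);  // printed in the driver console
        }
    }
}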

On Mon, Dec 28, 2015 at 5:33 PM, David John 
wrote:

> I have used  Spark *1.4*  for 6 months.  Thanks  all the members of this
> community for your great work.
> I have a question  about the logging issue. I hope this question can be
> solved.
>
> The program is running under this configurations:
> YARN Cluster,
> YARN-client mode.
>
> In *Scala*,
> writing a code like:
> *rdd.map( a => println(a) );  *
>  will get the output about the value of a in our console.
>
> However,
> in *Java (1.7)*,
> writing
> rdd.map( new Function(){
>  @Override
>  public  Integer   call(Integer a) throws Exception {
>   *System.out.println(a);*
>  }
> });
> won't get the output in our console.
>
> The configuration is the same.
>
> I have try this code but not work either:
>  rdd.map( new Function(){
>  @Override
>  public  Integer   call(Integer a) throws Exception {
> org.apache.log4j.Logger log =
> Logger.getLogger(this.getClass());
>  log.info(a);
>  log.warn(a);
>  log.error(a);
>  log.fatal(a);
>  }
> });
>
> No output either:
> final   org.apache.log4j.Logger log = Logger.getLogger(this.getClass());
>  rdd.map( new Function(){
>  @Override
>  public  Integer   call(Integer a) throws Exception {
>  log.info(a);
>  log.warn(a);
>  log.error(a);
>  log.fatal(a);
>  }
> });
>
> It seems that the output of stdout in worker doesn't send the output back
> to our driver.
> I am wonder why it works in scala but not in java.
> Is there a  simple way to make java work like scala?
>
> Thanks.
>


Timestamp datatype in dataframe + Spark 1.4.1

2015-12-28 Thread Divya Gehlot
Hi,
I have an input data set which is a CSV file with date columns.
My output will also be a CSV file, and I will use this output CSV file for
Hive table creation.
I have a few queries:
1. I tried using a custom schema with Timestamp, but it returns an empty
result set when querying the dataframes.
2. Can I use the String datatype in Spark for the date column and define it as
a date type while creating the table? My Hive table will be partitioned on the
date column.

I would really appreciate it if you could share some sample code for timestamps
in a DataFrame that can also be used while creating the Hive table.



Thanks,
Divya
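
A hedged sketch of one way to approach question 2 (the column names, file
path, and date text format are illustrative assumptions, not taken from the
post): load the date column as a string with spark-csv, then cast it to a
timestamp column, which works when the text already looks like
yyyy-MM-dd HH:mm:ss; unparseable values simply become null instead of the
whole result coming back empty.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Assumes sc is an existing SparkContext and the spark-csv package is on the classpath.
val sqlContext = new SQLContext(sc)

// Read the date column as a plain string first (hypothetical column names).
val schema = StructType(Seq(
  StructField("id", StringType, true),
  StructField("event_date", StringType, true)
))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(schema)
  .load("/tmp/input.csv")                        // hypothetical path

// Derive a proper timestamp column from the string.
val withTs = df.withColumn("event_ts", df("event_date").cast("timestamp"))
withTs.show(5)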


Using Spark for high concurrent load tasks

2015-12-28 Thread Aliaksei Tsyvunchyk
Hello Spark community,

We have a project where we want to use Spark as the computation engine to
perform calculations and return results via REST services.
While working with Spark we learned how to make it run faster and finally
optimized our code to produce results in acceptable time (1-2 seconds).
But when we tested it under concurrent load, we realized that response time
grows significantly as the number of concurrent requests increases. This was
expected, and we are trying to find a way to scale our solution to keep
acceptable response times under concurrent load. Here we ran into the fact
that adding more slave servers does not improve the average timing when
several requests run concurrently.

For now I observe the following behavior: when hitting our test REST service
with 100 threads, with 1 master and one slave node, the average timing for
those 100 requests is 30.6 seconds/request; when we add 2 slave nodes the
average time becomes 29.8 seconds/request, which is pretty similar to the test
case with one slave node. While running these test cases we monitor server
load with htop, and the odd thing is that in the first case the slave node's
CPUs were loaded at 90-95%, while in the second case with 2 slaves the CPUs
sit at 45-50%.
We are trying to find the bottleneck in our solution but have not succeeded
yet. We have checked all hardware for possible bottlenecks (network IO, disk
IO, RAM, CPU), but none of them seems anywhere near its limit.

Our main suspect at this moment is the Spark configuration. Our application is
a self-contained Spark application, and Spark runs in standalone mode (without
an external resource manager). We are not submitting a jar to Spark through a
shell script; instead we create the SparkContext inside our Spring Boot
application, and it connects to the Spark master and slaves by itself. All
requests to Spark go through an internal thread pool. We have experimented
with thread pool sizes and found that the best performance appears with 16
threads in the pool, where each thread performs one or several manipulations
on RDDs and can by itself be considered a Spark job.

For now I assume that we either misconfigured Spark due to our lack of
experience with it, or our use case is not really a use case for Spark and it
is simply not designed for this kind of parallel load. I draw this conclusion
because getting no performance gain after adding new nodes does not make any
sense to me. So any help and ideas would be really appreciated.
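
One hedged thing to look at, sketched below rather than verified against this
particular setup: when many jobs share a single SparkContext, the default FIFO
scheduler can let one job occupy all cores while the rest queue, so enabling
FAIR scheduling inside the application and capping spark.cores.max are common
knobs for concurrent request-style workloads. The master URL, application
name, and values shown are illustrative assumptions.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative configuration for a long-running SparkContext that serves
// many concurrent jobs submitted from an internal thread pool.
val conf = new SparkConf()
  .setAppName("rest-backed-spark")                 // hypothetical app name
  .setMaster("spark://master-host:7077")           // hypothetical master URL
  // Schedule concurrent jobs within this context fairly instead of FIFO.
  .set("spark.scheduler.mode", "FAIR")
  // Cap the cores this application grabs from the standalone cluster,
  // leaving headroom if other applications share it.
  .set("spark.cores.max", "32")

val sc = new SparkContext(conf)

// Each REST request handler can then submit its own job; with FAIR
// scheduling, short jobs are not stuck behind a long-running one.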


Hardware details:
Azure D5v2 instances for the master and 3 slaves. D5v2 features a 2.4 GHz
E5-2673 v3 (16 cores, 56 GB RAM, SSD storage). Using iperf we measured the
network speed at around 1 Gb/s.




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Inconsistent behavior of randomSplit in YARN mode

2015-12-28 Thread Ted Yu
bq. the train and test have overlap in the numbers being outputted

Can the call to repartition explain the above ?

Which release of Spark are you using ?

Thanks

On Sun, Dec 27, 2015 at 9:56 PM, Gaurav Kumar 
wrote:

> Hi,
>
> I noticed an inconsistent behavior when using rdd.randomSplit when the
> source rdd is repartitioned, but only in YARN mode. It works fine in local
> mode though.
>
> *Code:*
> val rdd = sc.parallelize(1 to 100)
> val rdd2 = rdd.repartition(64)
> rdd.partitions.size
> rdd2.partitions.size
> val Array(train, test) = *rdd2*.randomSplit(Array(70, 30), 1)
> train.takeOrdered(10)
> test.takeOrdered(10)
>
> *Master: local*
> Both the take statements produce consistent results and have no overlap in
> numbers being outputted.
>
>
> *Master: YARN*
> However, when these are run in YARN mode, they produce different results
> every time, and train and test overlap in the numbers being output.
> If I use *rdd*.randomSplit instead, it works fine even on YARN.
>
> So it appears that the repartition is being re-evaluated every time the
> split is computed.
>
> Interestingly, if I cache rdd2 before splitting it, the behavior becomes
> consistent, since the repartition is not evaluated again and again.
>
> Best Regards,
> Gaurav Kumar
> Big Data • Data Science • Photography • Music
> +91 9953294125
>
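
A short sketch of the caching workaround described above; the only assumption
added here is that the splits are consumed more than once, which is what
exposes the re-evaluation of the repartition.

import org.apache.spark.SparkContext

// Assumes sc is an existing SparkContext (e.g. in spark-shell).
val rdd = sc.parallelize(1 to 100)

// Persist the repartitioned RDD so randomSplit samples a stable parent
// instead of re-running the non-deterministic shuffle on every action.
val rdd2 = rdd.repartition(64).cache()
rdd2.count()  // materialize the cache before splitting

val Array(train, test) = rdd2.randomSplit(Array(70.0, 30.0), seed = 1)
train.takeOrdered(10)
test.takeOrdered(10)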


Problem About Worker System.out

2015-12-28 Thread David John
I have been using Spark 1.4 for 6 months. Thanks to all the members of this
community for your great work. I have a question about a logging issue and
hope it can be resolved.

The program is running under this configuration: YARN cluster, YARN-client
mode.

In Scala, writing code like:
rdd.map( a => println(a) )
prints the value of a to our console.

However, in Java (1.7), writing
rdd.map( new Function<Integer, Integer>(){
 @Override
 public Integer call(Integer a) throws Exception {
  System.out.println(a);
  return a;
 }
});
does not produce the output in our console.

The configuration is the same.

I have tried this code, but it does not work either:
rdd.map( new Function<Integer, Integer>(){
 @Override
 public Integer call(Integer a) throws Exception {
  org.apache.log4j.Logger log = Logger.getLogger(this.getClass());
  log.info(a);
  log.warn(a);
  log.error(a);
  log.fatal(a);
  return a;
 }
});

No output either:
final org.apache.log4j.Logger log = Logger.getLogger(this.getClass());
rdd.map( new Function<Integer, Integer>(){
 @Override
 public Integer call(Integer a) throws Exception {
  log.info(a);
  log.warn(a);
  log.error(a);
  log.fatal(a);
  return a;
 }
});

It seems that stdout output from the workers is not sent back to our driver.
I wonder why it works in Scala but not in Java.
Is there a simple way to make Java behave like Scala here?

Thanks.
  

how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread zhangjp
hi all,
 I want to use SparkR or Spark MLlib to load a csv file from HDFS and then
 calculate the covariance. How do I do it?
 Thanks.
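
A hedged MLlib sketch (the HDFS path and an all-numeric, headerless CSV are
assumptions): parse each line into a dense vector, wrap the RDD in a
RowMatrix, and call computeCovariance. The result is a local matrix on the
driver, so memory grows with the square of the column count.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Assumes sc is an existing SparkContext; hypothetical path.
val rows = sc.textFile("hdfs:///data/matrix.csv")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))

val mat = new RowMatrix(rows)

// n x n covariance matrix computed on the driver (n = number of columns).
val cov = mat.computeCovariance()
println(cov)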

returns empty result set when using TimestampType and NullType as StructType +DataFrame +Scala + Spark 1.4.1

2015-12-28 Thread Divya Gehlot
SQL context available as sqlContext.

>
> scala> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.hive.HiveContext
>
> scala> import org.apache.spark.sql.hive.orc._
> import org.apache.spark.sql.hive.orc._
>
> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> 15/12/28 03:34:57 WARN SparkConf: The configuration key
> 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark
> 1.3 and and may be removed in the future. Please use the new key
> 'spark.yarn.am.waitTime' instead.
> 15/12/28 03:34:57 INFO HiveContext: Initializing execution hive, version
> 0.13.1
> hiveContext: org.apache.spark.sql.hive.HiveContext =
> org.apache.spark.sql.hive.HiveContext@3413fbe
>
> scala> import org.apache.spark.sql.types.{StructType, StructField,
> StringType, IntegerType,FloatType ,LongType ,TimestampType,NullType };
> import org.apache.spark.sql.types.{StructType, StructField, StringType,
> IntegerType, FloatType, LongType, TimestampType, NullType}
>
> scala> val loandepoSchema = StructType(Seq(
>  | StructField("COLUMN1", StringType, true),
>  | StructField("COLUMN2", StringType  , true),
>  | StructField("COLUMN3", TimestampType , true),
>  | StructField("COLUMN4", TimestampType , true),
>  | StructField("COLUMN5", StringType , true),
>  | StructField("COLUMN6", StringType, true),
>  | StructField("COLUMN7", IntegerType, true),
>  | StructField("COLUMN8", IntegerType, true),
>  | StructField("COLUMN9", StringType, true),
>  | StructField("COLUMN10", IntegerType, true),
>  | StructField("COLUMN11", IntegerType, true),
>  | StructField("COLUMN12", IntegerType, true),
>  | StructField("COLUMN13", StringType, true),
>  | StructField("COLUMN14", StringType, true),
>  | StructField("COLUMN15", StringType, true),
>  | StructField("COLUMN16", StringType, true),
>  | StructField("COLUMN17", StringType, true),
>  | StructField("COLUMN18", StringType, true),
>  | StructField("COLUMN19", StringType, true),
>  | StructField("COLUMN20", StringType, true),
>  | StructField("COLUMN21", StringType, true),
>  | StructField("COLUMN22", StringType, true)))
> loandepoSchema: org.apache.spark.sql.types.StructType =
> StructType(StructField(COLUMN1,StringType,true),
> StructField(COLUMN2,StringType,true),
> StructField(COLUMN3,TimestampType,true),
> StructField(COLUMN4,TimestampType,true),
> StructField(COLUMN5,StringType,true), StructField(COLUMN6,StringType,true),
> StructField(COLUMN7,IntegerType,true),
> StructField(COLUMN8,IntegerType,true),
> StructField(COLUMN9,StringType,true),
> StructField(COLUMN10,IntegerType,true),
> StructField(COLUMN11,IntegerType,true),
> StructField(COLUMN12,IntegerType,true),
> StructField(COLUMN13,StringType,true),
> StructField(COLUMN14,StringType,true),
> StructField(COLUMN15,StringType,true),
> StructField(COLUMN16,StringType,true),
> StructField(COLUMN17,StringType,true),
> StructField(COLUMN18,StringType,true), StructField(COLUMN19,Strin...
> scala> val lonadepodf =
> hiveContext.read.format("com.databricks.spark.csv").option("header",
> "true").schema(loandepoSchema).load("/tmp/TestDivya/loandepo_10K.csv")
> 15/12/28 03:37:52 INFO HiveContext: Initializing HiveMetastoreConnection
> version 0.13.1 using Spark classes.
> lonadepodf: org.apache.spark.sql.DataFrame = [COLUMN1: string, COLUMN2:
> string, COLUMN3: timestamp, COLUMN4: timestamp, COLUMN5: string, COLUMN6:
> string, COLUMN7: int, COLUMN8: int, COLUMN9: string, COLUMN10: int,
> COLUMN11: int, COLUMN12: int, COLUMN13: string, COLUMN14: string, COLUMN15:
> string, COLUMN16: string, COLUMN17: string, COLUMN18: string, COLUMN19:
> string, COLUMN20: string, COLUMN21: string, COLUMN22: string]
>
> scala> lonadepodf.select("COLUMN1").show(10)
> 15/12/28 03:38:01 INFO MemoryStore: ensureFreeSpace(216384) called with
> curMem=0, maxMem=278302556
> 15/12/28 03:38:01 INFO MemoryStore: Block broadcast_0 stored as values in
> memory (estimated size 211.3 KB, free 265.2 MB)
>
> ...
> 15/12/28 03:38:07 INFO DAGScheduler: ResultStage 2 (show at <console>:33)
> finished in 0.653 s
> 15/12/28 03:38:07 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks
> have all completed, from pool
> 15/12/28 03:38:07 INFO DAGScheduler: Job 2 finished: show at <console>:33,
> took 0.669388 s
> +---+
> |COLUMN1|
> +---+
> +---+
>
>
> scala> val loandepoSchema = StructType(Seq(
>  | StructField("COLUMN1", StringType, true),
>  | StructField("COLUMN2", StringType  , true),
>  | StructField("COLUMN3", StringType , true),
>  | StructField("COLUMN4", StringType , true),
>  | StructField("COLUMN5", StringType , true),
>  | StructField("COLUMN6", StringType, true),
>  | StructField("COLUMN7", StringType, true),
>  | StructField("COLUMN8", StringType, true),
>  | StructField("COLUMN9", StringType, true),
>  | StructField("COLUMN10

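A hedged sketch of a workaround for the empty result above (the date pattern
is an illustrative assumption, since the actual CSV contents are not shown):
declare COLUMN3 as StringType when loading so rows are not dropped, then parse
it into a timestamp with a small UDF.

import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf

// Assumes lonadepodf was re-loaded with COLUMN3 declared as StringType
// instead of TimestampType, so the rows survive the load.
// Hypothetical source format; adjust to whatever the CSV actually contains.
val parseTs = udf { (s: String) =>
  try {
    new Timestamp(new SimpleDateFormat("dd/MM/yyyy").parse(s).getTime)
  } catch {
    case _: Exception => null.asInstanceOf[Timestamp]  // unparseable or null input becomes SQL NULL
  }
}

val withTs = lonadepodf.withColumn("COLUMN3_TS", parseTs(lonadepodf("COLUMN3")))
withTs.select("COLUMN1", "COLUMN3_TS").show(10)
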

Re: partitioning json data in spark

2015-12-28 Thread Նարեկ Գալստեան
Well, I could try to do that,
but the *partitionBy* method is anyway only supported for the Parquet format,
even in Spark 1.5.1.

Narek

Narek Galstyan

Նարեկ Գալստյան

On 27 December 2015 at 21:50, Ted Yu  wrote:

> Is upgrading to 1.5.x a possibility for you ?
>
> Cheers
>
> On Sun, Dec 27, 2015 at 9:28 AM, Նարեկ Գալստեան 
> wrote:
>
>>
>> http://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>>  I did try, but it was all in vain.
>> It is also explicitly stated in the API docs that it only supports Parquet.
>>
>> ​
>>
>> Narek Galstyan
>>
>> Նարեկ Գալստյան
>>
>> On 27 December 2015 at 17:52, Igor Berman  wrote:
>>
>>> have you tried specifying the format of your output? maybe parquet is the
>>> default format?
>>> df.write().format("json").mode(SaveMode.Overwrite).save("/tmp/path");
>>>
>>> On 27 December 2015 at 15:18, Նարեկ Գալստեան 
>>> wrote:
>>>
 Hey all!
 I want to partition *json* data by a column name and store the
 result as a collection of json files to be loaded into another database.

 I could use Spark's built-in *partitionBy* function, but it only outputs
 the Parquet format, which is not desirable for me.

 Could you suggest a way to deal with this problem?
 Narek Galstyan

 Նարեկ Գալստյան

>>>
>>>
>>
>
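
A hedged sketch of a manual workaround when partitionBy is not available for
the JSON writer in the Spark version at hand (the column name and output root
are illustrative assumptions): collect the distinct key values and write one
JSON directory per value, using a Hive-style directory name.

import org.apache.spark.sql.DataFrame

def writeJsonByKey(df: DataFrame, keyCol: String, root: String): Unit = {
  // Collect the distinct partition values to the driver; fine as long as
  // their number is small.
  val keys = df.select(keyCol).distinct().collect().map(_.get(0))

  keys.foreach { k =>
    df.filter(df(keyCol) === k)
      .write
      .format("json")
      .save(s"$root/$keyCol=$k")   // Hive-style partition directory naming
  }
}

// Example usage (hypothetical DataFrame, column, and path):
// writeJsonByKey(df, "country", "/tmp/partitioned_json")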