Re: Cleaning spark memory

2016-06-10 Thread Ricky
Sent from my iPhone -- -- From: Takeshi Yamamuro Date: 2016-06-11 12:15 To: Cesar Flores Cc: user Subject: Re: Cleaning spark memory

Re: Cleaning spark memory

2016-06-10 Thread Takeshi Yamamuro
Hi, if you want to remove all cached data, please use `SQLContext#clearCache`. // maropu On Sat, Jun 11, 2016 at 3:18 AM, Cesar Flores wrote: > > Hello: > > Sometimes I cache data frames to memory that I forget to unpersist, losing > the variable reference in the process. > > Is
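For reference, a minimal Scala sketch of both halves of this exchange — listing what is still persisted and then clearing everything — assuming a SparkContext sc and a SQLContext sqlContext are already in scope:

// List RDDs that are still marked as persisted (id -> RDD).
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id: storageLevel=${rdd.getStorageLevel}")
}

// Drop every cached table/DataFrame tracked by the SQLContext.
sqlContext.clearCache()

// Unpersist any remaining cached RDDs individually.
sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))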

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Thanks for the suggestion. Can you suggest where and how I should start from scratch to work on Spark? On Fri, Jun 10, 2016 at 8:18 PM, Holden Karau wrote: > So that's a bit complicated - you might want to start with reading the > code for the existing algorithms

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Mich, I completely agree with you. I built another Spark SQL application which reads data from MySQL and SQL Server and writes the data into Hive (parquet+snappy format). I have this problem only when I read directly from the remote SAS system. The interesting part is I am using the same driver to read

Neither previous window has value for key, nor new values found

2016-06-10 Thread Marco Platania
Hi all,  I'm running a Spark Streaming application that uses reduceByKeyAndWindow(). The window interval is 2 hours, while the slide interval is 1 hour. I have a JavaPairRDD in which both keys and values are strings. Each time the reduceByKeyAndWindow() function is called, it uses
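For readers following along, a minimal Scala sketch of the setup described here — reduceByKeyAndWindow() over (String, String) pairs with a 2-hour window and a 1-hour slide; the reduce function is only illustrative:

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

// pairs: a DStream[(String, String)] built from the input source (assumption).
def windowedPairs(pairs: DStream[(String, String)]): DStream[(String, String)] =
  pairs.reduceByKeyAndWindow(
    (a: String, b: String) => a + "," + b,  // illustrative reduce: concatenate values
    Minutes(120),                           // window length: 2 hours
    Minutes(60))                            // slide interval: 1 hour

The warning in the subject line is, as far as I recall, raised by the invertible variant of reduceByKeyAndWindow (the overload that also takes an inverse reduce function), which keeps per-key state across windows.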

Slow collecting of large Spark Data Frames into R

2016-06-10 Thread Jonathan Mortensen
Hey Everyone! I've been converting between Parquet <-> Spark Data Frames <-> R Data Frames for larger data sets. I have found the conversion speed quite slow in the Spark <-> R side and am looking for some insight on how to speed it up (or determine what I have failed to do properly)! In R,

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Mich Talebzadeh
I personally use Scala to do something similar. For example, here I extract data from an Oracle table and store it in an ORC table in Hive. This is compiled via sbt and run with spark-submit. It is similar to your code but in Scala. Note that I do not enclose my column names in double quotes. import
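The approach Mich describes looks roughly like the sketch below (not his actual code; the connection details, table names and columns are placeholders):

import org.apache.spark.sql.hive.HiveContext

// sc is the SparkContext created by spark-submit (assumption).
val sqlContext = new HiveContext(sc)

val sales = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//orahost:1521/orcl")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "(SELECT prod_id, amount_sold FROM sales) s")
  .option("user", "scratchpad")
  .option("password", "***")
  .load()

// Note: no double quotes around any column names above.
sales.write.format("orc").mode("overwrite").saveAsTable("oraclehadoop.sales_orc")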

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Hi Mich, Thanks for the response. If you look at my programs, I am not writing my queries to include column names in a pair of "". The driver in my Spark program is generating such a query with column names in "", which I do not want. On the other hand, I am using the same driver in my pure Java

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Mich Talebzadeh
Assuming I understood your query: in Spark SQL (that is, when you log in to Spark SQL with something like spark-sql --master spark://:7077) you do not need double quotes around column names for the SQL to work: spark-sql> select "hello from Mich" from oraclehadoop.sales limit 1; hello from Mich Anything between a pair of

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Hi again, has anyone in this group tried to access a SAS dataset through Spark SQL? Thank you Regards, Ajay On Friday, June 10, 2016, Ajay Chander wrote: > Hi Spark Users, > > I hope everyone here is doing great. > > I am trying to read data from SAS through Spark SQL and

Re: Saving Parquet files to S3

2016-06-10 Thread Bijay Kumar Pathak
Hi Ankur, I also tried setting a property to write parquet files of 256MB. I am using pyspark; below is how I set the property, but it's not working for me. How did you set the property? spark_context._jsc.hadoopConfiguration().setInt( "dfs.blocksize", 268435456)

Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gourav Sengupta
Hi Gavin, for the first time someone is responding to this thread with a meaningful conversation - thanks for that. Okay, I did not tweak the spark.sql.autoBroadcastJoinThreshold parameter, and since the cached field was around 75 MB I do not think that a broadcast join was used. But I
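For reference, the threshold mentioned here is a SQLConf setting; a small sketch of raising it so that a ~75 MB side can qualify for a broadcast join (table and column names are illustrative):

// Default is 10 MB; raise it to 100 MB so the smaller side may be broadcast.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

val joined = sqlContext.sql(
  "SELECT f.*, d.label FROM fact_table f JOIN dim_table d ON f.pk = d.pk")
joined.explain()  // look for BroadcastHashJoin in the physical plan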

Neither previous window has value for key, nor new values found.

2016-06-10 Thread Marco1982
Hi all, I'm running a Spark Streaming application that uses reduceByKeyAndWindow(). The window interval is 2 hours, while the slide interval is 1 hour. I have a JavaPairRDD in which both keys and values are strings. Each time the reduceByKeyAndWindow() function is called, it uses appendString()

Updated Spark logo

2016-06-10 Thread Matei Zaharia
Hi all, FYI, we've recently updated the Spark logo at https://spark.apache.org/ to say "Apache Spark" instead of just "Spark". Many ASF projects have been doing this recently to make it clearer that they are associated with the ASF, and indeed the ASF's branding guidelines generally require

DataFrame.foreach(scala.Function1) example

2016-06-10 Thread Mohammad Tariq
Dear fellow spark users, Could someone please point me to any example showcasing the usage of DataFrame.foreach(scala.Function1) in Java? Problem statement: I am reading data from a Kafka topic, and for each RDD in the DStream I am creating a DataFrame in order to perform some operations.

Pls assist: Spark DecisionTree question

2016-06-10 Thread Marco Mistroni
Hi all, I am trying to run an ML program against some data, using DecisionTrees. To fine-tune the parameters, I am running this loop to find the optimal values for impurity, depth and bins: for (impurity <- Array("gini", "entropy"); depth <- Array(1,2,3, 4, 5); bins <-
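A hedged sketch of this kind of grid search with the MLlib DecisionTree API (the data preparation, bin values and accuracy metric are placeholders, not taken from the original program):

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// trainingData / testData: RDD[LabeledPoint] prepared elsewhere (assumption).
def tune(trainingData: RDD[LabeledPoint], testData: RDD[LabeledPoint]): Unit = {
  for (impurity <- Array("gini", "entropy");
       depth    <- Array(1, 2, 3, 4, 5);
       bins     <- Array(10, 32, 100)) {
    // Binary classifier with no categorical features; the positional arguments are
    // (input, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins).
    val model = DecisionTree.trainClassifier(
      trainingData, 2, Map[Int, Int](), impurity, depth, bins)
    val accuracy = testData
      .map(p => (model.predict(p.features), p.label))
      .filter { case (pred, actual) => pred == actual }
      .count().toDouble / testData.count()
    println(s"impurity=$impurity depth=$depth bins=$bins accuracy=$accuracy")
  }
}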

Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gavin Yue
Yes, because in the second query you did a (select PK from A) A. I guess it could be that the subquery makes the results much smaller and enables the broadcast join, so it is much faster. You could use sql.describe() to check the execution plan. On Fri, Jun 10, 2016 at 1:41 AM, Gourav Sengupta
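For what it's worth, in the DataFrame API explain() is the call that prints the plan; a minimal sketch of checking which join was chosen (A and B are placeholder table names):

val q = sqlContext.sql(
  "SELECT a.pk, b.some_col FROM (SELECT pk FROM A) a JOIN B b ON a.pk = b.pk")
q.explain(true)  // extended plan: look for BroadcastHashJoin vs SortMergeJoin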

Cleaning spark memory

2016-06-10 Thread Cesar Flores
Hello: Sometimes I cache data frames to memory that I forget to unpersist, losing the variable reference in the process. Is there a way of: (i) recovering references to data frames that are still persisted in memory, OR (ii) just unpersisting all Spark cached variables?

Re: Long Running Spark Streaming getting slower

2016-06-10 Thread Mich Talebzadeh
Right, without knowing exactly what the code is, it is difficult to say. Do you analyze the stuff from your Spark GUI, for example looking at the amount of spillage and spill size as the DAG diagram shows below? After three days is a short period of time, so it is concerning! HTH P.S. What is

Re: Long Running Spark Streaming getting slower

2016-06-10 Thread John Simon
Hi Mich, batch interval is 10 seconds, and I don't use sliding window. Typical message count per batch is ~100k. -- John Simon On Fri, Jun 10, 2016 at 10:31 AM, Mich Talebzadeh wrote: > Hi John, > > I did not notice anything unusual in your env variables. > >

Re: Error writing parquet to S3

2016-06-10 Thread Peter Halliday
Has anyone else seen this before? Previously when I saw this there was an OOM, but that doesn’t seem to be the case here. Of course, I’m not sure how large the file that created this was either. Peter > On Jun 9, 2016, at 9:00 PM, Peter Halliday wrote: > > I’m not 100% sure why I’m getting

Re: Long Running Spark Streaming getting slower

2016-06-10 Thread Mich Talebzadeh
Hi John, I did not notice anything unusual in your env variables. However, what are the batch interval, the window length and the sliding interval? Also, how many messages are sent by Kafka in a typical batch interval? HTH Dr Mich Talebzadeh LinkedIn *

Long Running Spark Streaming getting slower

2016-06-10 Thread john.simon
Hi all, I'm running Spark Streaming with the Kafka Direct Stream, but after running for a couple of days the batch processing time almost doubles. I didn't find any slowdown in the JVM GC logs, but I did find the Spark broadcast variable reading time increasing. Initially it takes less than 10ms, but after

SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Hi Spark Users, I hope everyone here is doing great. I am trying to read data from SAS through Spark SQL and write it into HDFS. Initially, I started with a pure Java program; please find the program and logs in the attached file sas_pure_java.txt. My program ran successfully and it returned the
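For context, the generic shape of such a JDBC read in Spark SQL is sketched below; every connection detail is a placeholder, and the SAS driver class and URL format are assumptions rather than values taken from the attachment:

val props = new java.util.Properties()
props.setProperty("user", "sasuser")
props.setProperty("password", "***")
props.setProperty("driver", "com.sas.rio.MVADriver")  // assumed SAS IOM JDBC driver class

val sasDF = sqlContext.read.jdbc(
  "jdbc:sasiom://sashost:8591",   // assumed URL format
  "MYLIB.MYDATASET",              // placeholder library.dataset
  props)

sasDF.write.mode("overwrite").parquet("hdfs:///tmp/sas_dataset")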

Re: word2vec: how to save an mllib model and reload it?

2016-06-10 Thread sharad82
I am having a problem serializing an ML word2vec model. Am I doing something wrong? http://stackoverflow.com/questions/37723308/spark-ml-word2vec-serialization-issues
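For the mllib variant named in the subject line, a minimal save-and-reload sketch (paths and parameters are placeholders; the linked Stack Overflow question concerns the spark.ml wrapper, whose persistence support differs by Spark version):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

// sentences: RDD[Seq[String]] of tokenised text, prepared elsewhere (assumption).
def trainSaveReload(sc: SparkContext, sentences: RDD[Seq[String]]): Word2VecModel = {
  val model = new Word2Vec().setVectorSize(100).fit(sentences)
  model.save(sc, "hdfs:///models/word2vec")            // placeholder path
  Word2VecModel.load(sc, "hdfs:///models/word2vec")    // reload the saved model
}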

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Holden Karau
So that's a bit complicated - you might want to start with reading the code for the existing algorithms and go from there. If your goal is to contribute the algorithm to Spark you should probably take a look at the JIRA as well as the contributing to Spark guide on the wiki. Also we have a

Re: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Kazuaki Ishizaki
Hi, Yin Huai's slides are available at http://www.slideshare.net/databricks/deep-dive-into-catalyst-apache-spark-20s-optimizer Kazuaki Ishizaki From: Takeshi Yamamuro To: Srinivasan Hariharan02 Cc: "user@spark.apache.org"

Java MongoDB Spark Stratio (Please give me a hint)

2016-06-10 Thread Umair Janjua
Hi, here is my code. When I run this program it gets stuck at the sqlContext.read().format("com.stratio.datasource.mongodb").options(options).load(); line and then it does not proceed forward. Nothing happens after that. What should I do? How can I debug it? I am stuck here. Any hint would be

RE: Saving Parquet files to S3

2016-06-10 Thread Ankur Jain
Thanks maropu.. It worked… From: Takeshi Yamamuro [mailto:linguin@gmail.com] Sent: 10 June 2016 11:47 AM To: Ankur Jain Cc: user@spark.apache.org Subject: Re: Saving Parquet files to S3 Hi, You'd be better off setting `parquet.block.size`. // maropu On Thu, Jun 9, 2016 at 7:48 AM, Daniel

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Hi all, how do I add a new ML algorithm to Spark MLlib? On Fri, Jun 10, 2016 at 12:50 PM, Ram Krishna wrote: > Hi All, > > I am new to this field, I want to implement a new ML algo using Spark > MLlib. What is the procedure. > > -- > Regards, > Ram Krishna KT > > > > > >

RE: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Srinivasan Hariharan02
Thanks Mich for your reply. I am curious to know one thing: Hive uses a CBO which takes CPU cost into account. Does the Hive optimizer have any advantage over the Spark Catalyst optimizer? Regards, Srinivasan Hariharan +91-9940395830 From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent:

Re: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Mich Talebzadeh
In an SMP system such as Oracle or Sybase, the CBO will take into account LIO, PIO and CPU costing, or use some empirical costing. In a distributed system like Spark, with so many nodes, that may not be that easy, or its contribution to the Catalyst decision may be subject to variations that may not

Re: JavaDStream to Dataframe: Java

2016-06-10 Thread Alexander Krasheninnikov
Hello! While operating on the JavaDStream you may use the transform() or foreach() methods, which give you access to an RDD. JavaDStream dataFrameStream = ctx.textFileStream("source").transform(new Function2() { @Override public JavaRDD call(JavaRDD incomingRdd, Time
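A Scala sketch of the same pattern, for reference (the original question is about Java; the source path and case class here are illustrative, and foreachRDD is used to turn each batch RDD into a DataFrame):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.StreamingContext

case class Record(line: String)  // illustrative one-column schema

// ssc: a StreamingContext built elsewhere (assumption).
def process(ssc: StreamingContext): Unit = {
  val lines = ssc.textFileStream("hdfs:///incoming")   // placeholder source
  lines.foreachRDD { rdd =>
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._
    val df = rdd.map(Record(_)).toDF()
    df.registerTempTable("records")
    sqlContext.sql("SELECT count(*) AS n FROM records").show()
  }
}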

Re: Spark Getting data from MongoDB in JAVA

2016-06-10 Thread Asfandyar Ashraf Malik
Hi, I did not notice that I put it twice. I changed that and ran my program but it still gives the same error: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.ScalaReflection$.typeOfObject()Lscala/PartialFunction; Cheers 2016-06-10 11:47 GMT+02:00 Alonso Isidoro Roman

RE: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Srinivasan Hariharan02
Thanks Takeshi. Is there any reason for not using I/O and CPU cost in the Catalyst optimizer? Some SQL engines which leverage Apache Calcite have a cost-based planner, like VolcanoPlanner, which takes CPU and I/O cost into account for plan optimization. Regards, Srinivasan Hariharan +91-9940395830 From: Takeshi Yamamuro

Re: Spark Getting data from MongoDB in JAVA

2016-06-10 Thread Alonso Isidoro Roman
Why is the spark-mongodb_2.11 dependency written twice in pom.xml? Alonso Isidoro Roman about.me/alonso.isidoro.roman 2016-06-10 11:39 GMT+02:00 Asfandyar Ashraf Malik

Spark Getting data from MongoDB in JAVA

2016-06-10 Thread Asfandyar Ashraf Malik
Hi, I am using the Stratio library to get MongoDB to work with Spark, but I get the following error: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.ScalaReflection This is my code. --- public static

Re: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Takeshi Yamamuro
Hi, there is no way to retrieve that information in Spark. In fact, the current optimizer only considers the byte size of outputs in LogicalPlan. The related code can be found in

Re: HIVE Query 25x faster than SPARK Query

2016-06-10 Thread Gourav Sengupta
Hi, I think if we try to see why Query 2 is faster than Query 1, then all the answers will be given without beating around the bush. That is the right way to find out what is happening and why. Regards, Gourav On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue wrote: > Could

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Holden Karau
Hi Ram, Not super certain what you are looking to do. Are you looking to add a new algorithm to Spark MLlib for streaming, or to use Spark MLlib on streaming data? Cheers, Holden On Friday, June 10, 2016, Ram Krishna wrote: > Hi All, > > I am new to this field, I

Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Ram Krishna
Hi All, I am new to this field; I want to implement a new ML algo using Spark MLlib. What is the procedure? -- Regards, Ram Krishna KT

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-10 Thread Bijay Pathak
Hello, Looks like you are hitting this: https://issues.apache.org/jira/browse/HIVE-11940. Thanks, Bijay On Thu, Jun 9, 2016 at 9:25 PM, Mich Talebzadeh wrote: > Can you provide a code snippet of how you are populating the target table > from the temp table. > > > HTH

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-10 Thread Daniel Haviv
I'm using EC2 instances Thank you. Daniel > On 9 Jun 2016, at 16:49, Gourav Sengupta wrote: > > Hi, > > are you using EC2 instances or local cluster behind firewall. > > > Regards, > Gourav Sengupta > >> On Wed, Jun 8, 2016 at 4:34 PM, Daniel Haviv >>

RE: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-06-10 Thread Ravi Aggarwal
Hi Ted, Thanks for the reply. Here is the code. Btw, df.count is running fine on the dataframe generated from this default source. I think it is something in the combination of the join and the hbase data source that is creating the issue, but I am not entirely sure. I have also dumped the physical plans of both

[no subject]

2016-06-10 Thread pooja mehta
Hi, How do I use a Scala UDF from the Beeline client? In the spark shell, we register our UDF like this: sqlContext.udf.register(). What is the way to use a UDF in the Beeline client? Thanks Pooja
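For reference, the spark-shell registration described above looks roughly like the sketch below (the function name is illustrative); exposing the same UDF to Beeline additionally requires registering it in the context backing the Thrift server, or packaging it as a Hive UDF, which is not shown here:

// Register a Scala function as a temporary UDF in the current SQLContext.
sqlContext.udf.register("to_upper", (s: String) => if (s == null) null else s.toUpperCase)

sqlContext.sql("SELECT to_upper('hello') AS shout").show()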

Catalyst optimizer cpu/Io cost

2016-06-10 Thread Srinivasan Hariharan02
Hi, How can I get the Spark SQL query CPU and I/O cost after optimizing for the best logical plan? Is there any API to retrieve this information? Could anyone point me to the code where the CPU and I/O cost is actually computed in the Catalyst module? Regards, Srinivasan Hariharan +91-9940395830

Re: Saving Parquet files to S3

2016-06-10 Thread Takeshi Yamamuro
Hi, You'd be better off setting `parquet.block.size`. // maropu On Thu, Jun 9, 2016 at 7:48 AM, Daniel Siegmann wrote: > I don't believe there's any way to output files of a specific size. What > you can do is partition your data into a number of partitions such that
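A small sketch of applying that setting in Scala before a write (256 MB here; df and the output path are placeholders):

// parquet.block.size controls the Parquet row-group size used by the writer.
sc.hadoopConfiguration.setInt("parquet.block.size", 256 * 1024 * 1024)
// Optionally keep the filesystem block size in line with it.
sc.hadoopConfiguration.setInt("dfs.blocksize", 256 * 1024 * 1024)

df.write.mode("overwrite").parquet("s3a://my-bucket/output/")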