Re: Unable to run Python Unit tests

2016-03-21 Thread Gayathri Murali
commit f3717fc7c97ea402c9ddf9020405070421eeb4a4 Thanks Gayathri On Mon, Mar 21, 2016 at 8:09 PM, Ted Yu wrote: > Can you tell us the commit hash your workspace is based on ? > > On Mon, Mar 21, 2016 at 8:05 PM, Gayathri Murali < > gayathri.m.sof...@gmail.com> wrote: > >>

Re: Unable to run Python Unit tests

2016-03-21 Thread Ted Yu
Can you tell us the commit hash your workspace is based on ? On Mon, Mar 21, 2016 at 8:05 PM, Gayathri Murali < gayathri.m.sof...@gmail.com> wrote: > Hi All, > > I am trying to run ./python/run-tests on my local master branch. I am > getting the following error. I have run this multiple times

Unable to run Python Unit tests

2016-03-21 Thread Gayathri Murali
Hi All, I am trying to run ./python/run-tests on my local master branch. I am getting the following error. I have run this multiple times before and never had this issue. Can someone please help? Error: Could not find or load main class org.apache.spark.deploy.SparkSubmit ERROR

Re: Work out date column in CSV more than 6 months old (datediff or something)

2016-03-21 Thread Silvio Fiorito
There’s a months_between function you could use, as well: df.filter(months_between(current_date, $"Payment Date") > 6).show From: Mich Talebzadeh > Date: Monday, March 21, 2016 at 5:53 PM To: "user @spark"
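
For the archive, a self-contained sketch of that filter (the column name follows the CSV in this thread; to_date assumes the string parses as a date, otherwise convert it first):

    import org.apache.spark.sql.functions.{col, current_date, months_between, to_date}

    // months_between and current_date are available from Spark 1.5 onwards.
    val overSixMonths = df.filter(
      months_between(current_date(), to_date(col("Payment date"))) > 6
    )
    overSixMonths.show()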

Re: cluster randomly re-starting jobs

2016-03-21 Thread Roberto Pagliari
Yes you are right. The job failed and it was re-attempting. Thank you, From: Daniel Siegmann > Date: Monday, 21 March 2016 21:33 To: Ted Yu > Cc: Roberto Pagliari

pyspark sql convert long to timestamp?

2016-03-21 Thread Andy Davidson
I have a col in a data frame that is of type long. Any idea how I create a column whose type is timestamp? The long is a unix epoch in ms. Thanks Andy
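
A sketch of one common approach, in Scala (the PySpark API mirrors it); the column name "epoch_ms" is a placeholder:

    import org.apache.spark.sql.functions.col

    // Casting a number to timestamp treats it as seconds, hence the /1000.
    val withTs = df.withColumn("ts", (col("epoch_ms") / 1000).cast("timestamp"))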

ALS setIntermediateRDDStorageLevel

2016-03-21 Thread Roberto Pagliari
According to this thread http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-ALS-question-td15420.html There should be a function to set intermediate storage level in ALS. However, I'm getting method not found with Spark 1.6. Is it still available? If so, can I get to see a minimal
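
For reference, a hedged sketch of where that setter lives; one likely cause of "method not found" is importing the wrong ALS class:

    import org.apache.spark.mllib.recommendation.ALS
    import org.apache.spark.storage.StorageLevel

    // The setter is on the RDD-based spark.mllib ALS builder; the
    // DataFrame-based spark.ml ALS in 1.6 does not expose it, which can
    // look like "method not found" if the wrong class is imported.
    val als = new ALS()
      .setRank(10)
      .setIterations(10)
      .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)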

Work out date column in CSV more than 6 months old (datediff or something)

2016-03-21 Thread Mich Talebzadeh
Hi, For test purposes I am reading in a simple csv file as follows: val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2") df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net:
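
Per the subject line, a sketch of the datediff route (six months approximated as 183 days; assumes "Payment date" parses as a date). See also the months_between reply above.

    import org.apache.spark.sql.functions.{col, current_date, datediff, to_date}

    // datediff(end, start) returns the number of days between two dates.
    val older = df.filter(datediff(current_date(), to_date(col("Payment date"))) > 183)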

Re: cluster randomly re-starting jobs

2016-03-21 Thread Daniel Siegmann
Never used Ambari and I don't know if this is your problem, but I have seen similar behavior. In my case, my application failed and Hadoop kicked off a second attempt. I didn't realize this, but when I refreshed the Spark UI, suddenly everything seemed reset! This is because the application ID is

Re: SparkML algos limitations question.

2016-03-21 Thread Joseph Bradley
The indexing I mentioned is more restrictive than that: each index corresponds to a unique position in a binary tree. (I.e., the first index of row 0 is 1, the first of row 1 is 2, the first of row 2 is 4, etc., IIRC) You're correct that this restriction could be removed; with some careful
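
A small illustration of the arithmetic behind that limit, as I read the description (my own sketch, not Spark's code):

    // Row r of the tree starts at node index 2^r, so a tree of depth d
    // needs indices up to 2^(d+1) - 1; with Int indices that overflows
    // past depth ~30, which matches the documented depth limit.
    def firstIndexOfRow(r: Int): Long = 1L << r

    firstIndexOfRow(0)  // 1
    firstIndexOfRow(1)  // 2
    firstIndexOfRow(2)  // 4
    firstIndexOfRow(30) // 1073741824, already 2^30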

Re: sliding Top N window

2016-03-21 Thread Lars Albertsson
Hi, If you can accept approximate top N results, there is a neat solution for this problem: Use an approximate Map structure called Count-Min Sketch, in combination with a list of the M top items, where M > N. When you encounter an item not in the top M, you look up its count in the
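
An illustrative sketch of the scheme described above (not production code; the depth, width, and M values are arbitrary):

    import scala.util.hashing.MurmurHash3

    // Count-Min Sketch: a depth x width table of counters; each row hashes
    // the item with a different seed, and the estimate is the row minimum.
    class CountMinSketch(depth: Int, width: Int) {
      private val table = Array.ofDim[Long](depth, width)
      private def bucket(item: String, row: Int): Int = {
        val h = MurmurHash3.stringHash(item, row) % width
        if (h < 0) h + width else h
      }
      def add(item: String): Unit =
        for (r <- 0 until depth) table(r)(bucket(item, r)) += 1
      def estimate(item: String): Long =
        (0 until depth).map(r => table(r)(bucket(item, r))).min
    }

    val m = 100 // track the top M items, with M larger than the N you report
    val sketch = new CountMinSketch(depth = 5, width = 10000)
    var topM = Map.empty[String, Long]

    def observe(item: String): Unit = {
      sketch.add(item)
      val est = sketch.estimate(item)
      // Admit the item if the list isn't full or its estimate beats the smallest.
      if (topM.contains(item) || topM.size < m || est > topM.values.min) {
        topM += item -> est
        if (topM.size > m) topM -= topM.minBy(_._2)._1
      }
    }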

Re: Can not kill driver properly

2016-03-21 Thread Shixiong(Ryan) Zhu
Could you post the log of Master? On Mon, Mar 21, 2016 at 9:25 AM, Hao Ren wrote: > Update: > > I am using --supervise flag for fault tolerance. > > > > On Mon, Mar 21, 2016 at 4:16 PM, Hao Ren wrote: > >> Using spark 1.6.1 >> Spark Streaming Jobs are

Re: Building spark submodule source code

2016-03-21 Thread Jakob Odersky
Another gotcha to watch out for are the SPARK_* environment variables. Have you exported SPARK_HOME? In that case, 'spark-shell' will use Spark from the variable, regardless of the place the script is called from. I.e. if SPARK_HOME points to a release version of Spark, your code changes will

Re: Limit pyspark.daemon threads

2016-03-21 Thread Carlile, Ken
No further input on this? I discovered today that the pyspark.daemon thread count was actually 48, which makes a little more sense (at least it’s a multiple of 16), and it seems to be happening at the reduce and collect portions of the code. —Ken On Mar 17, 2016, at 10:51 AM,

Merging ML Estimator and Model

2016-03-21 Thread Joseph Bradley
Spark devs & users, I want to bring attention to a proposal to merge the MLlib (spark.ml) concepts of Estimator and Model in Spark 2.0. Please comment & discuss on SPARK-14033 (not in this email thread). TL;DR: Proposal: Merge Estimator

Re: SparkML algos limitations question.

2016-03-21 Thread Eugene Morozov
Hi, Joseph, I thought I understood why it has a limit of 30 levels for the decision tree, but now I'm not that sure. I thought that was because the decision tree is stored in an array, whose length is of type int and so cannot exceed 2^31-1. But here are my new discoveries. I've trained two

Re: Error selecting from a Hive ORC table in Spark-sql

2016-03-21 Thread Mich Talebzadeh
Sounds like this happens with ORC transactional tables: when I create that table as ORC but non-transactional, it works! Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Find all invoices more than 6 months from csv file

2016-03-21 Thread Mich Talebzadeh
Hi, For test purposes I am reading in a simple csv file as follows: val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2") df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net:

Re: Spark SQL Optimization

2016-03-21 Thread gtinside
More details: execution plan for the original query: select distinct pge.portfolio_code from table1 pge join table2 p on p.perm_group = pge.anc_port_group join table3 uge on p.user_group=uge.anc_user_group where uge.user_name = 'user' and p.perm_type = 'TEST' == Physical Plan ==

Re: Spark SQL Optimization

2016-03-21 Thread Xiao Li
Hi, Maybe you can open a JIRA and upload your plan as Michael suggested. This is an interesting feature. Thanks! Xiao Li 2016-03-21 10:36 GMT-07:00 Michael Armbrust : > It's helpful if you can include the output of EXPLAIN EXTENDED or > df.explain(true) whenever asking

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-21 Thread Michael Armbrust
> > But when I tried using Spark Streaming I could not find a way to store the > data with the avro schema information. The closest that I got was to create > a Dataframe using the json RDDs and store them as parquet. Here the parquet > files had a spark specific schema in their footer. > Does this

Re: Spark SQL Optimization

2016-03-21 Thread Michael Armbrust
It's helpful if you can include the output of EXPLAIN EXTENDED or df.explain(true) whenever asking about query performance. On Mon, Mar 21, 2016 at 6:27 AM, gtinside wrote: > Hi , > > I am trying to execute a simple query with join on 3 tables. When I look at > the execution

Error selecting from a Hive ORC table in Spark-sql

2016-03-21 Thread Mich Talebzadeh
Hi, Do we know the cause of this error when selecting from a Hive ORC table? spark-sql> select * from t2; 16/03/21 16:38:33 ERROR SparkSQLDriver: Failed in [select * from t2] java.lang.RuntimeException: serious problem at

Re: spark shuffle service on yarn

2016-03-21 Thread Marcelo Vanzin
If you use any shuffle service before 2.0 it should be compatible with all previous releases. The 2.0 version currently has an incompatibility that we should probably patch before releasing 2.0, to support this kind of use case (among others). On Fri, Mar 18, 2016 at 7:25 PM, Koert Kuipers

Re: Can not kill driver properly

2016-03-21 Thread Hao Ren
Update: I am using --supervise flag for fault tolerance. On Mon, Mar 21, 2016 at 4:16 PM, Hao Ren wrote: > Using spark 1.6.1 > Spark Streaming Jobs are submitted via spark-submit (cluster mode) > > I tried to kill drivers via webUI, it does not work. These drivers are >

Extending Spark Catalyst optimizer with own rules

2016-03-21 Thread tolyasik
I want to use Catalyst rules to transform a star-schema SQL query into a query against a denormalized star schema, where some fields from the dimension tables are represented in the fact table. I tried to find extension points for adding my own rules to perform the transformation described above, but I didn't find any
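
One hook worth knowing about is the experimental optimizer extension point (available in Spark 2.x; 1.x exposes only experimental.extraStrategies). These are internal, evolving APIs, so treat this as a sketch with a placeholder rule:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Placeholder: the star-schema -> denormalized rewrite would go in apply().
    object DenormalizeStarSchema extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    val spark = SparkSession.builder().getOrCreate()
    // Registers the rule with the Catalyst optimizer (Spark 2.x).
    spark.experimental.extraOptimizations = Seq(DenormalizeStarSchema)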

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-21 Thread Ramkumar Venkataraman
Which is what surprises me as well. I am able to consistently reproduce this on my spark 1.5.2 - the same spark job crashes immediately without checkpointing, but when I enable it, the job continues in spite of the exceptions. On Mon, Mar 21, 2016 at 8:25 PM, Cody Koeninger

Re: Spark Metrics Framework?

2016-03-21 Thread Silvio Fiorito
You could use the metric sources and sinks described here: http://spark.apache.org/docs/latest/monitoring.html#metrics If you want to push the metrics to another system you can define a custom sink. You can also extend the metrics by defining a custom source. From: Mike Sukmanowsky

Spark Metrics Framework?

2016-03-21 Thread Mike Sukmanowsky
We make extensive use of the elasticsearch-hadoop library for Hadoop/Spark. In trying to troubleshoot our Spark applications, it'd be very handy to have access to some of the many metrics that the library makes available

HADOOP_HOME or hadoop.home.dir are not set

2016-03-21 Thread Hari Krishna Dara
I am using Spark 1.5.2 in yarn mode with Hadoop 2.6.0 (cdh5.4.2) and I am consistently seeing the below exception in the map container logs for Spark jobs (full stacktrace at the end of the message): java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set. at

Re: Zip File and XML parsing with Spark Streaming

2016-03-21 Thread tjb305
Many thanks I will try that and come back with my findings. Toby On 21 March 2016 at 03:15, firemonk91 [via Apache Spark User List] < ml-node+s1001560n26544...@n3.nabble.com> wrote: > You can write the incoming message to a temp location and use Java > ZipInputStream to unzip the file. You

Can not kill driver properly

2016-03-21 Thread Hao Ren
Using spark 1.6.1 Spark Streaming Jobs are submitted via spark-submit (cluster mode) I tried to kill drivers via webUI, it does not work. These drivers are still running. I also tried: 1. spark-submit --master --kill 2. ./bin/spark-class org.apache.spark.deploy.Client kill Neither works. The

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-21 Thread Cody Koeninger
Spark streaming in general will retry a batch N times then move on to the next one... off the top of my head, I'm not sure why checkpointing would have an effect on that. On Mon, Mar 21, 2016 at 3:25 AM, Ramkumar Venkataraman wrote: > Thanks Cody for the quick help. Yes,

Re: How to collect data for some particular point in spark streaming

2016-03-21 Thread Cody Koeninger
Kafka doesn't have an accurate time-based index. Your options are to maintain an index yourself, or start at a sufficiently early offset and filter messages. On Mon, Mar 21, 2016 at 7:28 AM, Nagu Kothapalli wrote: > Hi, > > > I Want to collect data from kafka ( json
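
A sketch of the filter-from-an-early-offset approach, assuming each message is JSON with a hypothetical epoch-millis "timestamp" field:

    // Assumes `stream` is a DStream[String] of JSON message values
    // (e.g. from KafkaUtils.createDirectStream) and a cutoff in epoch ms.
    val cutoffMs = 1458518400000L
    val upToCutoff = stream.filter { json =>
      // Crude field extraction for illustration; use a real JSON parser in practice.
      """"timestamp"\s*:\s*(\d+)""".r
        .findFirstMatchIn(json)
        .map(_.group(1).toLong)
        .exists(_ <= cutoffMs)
    }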

Re: Saving model S3

2016-03-21 Thread Ted Yu
Please see this related thread: http://search-hadoop.com/m/q3RTtSYa3F1OT6H=DirectFileOutputCommiter On Mon, Mar 21, 2016 at 7:45 AM, Yasemin Kaya wrote: > Hi Ted, > > I don't understand the issue that you want to learn? Could you be more > clear please? > > > > > >

Re: Saving model S3

2016-03-21 Thread Yasemin Kaya
Hi Ted, I don't understand what it is you want to know. Could you be more clear please? 2016-03-21 15:24 GMT+02:00 Ted Yu : > Was speculative execution enabled ? > > Thanks > > On Mar 21, 2016, at 6:19 AM, Yasemin Kaya wrote: > > Hi, > > I am

Re: Setting up spark to run on two nodes

2016-03-21 Thread Luciano Resende
There is also sbin/start-all.sh and sbin/stop-all.sh, which enable you to start/stop the master and workers all together On Sunday, March 20, 2016, Akhil Das wrote: > You can simply execute the sbin/start-slaves.sh file to start up all slave > processes. Just make sure you

Re: Using lz4 in Kafka seems to be broken by jpountz dependency upgrade in Spark 1.5.x+

2016-03-21 Thread Marcin Kuthan
Hi Stefan Have you got any response from Spark team regarding LZ4 library compatibility? To avoid this kind of problems, lz4 should be shaded in Spark distribution, IMHO. Currently I'm not able to update Spark in my application due to this issue. It is not possible to consume compressed topics

Re: cluster randomly re-starting jobs

2016-03-21 Thread Ted Yu
Can you provide a bit more information? Which release of Spark and YARN? Have you checked the Spark UI / YARN job log to see if there is some clue? Cheers On Mon, Mar 21, 2016 at 6:21 AM, Roberto Pagliari wrote: > I noticed that sometimes the spark cluster seems to restart

Spark SQL Optimization

2016-03-21 Thread gtinside
Hi, I am trying to execute a simple query with a join on 3 tables. When I look at the execution plan, it varies with the position of the tables in the "from" clause. The plan looks more optimized when the table with predicates is placed before the other tables. Original query:

Re: Saving model S3

2016-03-21 Thread Ted Yu
Was speculative execution enabled ? Thanks > On Mar 21, 2016, at 6:19 AM, Yasemin Kaya wrote: > > Hi, > > I am using S3 read data also I want to save my model S3. In reading part > there is no error, but when I save model I am getting this error . I tried to > change the
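
For context: speculative execution launches duplicate task attempts, and two attempts writing the same S3 path can make saves fail or leave partial output; this is a guess at the cause, not a confirmed diagnosis. A sketch of how to make sure it is off:

    import org.apache.spark.SparkConf

    // spark.speculation defaults to false; verify it wasn't enabled elsewhere.
    val conf = new SparkConf().set("spark.speculation", "false")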

cluster randomly re-starting jobs

2016-03-21 Thread Roberto Pagliari
I noticed that sometimes the spark cluster seems to restart the job completely. In the Ambari UI (where I can check jobs/stages) everything that was done up to a certain point is removed, and the job is restarted. Does anyone know what the issue could be? Thank you,

Saving model S3

2016-03-21 Thread Yasemin Kaya
Hi, I am reading data from S3 and I also want to save my model to S3. In the reading part there is no error, but when I save the model I am getting this error. I tried to change from s3n to s3a, but nothing changed; different errors come up. reading

How to collect data for some particular point in spark streaming

2016-03-21 Thread Nagu Kothapalli
Hi, I want to collect data from Kafka (json data, ordered) up to a particular timestamp. Is there any way to do this with Spark Streaming? Please let me know.

RE: Run External R script from Spark

2016-03-21 Thread Sun, Rui
It’s a possible approach. It actually leverages Spark’s parallel execution. PipeRDD’s launching of external processes is just like that in pySpark and SparkR for RDD API. The concern is pipeRDD relies on text based serialization/deserialization. Whether the performance is acceptable actually
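
A minimal sketch of the pipeRDD approach under discussion (the script path and record format are assumptions):

    // Each element of a partition is written as a line to the R process's
    // stdin; each line the script prints becomes an element of the output
    // RDD. This is the text-based serialization overhead noted above.
    val input = sc.parallelize(Seq("1.0,2.0", "3.0,4.0"))
    val forecasts = input.pipe("Rscript --vanilla /path/to/forecast.R")
    forecasts.collect().foreach(println)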

SparkSQL 2.0 snapshot - thrift server behavior

2016-03-21 Thread Raymond Honderdors
Hi, We were running with spark 1.6.x and using the "SHOW TABLES IN 'default'" command to read the list of tables. I have noticed that when I run the same on version 2.0.0 I get an empty result, but when I run "SHOW TABLES" I get the result I am after. Can we get the support back for the "SHOW

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-21 Thread Manivannan Selvadurai
Hi, Which version of Spark are you using? On Mon, Mar 21, 2016 at 12:28 PM, Sebastian Piu wrote: > We use this, but not sure how the schema is stored > > Job job = Job.getInstance(); > ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class); >

java.lang.OutOfMemoryError: Direct buffer memory when using broadcast join

2016-03-21 Thread Dai, Kevin
Hi all, I'm joining a small table (about 200m) with a huge table using a broadcast join; however, Spark throws the following exception: 16/03/20 22:32:06 WARN TransportChannelHandler: Exception in connection from java.lang.OutOfMemoryError: Direct buffer memory at
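
Two knobs commonly relevant to this error, as a sketch rather than a confirmed fix (it assumes the direct-buffer OOM comes from shipping the broadcast over the network layer):

    import org.apache.spark.SparkConf

    // Give the network layer more off-heap room for its direct buffers.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=1g")

    // Or stop broadcasting the ~200 MB table: the threshold is in bytes,
    // and -1 disables automatic broadcast joins entirely.
    // (Assumes an existing sqlContext.)
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")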

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2016-03-21 Thread Ramkumar Venkataraman
Thanks Cody for the quick help. Yes, the exception is happening in the executors during processing. I will look into cloning the KafkaRDD and swallowing the exception. But, something weird is happening: when I enable checkpointing on the job, my job doesn't crash, it happily proceeds with the

Issue with wholeTextFiles

2016-03-21 Thread Sarath Chandra
I'm using Hadoop 1.0.4 and Spark 1.2.0. I'm facing a strange issue. I have a requirement to read a small file from HDFS, and all its content has to be read in one shot. So I'm using the spark context's wholeTextFiles API, passing the HDFS URL for the file. When I try this from a spark shell it's

Re: declare constant as date

2016-03-21 Thread Divya Gehlot
Oh my, I am so silly. I can declare it as a string and cast it to date. My apologies for spamming the mailing list. Thanks, Divya On 21 March 2016 at 14:51, Divya Gehlot wrote: > Hi, > In Spark 1.5.2 > Do we have any utility which converts a constant value as shown
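
For the archive, a sketch of that cast using the Spark 1.5 SQL functions (the filter column is a placeholder):

    import org.apache.spark.sql.functions.{col, lit, to_date}

    // Declare the constant as a string, then cast it to a date column.
    val startDate = to_date(lit("2015-03-02"))
    df.filter(col("payment_date") >= startDate) // "payment_date" is a placeholder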

Re: Best way to store Avro Objects as Parquet using SPARK

2016-03-21 Thread Sebastian Piu
We use this, but not sure how the schema is stored Job job = Job.getInstance(); ParquetOutputFormat.setWriteSupportClass(job, AvroWriteSupport.class); AvroParquetOutputFormat.setSchema(job, schema); LazyOutputFormat.setOutputFormatClass(job, new ParquetOutputFormat().getClass());

Re: Setting up spark to run on two nodes

2016-03-21 Thread Akhil Das
You can simply execute the sbin/start-slaves.sh file to start up all slave processes. Just make sure you have spark installed on the same path on all the machines. Thanks Best Regards On Sat, Mar 19, 2016 at 4:01 AM, Ashok Kumar wrote: > Experts. > > Please your

declare constant as date

2016-03-21 Thread Divya Gehlot
Hi, In Spark 1.5.2, do we have any utility which converts a constant value as shown below, or can we declare a date variable like val start_date :Date = "2015-03-02" val start_date = "2015-03-02" toDate, the way we convert with toInt, toString? I searched for it but couldn't find it. Thanks, Divya

Re: Potential conflict with org.iq80.snappy in Spark 1.6.0 environment?

2016-03-21 Thread Akhil Das
Looks like a jar conflict, could you paste the piece of code? and how your dependency file looks like? Thanks Best Regards On Sat, Mar 19, 2016 at 7:49 AM, vasu20 wrote: > Hi, > > I have some code that parses a snappy thrift file for objects. This code > works fine when run

How to name features and perform custom cross validation in ML

2016-03-21 Thread iguana314
Hello, I'm trying to do a simple linear regression in Spark ML. Below is my data frame along with some sample code and output, done via Spyder on a local Spark cluster. ## Begin Code ## regressionDF.show(5) +-------+--------+ | label|features|

Re: Building spark submodule source code

2016-03-21 Thread Akhil Das
Have a look at the intellij setup https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ Once you have the setup ready, you don't have to recompile the whole stuff every time. Thanks Best Regards On Mon, Mar 21, 2016 at 8:14 AM, Tenghuan He

Re: Error using collectAsMap() in scala

2016-03-21 Thread Akhil Das
What you should be doing is a join, something like this: //Create a key-value pair, the key being column 1 val rdd1 = sc.textFile(file1).map(x => (x.split(",")(0), x.split(","))) //Create a key-value pair, the key being column 2 val rdd2 = sc.textFile(file2).map(x => (x.split(",")(1), x.split(",")))
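
A hedged completion of that sketch, assuming the intent is an inner join on the shared key:

    // Join on the shared key; each value is the full split array from its file.
    val joined = rdd1.join(rdd2) // RDD[(String, (Array[String], Array[String]))]
    joined.take(5).foreach { case (key, (left, right)) =>
      println(s"$key -> ${left.mkString(",")} / ${right.mkString(",")}")
    }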

Run External R script from Spark

2016-03-21 Thread sujeet jog
Hi, I have been working on a POC on some time-series related stuff. I'm using Python since I need Spark Streaming and SparkR does not yet have a Spark Streaming front end; a couple of algorithms I want to use are not yet present in the Spark-TS package, so I'm thinking of invoking an external R script for