Re: failure notice

2015-10-06 Thread Renyi Xiong
Yes, it can recover on a different node. It uses a write-ahead log, checkpoints offsets of both ingress and egress (e.g. using ZooKeeper and/or Kafka), and relies on the streaming engine's deterministic operations. By replaying back a certain range of data based on the checkpointed ingress offset (at
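A minimal sketch of wiring up those two pieces (checkpointing and the receiver write-ahead log) in Spark Streaming; the checkpoint directory, app name, and batch interval are assumptions, and the actual DStream graph is omitted:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // a minimal sketch; checkpointDir and the batch interval are made up
    val checkpointDir = "hdfs:///tmp/streaming-checkpoints"

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("recoverable-stream")
        .set("spark.streaming.receiver.writeAheadLog.enable", "true") // WAL for received data
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir) // metadata and offsets survive a driver restart
      // ... build the DStream graph here ...
      ssc
    }

    // on (re)start, recover from the checkpoint if one exists, otherwise create fresh
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()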

FW: Spark error while running in spark mode

2015-10-06 Thread Ratika Prasad
From: Ratika Prasad Sent: Monday, October 05, 2015 2:39 PM To: u...@spark.apache.org Cc: Ameeta Jayarajan Subject: Spark error while running in spark mode Hi, When we run our Spark component in cluster mode as below, we get the following error ./bin/spark-submit

Pyspark dataframe read

2015-10-06 Thread Blaž Šnuderl
Hello everyone. It seems pyspark dataframe read is broken for reading multiple files. sql.read.json("file1,file2") fails with java.io.IOException: No input paths specified in job. This used to work in Spark 1.4 and also still works with sc.textFile. Blaž

How can I access data on RDDs?

2015-10-06 Thread jatinganhotra
Consider the following 2 scenarios: *Scenario #1* val pagecounts = sc.textFile("data/pagecounts") pagecounts.checkpoint pagecounts.count *Scenario #2* val pagecounts = sc.textFile("data/pagecounts") pagecounts.count The total time shown in the Spark shell Application UI was different for both
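Not from the thread itself, but the usual explanation for that gap is that the first action after checkpoint() also triggers the checkpoint write, and without a persist() the lineage is recomputed for it. A small sketch of the persist-before-checkpoint pattern the RDD docs recommend (the checkpoint directory is an assumption):

    // a small sketch; the checkpoint directory is made up
    sc.setCheckpointDir("/tmp/spark-checkpoints")
    val pagecounts = sc.textFile("data/pagecounts")
    pagecounts.persist()   // avoid recomputing the lineage when the checkpoint is written
    pagecounts.checkpoint()
    pagecounts.count()     // first action: runs the count and materializes the checkpoint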

Adding Spark Testing functionality

2015-10-06 Thread Holden Karau
Hi Spark Devs, So this has been brought up a few times before, and generally on the user list people get directed to use spark-testing-base. I'd like to start moving some of spark-testing-base's functionality into Spark so that people don't need a library to do what is (hopefully :p) a very
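For context, a rough sketch of the kind of boilerplate spark-testing-base removes today; the suite and test data are made up, and SharedSparkContext is assumed to supply the `sc` used below:

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    // a rough sketch; the suite and data are made up
    class WordCountSuite extends FunSuite with SharedSparkContext {
      test("counts words across lines") {
        val lines = sc.parallelize(Seq("hello world", "hello spark"))
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collectAsMap()
        assert(counts("hello") === 2)
      }
    }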

Re: Pyspark dataframe read

2015-10-06 Thread Koert Kuipers
I ran into the same thing in the Scala API. We depend heavily on comma-separated paths, and it no longer works. On Tue, Oct 6, 2015 at 3:02 AM, Blaž Šnuderl wrote: > Hello everyone. > > It seems pyspark dataframe read is broken for reading multiple files. > > sql.read.json(
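Until the regression is sorted out, one workaround sketch (not from the thread) is to read each path separately and union the results:

    // a workaround sketch, not from the thread; `paths` is made up
    val paths = Seq("file1", "file2")
    val df = paths.map(p => sqlContext.read.json(p)).reduce(_ unionAll _)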

Re: Pyspark dataframe read

2015-10-06 Thread Reynold Xin
I think the problem is that a comma is actually a legitimate character in a file name, and as a result ... On Tuesday, October 6, 2015, Josh Rosen wrote: > Could someone please file a JIRA to track this? > https://issues.apache.org/jira/browse/SPARK > > On Tue, Oct 6, 2015 at

Re: Pyspark dataframe read

2015-10-06 Thread Koert Kuipers
I personally find the comma-separated paths feature much more important than commas in paths (which one could argue you should avoid). But assuming people want to keep commas as legitimate characters in paths: https://issues.apache.org/jira/browse/SPARK-10185

Re: StructType has more rows, than corresponding Row has objects.

2015-10-06 Thread Eugene Morozov
Davies, that seemed to be my issue; my colleague helped me resolve it. The problem was that we build the RDD and the corresponding StructType ourselves (no JSON, Parquet, Cassandra, etc. - we take a list of business objects and convert them to Rows, then infer the struct type) and I missed one thing.
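For anyone hitting the same thing, the invariant is that every Row must carry exactly one value per StructField, in schema order. A small sketch of that pattern (the business object and its fields are made up):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    // a small sketch; the business object and its fields are made up
    case class Customer(id: Long, name: String)
    val objects = Seq(Customer(1L, "a"), Customer(2L, "b"))

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)))

    // one value per StructField, in the same order as the schema
    val rows = sc.parallelize(objects).map(o => Row(o.id, o.name))
    val df = sqlContext.createDataFrame(rows, schema)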

Re: Pyspark dataframe read

2015-10-06 Thread Josh Rosen
Could someone please file a JIRA to track this? https://issues.apache.org/jira/browse/SPARK On Tue, Oct 6, 2015 at 1:21 AM, Koert Kuipers wrote: > i ran into the same thing in scala api. we depend heavily on comma > separated paths, and it no longer works. > > > On Tue, Oct

Re: failure notice

2015-10-06 Thread Tathagata Das
Unfortunately, there is not an obvious way to do this. I am guessing that you want to partition your stream such that the same keys always go to the same executor, right? You could do it by writing a custom RDD. See ShuffledRDD
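A partial sketch of the partitioning half of this; it pins keys to partitions, not partitions to executors, which is the part with no obvious answer. The pair DStream and partition count are assumptions:

    import org.apache.spark.HashPartitioner

    // a partial sketch; `pairStream` (a DStream of key/value pairs) and the
    // partition count are made up. This fixes the key -> partition mapping,
    // not the partition -> executor mapping.
    val partitioner = new HashPartitioner(8)
    val colocated = pairStream.transform(rdd => rdd.partitionBy(partitioner))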

multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
The current implementation of multiple count distinct in a single query is very inferior in terms of performance and robustness, and it is also hard to guarantee correctness of the implementation in some of the refactorings for Tungsten. Supporting a better version of it is possible in the future,

Re: multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
To provide more context, if we do remove this feature, the following SQL query would throw an AnalysisException: select count(distinct colA), count(distinct colB) from foo; The following should still work: select count(distinct colA) from foo; The following should also work: select
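If the exact multi-column form were removed, one workaround sketch (an assumption on my part, not a proposal from this thread) is to compute each distinct count in its own aggregation, or fall back to the approximate variant; `df` and the column names are made up:

    import org.apache.spark.sql.functions.{approxCountDistinct, countDistinct}

    // a workaround sketch; `df` and the column names are made up
    val exactA = df.select(countDistinct("colA")).first().getLong(0)
    val exactB = df.select(countDistinct("colB")).first().getLong(0)

    // or approximate counts for both columns in a single aggregation
    df.agg(approxCountDistinct("colA"), approxCountDistinct("colB")).show()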

Re: Adding Spark Testing functionality

2015-10-06 Thread Patrick Wendell
Hey Holden, It would be helpful if you could outline the set of features you'd imagine being part of Spark in a short doc. I didn't see a README on the existing repo, so it's hard to know exactly what is being proposed. As a general point of process, we've typically avoided merging modules into

Re: Adding Spark Testing functionality

2015-10-06 Thread Holden Karau
I'll put together a Google doc and send that out (in the meantime, a quick guide to how the current package can be used is in the blog post I did at http://blog.cloudera.com/blog/2015/09/making-apache-spark-testing-easy-with-spark-testing-base/ ). If people think it's better to keep as a

Re: SparkR dataframe UDF

2015-10-06 Thread Hossein
User-defined functions written in R are not supported yet. You can implement your UDF in Scala, register it in sqlContext and use it in SparkR, provided that you share your context between R and Scala. --Hossein On Friday, October 2, 2015, Renyi Xiong wrote: > Hi Shiva, >
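A rough sketch of the Scala side of that pattern; the UDF, table, and column names are made up, and the SparkR call is shown only as a comment:

    // Scala side, on the sqlContext that is shared with SparkR (names are made up)
    sqlContext.udf.register("toUpper", (s: String) => s.toUpperCase)

    // From SparkR, against the same context (R syntax, shown as a comment):
    //   registerTempTable(df, "people")
    //   head(sql(sqlContext, "SELECT toUpper(name) FROM people"))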

Re: CQs on WindowedStream created on running StreamingContext

2015-10-06 Thread Yogesh Mahajan
Anyone know about this? TD? -yogesh > On 30-Sep-2015, at 1:25 pm, Yogs wrote: > > Hi, > > We intend to run ad hoc windowed continuous queries on Spark Streaming data. > The queries could be registered/deregistered dynamically or can be submitted > through
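One sketch of how windowed queries are commonly run today without dynamic registration; `stream` is assumed to be a DStream[(String, Long)], and `sqlContext`, the columns, and the query are made up:

    import org.apache.spark.streaming.Seconds

    // a sketch; `stream`, `sqlContext`, and the query are made up
    val windowed = stream.window(Seconds(60), Seconds(10))
    windowed.foreachRDD { rdd =>
      import sqlContext.implicits._
      val df = rdd.toDF("key", "value")
      df.registerTempTable("events")
      sqlContext.sql("SELECT key, sum(value) AS total FROM events GROUP BY key").show()
    }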