spark-sql estimates Cassandra table with 3 rows as 8 TB of data

2015-06-17 Thread Serega Sheypak
Hi, spark-sql estimated the input for a Cassandra table with 3 rows as 8 TB. Sometimes it's estimated as -167 B. I run it on a laptop; I don't have 8 TB of space for the data.

Re: Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-17 Thread Cheng Lian
Since SPARK-8406 is serious, we hope to ship it ASAP, possibly next week, but I can't say it's a promise yet. However, you can cherry pick the commit as soon as the fix is merged into branch-1.4. Sorry for the troubles! Cheng On 6/17/15 1:42 AM, Nathan McCarthy wrote: Thanks Cheng. Nice

Spark Shell Hive Context and Kerberos ticket

2015-06-17 Thread Olivier Girardot
Hi everyone, After copying the hive-site.xml from a CDH5 cluster, I can't seem to connect to the hive metastore using spark-shell, here's a part of the stack trace I get : 15/06/17 04:41:57 ERROR TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: GSS initiate failed

Re: Spark 1.4 DataFrame Parquet file writing - missing random rows/partitions

2015-06-17 Thread Nathan McCarthy
Thanks Cheng. Nice find! Let me know if there is anything we can do to help on this end with contributing a fix or testing. Side note - any ideas on the 1.4.1 eta? There are a few bug fixes we need in there. Cheers, Nathan From: Cheng Lian Date: Wednesday, 17 June 2015 6:25 pm To: Nathan,

Re: Can it work to load the MatrixFactorizationModel and predict products with Spark Streaming?

2015-06-17 Thread Yanbo Liang
The logs have told you what caused the error: you cannot invoke RDD transformations and actions inside other transformations. You have not done this explicitly, but the implementation of MatrixFactorizationModel.recommendProducts does, you can refer
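A minimal sketch of the workaround this hints at, with hypothetical model and input paths: load the trained model once on the driver and call recommendProducts there, rather than inside another RDD or DStream transformation.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    object RecommendOnDriver {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("recommend-sketch"))
        // Load the previously trained model on the driver (hypothetical path).
        val model = MatrixFactorizationModel.load(sc, "/models/als")

        // Collect the (small) set of user ids to the driver, then call
        // recommendProducts outside of any RDD/DStream transformation.
        val userIds = sc.textFile("/data/user-ids").map(_.toInt).collect()
        userIds.foreach { userId =>
          val top = model.recommendProducts(userId, 10)
          top.foreach(r => println(s"user=${r.user} product=${r.product} rating=${r.rating}"))
        }
        sc.stop()
      }
    }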

Documentation for external shuffle service in 1.4.0

2015-06-17 Thread Tsai Li Ming
Hi, I can’t seem to find any documentation on this feature in 1.4.0? Regards, Liming

Intermediate stage will be cached automatically?

2015-06-17 Thread canan chen
Here's one simple Spark example where I call RDD#count 2 times. The first time it would invoke 2 stages, but the second one only needs 1 stage. It seems the first stage is cached. Is that true? Is there any flag I can use to control whether to cache the intermediate stage? val data = sc.parallelize(1 to 10,

Re: Spark or Storm

2015-06-17 Thread ayan guha
Great discussion!! One question about a comment: Also, you can do some processing with Kinesis. If all you need to do is a straightforward transformation and you are reading from Kinesis to begin with, it might be an easier option to just do the transformation in Kinesis - Do you mean KCL

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
In that case I assume you need exactly once semantics. There's no out-of-the-box way to do that in Spark. There is updateStateByKey, but it's not practical with your use case as the state is too large (it'll try to dump the entire intermediate state on every checkpoint, which would be

Twitter Heron: Stream Processing at Scale - Does Spark Address all the issues

2015-06-17 Thread Ashish Soni
Hi Sparkers, https://dl.acm.org/citation.cfm?id=2742788 Twitter recently released a paper on Heron as a replacement for Apache Storm, and I would like to know whether Apache Spark currently suffers from the same issues they have outlined. Any input / thoughts will be helpful. Thanks, Ashish

Re: Spark or Storm

2015-06-17 Thread Ashish Soni
A stream can also be processed in micro-batches / batches, which is the main idea behind Spark Streaming, so what is the difference? Ashish On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji eshi...@gmail.com wrote: PS just to elaborate on my first sentence, the reason Spark (not streaming) can offer

Re: Intermediate stage will be cached automatically?

2015-06-17 Thread Eugen Cepoi
Cache is more general. ReduceByKey involves a shuffle step where the data will be in memory and on disk (for whatever doesn't fit in memory). The shuffle files will remain around until the end of the job. The blocks in memory will be dropped if memory is needed for other things. This is an

Re: Spark on EMR

2015-06-17 Thread Eugen Cepoi
It looks like it is a wrapper around https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark So basically adding an option -v,1.4.0.a should work. https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark-configure.html 2015-06-17 15:32 GMT+02:00 Hideyoshi Maeda

Re: Spark or Storm

2015-06-17 Thread Ashish Soni
As per my best understanding, Spark Streaming offers exactly-once processing; is this achieved only through updateStateByKey, or is there another way to do the same? Ashish On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji eshi...@gmail.com wrote: In that case I assume you need exactly once semantics.

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
Hi Ayan, Admittedly I haven't done much with Kinesis, but if I'm not mistaken you should be able to use their processor interface for that. In this example, it's incrementing a counter:

Re: Spark on EMR

2015-06-17 Thread Hideyoshi Maeda
Any ideas what version of Spark is underneath? i.e. is it 1.4? and is SparkR supported on Amazon EMR? On Wed, Jun 17, 2015 at 12:06 AM, ayan guha guha.a...@gmail.com wrote: That's great news. Can I assume spark on EMR supports kinesis to hbase pipeline? On 17 Jun 2015 05:29, kamatsuoka

RE: Is HiveContext Thread Safe?

2015-06-17 Thread Cheng, Hao
Yes, it is thread safe. That’s how Spark SQL JDBC Server works. Cheng Hao From: V Dineshkumar [mailto:developer.dines...@gmail.com] Sent: Wednesday, June 17, 2015 9:44 PM To: user@spark.apache.org Subject: Is HiveContext Thread Safe? Hi, I have a HiveContext which I am using in multiple
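A small sketch of the usage pattern being asked about, assuming a single shared HiveContext and hypothetical table names: each thread submits its own query through the same context and the resulting jobs are scheduled independently.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object ConcurrentHiveQueries {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hivecontext-threads"))
        val hiveContext = new HiveContext(sc)  // one context shared by all threads

        // Two queries submitted concurrently against the same HiveContext
        // (hypothetical table names).
        val queries = Seq("SELECT COUNT(*) FROM table_a", "SELECT COUNT(*) FROM table_b")
        val futures = queries.map(q => Future(hiveContext.sql(q).collect()))
        futures.foreach(f => println(Await.result(f, Duration.Inf).mkString(", ")))
        sc.stop()
      }
    }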

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
PS just to elaborate on my first sentence, the reason Spark (not streaming) can offer exactly once semantics is because its update operation is idempotent. This is easy to do in a batch context because the input is finite, but it's harder in streaming context. On Wed, Jun 17, 2015 at 2:00 PM,

Re: Intermediate stage will be cached automatically?

2015-06-17 Thread canan chen
Yes, actually on the storage UI there's no data cached. But the behavior confuses me. If I call the cache method as follows, the behavior is the same as without calling the cache method; why is that? val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2) data.cache()
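For reference, a runnable sketch of the example discussed in this thread (local master is an assumption for illustration). The second count() reuses the shuffle files written by reduceByKey for the first count(), which is why only one stage runs, with or without cache().

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleReuseExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shuffle-reuse").setMaster("local[2]"))

        val data = sc.parallelize(1 to 10, 2).map(e => (e % 2, 2)).reduceByKey(_ + _, 2)
        // With or without the line below, the second count skips the first stage,
        // because the shuffle output written by reduceByKey is reused.
        data.cache()

        println(data.count())  // 2 stages: the map side plus the reduce side
        println(data.count())  // 1 stage: the map-side stage is skipped
        sc.stop()
      }
    }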

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
Processing stuff in batch is not the same thing as being transactional. If you look at Storm, it will e.g. skip tuples that were already applied to a state to avoid counting stuff twice etc. Spark doesn't come with such a facility, so you could end up counting twice etc. On Wed, Jun 17, 2015 at

Re: Intermediate stage will be cached automatically?

2015-06-17 Thread ayan guha
It's not cached per se. For example, you will not see this in the Storage tab in the UI. However, Spark has read the data and it's in memory right now. So the next count call should be very fast. Best Ayan On Wed, Jun 17, 2015 at 10:21 PM, Mark Tse mark@d2l.com wrote: I think

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
So Spark (not streaming) does offer exactly once. Spark Streaming, however, can only offer exactly-once semantics *if the update operation is idempotent*. updateStateByKey's update operation is idempotent, because it completely replaces the previous state. So as long as you use Spark streaming, you
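A minimal sketch of the updateStateByKey pattern referred to here, assuming a word-count stream over a socket source (host, port and checkpoint path are hypothetical). The update function returns the complete new state each batch, i.e. it replaces rather than appends.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulWordCount {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("stateful-sketch"), Seconds(2))
        ssc.checkpoint("/tmp/streaming-checkpoint")  // required by updateStateByKey

        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
        // The update function computes the full new state from the new values
        // and the old state, so re-applying it yields the same result.
        val counts = words.map((_, 1)).updateStateByKey[Int] {
          (newValues: Seq[Int], oldCount: Option[Int]) =>
            Some(newValues.sum + oldCount.getOrElse(0))
        }
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }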

Is HiveContext Thread Safe?

2015-06-17 Thread V Dineshkumar
Hi, I have a HiveContext which I am using in multiple threads to submit a Spark SQL query using the *sql* method. I just wanted to know whether this method is thread-safe or not. Will all my queries be submitted at the same time, independent of each other, or will they be submitted sequentially, one after the

Re: Spark on EMR

2015-06-17 Thread Kelly, Jonathan
Yes, for now it is a wrapper around the old install-spark BA, but that will change soon. The currently supported version in AMI 3.8.0 is 1.3.1, as 1.4.0 was released too late to include it in AMI 3.8.0. Spark 1.4.0 support is coming soon though, of course. Unfortunately, though install-spark is

Using spark.hadoop.* to set Hadoop properties

2015-06-17 Thread Corey Nolet
I've become accustomed to being able to use system properties to override properties in the Hadoop Configuration objects. I just recently noticed that when Spark creates the Hadoop Configuration in the SparkContext, it cycles through any properties prefixed with spark.hadoop. and adds those
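A small sketch of the behavior being described: any property prefixed with spark.hadoop. on the SparkConf is copied, with the prefix stripped, into the Hadoop Configuration that the SparkContext builds. The property name used below is just an example.

    import org.apache.spark.{SparkConf, SparkContext}

    object HadoopPropsViaSparkConf {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("hadoop-props-sketch")
          .setMaster("local[2]")
          // Anything prefixed with "spark.hadoop." ends up in the Hadoop
          // Configuration with the prefix stripped.
          .set("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", "134217728")

        val sc = new SparkContext(conf)
        // Prints "134217728": the property was copied into hadoopConfiguration.
        println(sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.split.maxsize"))
        sc.stop()
      }
    }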

Re: spark sql and cassandra. spark generates 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Serega Sheypak
So, there is some input: the problem could be in spark-sql-thriftserver. When I use the Spark console to submit the SQL query, it takes 10 seconds and a reasonable number of tasks. import com.datastax.spark.connector._; val cc = new CassandraSQLContext(sc); cc.sql(select su.user_id from
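For reference, a minimal sketch of the shell pattern described above; the original query is truncated in the archive, so the keyspace and table names here are hypothetical.

    import com.datastax.spark.connector._
    import org.apache.spark.sql.cassandra.CassandraSQLContext

    // sc is the SparkContext provided by the Spark shell.
    val cc = new CassandraSQLContext(sc)
    cc.setKeyspace("my_keyspace")                            // hypothetical keyspace
    val userIds = cc.sql("SELECT user_id FROM site_users")   // hypothetical table
    println(userIds.count())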

Re: Spark or Storm

2015-06-17 Thread ayan guha
Thanks for this. It's a KCL-based Kinesis application. But because it's just a Java application, we are thinking of using Spark on EMR or Storm for fault tolerance and load balancing. Is that a correct approach? On 17 Jun 2015 23:07, Enno Shioji eshi...@gmail.com wrote: Hi Ayan, Admittedly I haven't

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus additionally, elastic scaling unlike Storm), Kinesis providing the coordination. My understanding is that it's like a naked Storm worker process that can consequently only do map. I haven't really used it tho, so can't

Re: Spark or Storm

2015-06-17 Thread Ashish Soni
@Enno As per the latest version and documentation, Spark Streaming does offer exactly-once semantics using the improved Kafka integration. Note: I have not tested it yet. Any feedback will be helpful if anyone has tried the same. http://koeninger.github.io/kafka-exactly-once/#7

RE: generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Cheng, Hao
Seems you're hitting the self-join; currently Spark SQL won't cache any result/logical tree for further analysis or computation for a self-join. Since the logical tree is huge, it's reasonable that it takes a long time to generate its tree string recursively. And I also doubt the computation can finish

Interpreting what gets printed as one submits spark application

2015-06-17 Thread shreesh
I am fairly new to Spark. I configured 3 machines (2 slaves) in a standalone cluster. I just wanted to know what exactly the meaning is of: [Stage 0:==(25 + 4) / 500] This gets printed to the terminal when I submit my app. I understand

Read/write metrics for jobs which use S3

2015-06-17 Thread Abhishek Modi
I mostly use Amazon S3 for reading input data and writing output data for my Spark jobs. I want to know the number of bytes read/written by my job from S3. In Hadoop, there are FileSystemCounters for this; is there something similar in Spark? If there is, can you please guide me on how to use

Shuffle produces one huge partition

2015-06-17 Thread Al M
I have 2 RDDs I want to join. We will call them RDD A and RDD B. RDD A has 1 billion rows; RDD B has 100k rows. I want to join them on a single key. 95% of the rows in RDD A have the same key to join with RDD B. Before I can join the two RDDs, I must map them to tuples where the first element
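A common workaround for this kind of skew (a sketch, not taken from the thread; paths, key extraction and types are hypothetical): since RDD B is small, broadcast it and do a map-side join, so the dominant key of RDD A never gets shuffled into a single partition.

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-join-sketch"))

        val rddA = sc.textFile("/data/big").map(line => (line.split(",")(0), line))    // ~1B rows
        val rddB = sc.textFile("/data/small").map(line => (line.split(",")(0), line))  // ~100k rows

        // Ship the small side to every executor instead of shuffling the big side.
        val smallTable = sc.broadcast(rddB.collectAsMap())

        val joined = rddA.mapPartitions { iter =>
          iter.flatMap { case (key, value) =>
            smallTable.value.get(key).map(other => (key, (value, other)))
          }
        }
        println(joined.count())
        sc.stop()
      }
    }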

Re: ClassNotFound exception from closure

2015-06-17 Thread Akhil Das
Not sure why spark-submit isn't shipping your project jar (maybe try with --jars). You can also do sc.addJar(/path/to/your/project.jar); it should solve it. Thanks Best Regards On Wed, Jun 17, 2015 at 6:37 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, running into a pretty

Re: Loading lots of parquet files into dataframe from s3

2015-06-17 Thread arnonrgo
What happens is that Spark opens the files in order to merge the schema. Unfortunately Spark assumes that the files are local and that access is fast, which makes this step extremely slow on S3. If you know all the files use the same schema (e.g. it is the result of a previous job)
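One workaround along the lines hinted at here, sketched under the assumption that all files share a known, identical schema (paths and fields are hypothetical): pass the schema explicitly via the DataFrameReader so Spark does not have to derive it by opening every file; how much per-file footer reading this actually avoids depends on the Spark version.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    object ParquetExplicitSchema {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parquet-schema-sketch"))
        val sqlContext = new SQLContext(sc)

        // The schema the files are known to share (hypothetical fields).
        val schema = StructType(Seq(
          StructField("id", LongType, nullable = false),
          StructField("name", StringType, nullable = true)))

        // Supplying the schema up front sidesteps schema inference/merging.
        val df = sqlContext.read.schema(schema).parquet("s3n://my-bucket/output")
        println(df.count())
        sc.stop()
      }
    }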

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
The thing is, even with that improvement, you still have to make updates idempotent or transactional yourself. If you read http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics that refers to the latest version, it says: Semantics of output operations
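A minimal sketch of what "making updates idempotent yourself" can look like, assuming a word-count stream and a sink with overwrite-by-key (upsert) semantics; the source, store and key scheme are hypothetical. Keying each write by (batch time, word) means a replayed batch overwrites the same rows instead of double counting.

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

    object IdempotentOutputSketch extends Serializable {
      // Stand-in for an external store with overwrite-by-key (upsert) semantics.
      def upsert(key: String, value: Long): Unit = { /* overwrite any previous value */ }

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("idempotent-sketch"), Seconds(1))
        val counts = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _)

        counts.foreachRDD { (rdd: RDD[(String, Long)], time: Time) =>
          rdd.foreachPartition { records =>
            // Deterministic key per batch: re-running the batch rewrites
            // the same rows rather than adding to them.
            records.foreach { case (word, count) => upsert(s"${time.milliseconds}:$word", count) }
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }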

Re: Unable to use more than 1 executor for spark streaming application with YARN

2015-06-17 Thread Saiph Kappa
How can I get more information regarding this exception? On Wed, Jun 17, 2015 at 1:17 AM, Saiph Kappa saiph.ka...@gmail.com wrote: Hi, I am running a simple spark streaming application on hadoop 2.7.0/YARN (master: yarn-client) with 2 executors in different machines. However, while the app

Re: Spark SQL and Skewed Joins

2015-06-17 Thread Koert Kuipers
Could it be composed, maybe? A general version, and then a SQL version that exploits the additional info/abilities available there and uses the general version internally... I assume the SQL version can benefit from the logical phase optimization to pick join details. Or is there more? On Tue, Jun

Re: generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Jan-Paul Bultmann
Seems you're hitting the self-join; currently Spark SQL won't cache any result/logical tree for further analysis or computation for a self-join. Other joins don’t suffer from this problem? Since the logical tree is huge, it's reasonable that it takes a long time to generate its tree string

Re: Spark or Storm

2015-06-17 Thread Michael Segel
Actually the reverse. Spark Streaming is really a micro-batch system where the smallest window is half a second (500 ms). So for CEP, it's not really a good idea. So in terms of options… Spark Streaming, Storm, Samza, Akka and others… Storm is probably the easiest to pick up, Spark Streaming

Re: ALS predictALL not completing

2015-06-17 Thread Nick Pentreath
So to be clear, you're trying to use the recommendProducts method of MatrixFactorizationModel? I don't see predictAll in 1.3.1. 1.4.0 has a more efficient method to recommend products for all users (or vice versa):
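A short sketch of how that 1.4.0 method can be used, assuming a previously trained and saved model; the path is hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    object RecommendAllSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("recommend-all-sketch"))
        val model = MatrixFactorizationModel.load(sc, "/models/als")  // hypothetical path

        // Spark 1.4.0: top 10 products for every user in one distributed call.
        val recs = model.recommendProductsForUsers(10)
        recs.take(5).foreach { case (userId, ratings) =>
          println(s"user $userId -> " + ratings.map(_.product).mkString(", "))
        }
        sc.stop()
      }
    }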

Re: SparkR 1.4.0: read.df() function fails

2015-06-17 Thread Stensrud, Erik
Thanks to both of you! You solved the problem. Thanks Erik Stensrud Sent from my iPhone On 16 Jun 2015, at 20:23, Guru Medasani gdm...@gmail.com wrote: Hi Esten, Looks like your sqlContext is connected to a Hadoop/Spark cluster, but the file path you specified is
