Re: Alter table fails to find table

2015-09-04 Thread Akhil Das
Looks like a version mismatch; make sure you have the same versions of the jars everywhere (the datanucleus jars, etc.). There was a similar discussion over here: https://issues.apache.org/jira/browse/SPARK-4199 Thanks Best Regards On Thu, Sep 3, 2015 at 9:59 AM, Tim Smith wrote:

Python Spark Streaming example with textFileStream does not work. Why?

2015-09-04 Thread Kamilbek
I am using Spark 1.3.1 and Python 2.7. This is my first experience with Spark Streaming. I am trying an example that reads data from a file using Spark Streaming. Here is the link to the example: https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py My code is the

Re: Getting an error when trying to read a GZIPPED file

2015-09-04 Thread Akhil Das
Are you doing a .cache after the sc.textFile? If so, you can set the StorageLevel to MEMORY_AND_DISK to avoid that. Thanks Best Regards On Thu, Sep 3, 2015 at 10:11 AM, Spark Enthusiast wrote: > Folks, > > I have an input file which is gzipped. I use
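A minimal Scala sketch of that suggestion, assuming an existing SparkContext sc (the input path is hypothetical). Gzip files are not splittable, so each file lands in a single partition, and MEMORY_AND_DISK lets an oversized partition spill to disk rather than fail when memory is tight:

    import org.apache.spark.storage.StorageLevel

    // Hypothetical gzipped input; each .gz file becomes one (possibly large) partition.
    val lines = sc.textFile("/data/input/events.gz")
    // Persist with MEMORY_AND_DISK so oversized partitions spill to disk instead of causing OOM.
    val cached = lines.persist(StorageLevel.MEMORY_AND_DISK)
    println(cached.count())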

Output files of saveAsText are getting stuck in temporary directory

2015-09-04 Thread Chirag Dewan
Hi, I have a 2 node Spark cluster and I am trying to read data from a Cassandra cluster and save the data as a CSV file. Here is my code: JavaRDD mapPair = cachedRdd.map(new Function() { /**

Re: Parsing Avro from Kafka Message

2015-09-04 Thread Akhil Das
Something like this? val avroStream = KafkaUtils.createDirectStream[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](..) val avroData = avroStream.map(x => x._1.datum().toString) Thanks Best Regards On Thu, Sep 3, 2015 at 6:17 PM, Daniel Haviv <
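For reference, a hedged Scala sketch of one way to parse Avro payloads from a Kafka direct stream; it decodes the raw bytes with Avro's GenericDatumReader rather than using the AvroKeyInputFormat types above. The broker, topic, and schema are hypothetical, and an existing StreamingContext ssc is assumed:

    import kafka.serializer.{DefaultDecoder, StringDecoder}
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.io.DecoderFactory
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // hypothetical broker
    val topics = Set("avro-events")                                  // hypothetical topic
    val writerSchemaJson = "..."                                     // the Avro writer schema used by the producer

    // Pull the raw message bytes from Kafka.
    val stream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
      ssc, kafkaParams, topics)

    // Decode each payload into a GenericRecord, then to its JSON string form.
    val avroData = stream.mapPartitions { iter =>
      val schema = new Schema.Parser().parse(writerSchemaJson)
      val reader = new GenericDatumReader[GenericRecord](schema)
      iter.map { case (_, bytes) =>
        val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
        reader.read(null, decoder).toString
      }
    }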

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
Tathagata, Checkpointing is turned on but we were not recovering. I'm looking at the logs now, feeding fresh content hours after the restart. Here's a snippet: 2015-09-04 06:11:20,013 ... Documents processed: 0. 2015-09-04 06:11:30,014 ... Documents processed: 0. 2015-09-04 06:11:40,011 ...

How do we get the Spark Streaming logs while it is active?

2015-09-04 Thread Uthayan Suthakar
Hello all, I'm using Yarn-cluster mode to run the Spark Streaming job, but I can only get the logs once the job is complete (with manual intervention). I would like to see the logs while it is running; is this possible?

Re: Output files of saveAsText are getting stuck in temporary directory

2015-09-04 Thread Sean Owen
That means the save has not finished yet. Are you sure it did? It writes to _temporary while it's in progress. On Fri, Sep 4, 2015 at 10:10 AM, Chirag Dewan wrote: > Hi, > > > > I have a 2 node Spark cluster and I am trying to read data from a Cassandra > cluster and

RE: Output files of saveAsText are getting stuck in temporary directory

2015-09-04 Thread Chirag Dewan
Yes. The driver has stopped successfully. The whole shutdown succeeded without any errors in the logs. I am using Spark 1.4.1 with Cassandra 2.0.14. Chirag -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Friday, September 04, 2015 3:23 PM To: Chirag Dewan Cc:

Re: NOT IN in Spark SQL

2015-09-04 Thread Akhil Das
I think Spark doesn't support NOT IN clauses, but you can do the same with a LEFT OUTER JOIN. Something like: SELECT A.id FROM A LEFT OUTER JOIN B ON (B.id = A.id) WHERE B.id IS NULL Thanks Best Regards On Thu, Sep 3, 2015 at 8:46 PM, Pietro Gentile < pietro.gentile89.develo...@gmail.com>
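A short sketch of running that rewrite through an existing sqlContext, assuming A and B are already registered as temp tables with an id column (table names are taken from the example above):

    val result = sqlContext.sql(
      """SELECT A.id
        |FROM A LEFT OUTER JOIN B ON (B.id = A.id)
        |WHERE B.id IS NULL""".stripMargin)
    result.show()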

[spark-streaming] New directStream API reads topic's partitions sequentially. Why?

2015-09-04 Thread ponkin
Hi, I am trying to read a Kafka topic with the new directStream method in KafkaUtils. I have a Kafka topic with 8 partitions. I am running the streaming job on YARN with 8 executors, each with 1 core. I noticed that Spark reads all of the topic's partitions in one executor sequentially - this is obviously

Re: Parquet partitioning for unique identifier

2015-09-04 Thread Cheng Lian
What version of Spark were you using? Have you tried increasing --executor-memory? This schema looks pretty normal. And Parquet stores all keys of a map in a single column. Cheng On 9/4/15 4:00 PM, Kohki Nishio wrote: The stack trace is this java.lang.OutOfMemoryError: Java heap space

Re: [spark-streaming] New directStream API reads topic's partitions sequentially. Why?

2015-09-04 Thread Cody Koeninger
The direct stream just makes a spark partition per kafka partition, so if those partitions are not getting evenly distributed among executors, something else is probably wrong with your configuration. If you replace the kafka stream with a dummy rdd created with e.g. sc.parallelize, what happens?
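A hedged sketch of that debugging step, swapping the Kafka stream for a dummy stream built from sc.parallelize via queueStream (assumes an existing StreamingContext ssc; the data is hypothetical):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD

    // Dummy stream: one parallelized RDD pushed through a queue instead of Kafka.
    val queue = mutable.Queue[RDD[String]]()
    queue += ssc.sparkContext.parallelize(1 to 100000).map(_.toString)
    val dummyStream = ssc.queueStream(queue)

    // Check how the dummy partitions are distributed across executors.
    dummyStream.foreachRDD { rdd =>
      println(s"partitions: ${rdd.partitions.length}")
    }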

Re: Small File to HDFS

2015-09-04 Thread Tao Lu
Basically they need NoSQL-like random update access. On Fri, Sep 4, 2015 at 9:56 AM, Ted Yu wrote: > What about concurrent access (read / update) to the small file with same > key ? > > That can get a bit tricky. > > On Thu, Sep 3, 2015 at 2:47 PM, Jörn Franke

SparkR / MLlib Integration

2015-09-04 Thread Jonathan Hodges
Hi Spark Experts, We are trying to streamline the development lifecycle of our data scientists taking algorithms from the lab into production. Currently the tool of choice for our data scientists is R. Historically our engineers have had to manually convert the R based algorithms to Java or

RE: NOT IN in Spark SQL

2015-09-04 Thread Ewan Leith
Spark SQL doesn’t support “NOT IN”, but I think HiveQL does, so give the HiveContext a try rather than the SQLContext. Here are the Spark 1.2 docs on it, but it’s basically identical to running with the SQLContext: https://spark.apache.org/docs/1.2.0/sql-programming-guide.html#tab_scala_6
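A hedged sketch of switching to HiveContext (whether a given NOT IN form parses still depends on the Spark version; the table and values are hypothetical, and sc is an existing SparkContext):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Simple NOT IN with a literal list; the subquery form may or may not be supported.
    val df = hiveContext.sql("SELECT id FROM A WHERE id NOT IN (1, 2, 3)")
    df.show()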

Partitions with zero records & variable task times

2015-09-04 Thread mark
I am trying to tune a Spark job and have noticed some strange behavior - tasks in a stage vary in execution time, ranging from 2 seconds to 20 seconds. I assume tasks should all run in roughly the same amount of time in a well tuned job. So I did some investigation - the fast tasks appear to have

Re: Small File to HDFS

2015-09-04 Thread Ted Yu
What about concurrent access (read / update) to the small file with same key ? That can get a bit tricky. On Thu, Sep 3, 2015 at 2:47 PM, Jörn Franke wrote: > Well it is the same as in normal hdfs, delete file and put a new one with > the same name works. > > Le jeu. 3

Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

2015-09-04 Thread unk1102
Hi, I have a Spark job which does some processing on ORC data and writes the ORC data back using the DataFrameWriter save() API introduced in Spark 1.4.0. I have the following piece of code, which uses a lot of shuffle memory. How do I optimize the code below? Is there anything wrong with it? It is working fine

Re: SparkR / MLlib Integration

2015-09-04 Thread Jörn Franke
You can always use the ML libs in R, but you have to integrate them in SparkR (i.e., write all the logic to run in parallel, etc.). However, for your use case it may make more sense to write the R wrapper for MLlib yourself, if the project cannot provide it in time. It is not that difficult to call Java or

Re: Small File to HDFS

2015-09-04 Thread Ted Yu
bq. Then use hbase +1 On Fri, Sep 4, 2015 at 9:00 AM, Jörn Franke wrote: > Then use hbase or similar. You originally wrote it was just for storing. > > On Fri, Sep 4, 2015 at 16:30, Tao Lu wrote: > >> Basically they need NOSQL like random update

Re: repartition on direct kafka stream

2015-09-04 Thread Cody Koeninger
The answer already given is correct. You shouldn't doubt this, because you've already seen the shuffle data change accordingly. On Fri, Sep 4, 2015 at 11:25 AM, Shushant Arora wrote: > But Kafka stream has underlyng RDD which consists of offsets reanges only- > so

New to Spark - Paritioning Question

2015-09-04 Thread mmike87
Hello, I am new to Apache Spark and this is my company's first Spark project. Essentially, we are calculating models dealing with Mining data using Spark. I am holding all the source data in a persisted RDD that we will refresh periodically. When a "scenario" is passed to the Spark job (we're

ClassCastException in driver program

2015-09-04 Thread Jeff Jones
We are using Scala 2.11 for a driver program that is running Spark SQL queries in a standalone cluster. I’ve rebuilt Spark for Scala 2.11 using the instructions at http://spark.apache.org/docs/latest/building-spark.html. I’ve had to work through a few dependency conflicts, but all in all it

Re: Python Spark Streaming example with textFileStream does not work. Why?

2015-09-04 Thread Davies Liu
Spark Streaming only processes NEW files that appear after it has started, so you should point it to a directory and copy the file into it after the job has started. On Fri, Sep 4, 2015 at 5:15 AM, Kamilbek wrote: > I use spark 1.3.1 and Python 2.7 > > It is my first experience with Spark Streaming.
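A minimal Scala sketch of the same point (the directory path is hypothetical): files already present when the context starts are ignored; only files that show up afterwards are processed.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("hdfs-wordcount")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Watch a directory; only NEW files copied in after start() are picked up.
    val lines = ssc.textFileStream("hdfs:///user/demo/incoming")
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    // Now copy fresh files into hdfs:///user/demo/incoming from another shell.
    ssc.awaitTermination()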

Re: Spark partitions from CassandraRDD

2015-09-04 Thread Ankur Srivastava
Oh, if that is the case then you can try tuning "spark.cassandra.input.split.size": spark.cassandra.input.split.size | approx number of Cassandra partitions in a Spark partition | 10 Hope this helps. Thanks Ankur On Thu, Sep 3, 2015 at 12:22 PM, Alaa Zubaidi (PDF)
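A hedged sketch of where that setting goes (the host and value are hypothetical; the exact semantics depend on the spark-cassandra-connector version):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-read")
      .set("spark.cassandra.connection.host", "10.0.0.1")   // hypothetical Cassandra host
      .set("spark.cassandra.input.split.size", "10000")     // approx Cassandra partitions per Spark partition (value hypothetical)
    val sc = new SparkContext(conf)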

Re: repartition on direct kafka stream

2015-09-04 Thread Shushant Arora
Yes, agreed, the shuffle data reveals that offsets+data is transformed. I wanted to understand whether mapPartitions, or any transformation in (directKafkaStream.repartition(numexecutors).mapPartitions(...)), happens before the shuffle or after the shuffle. If after the shuffle - is this due to the reason that very

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Tathagata Das
Could you see what the streaming tab in the Spark UI says? It should show the underlying batch duration of the StreamingContext, the details of when the batch starts, etc. BTW, it seems that the 5.6 or 6.8 seconds delay is present only when data is present (that is, Documents processed: > 0)

Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
Also, do you mean two partitions or two partition columns? If there are many partitions it can be much slower. In Spark 1.5 I'd consider setting spark.sql.hive.metastorePartitionPruning=true if you have predicates over the partition columns. On Fri, Sep 4, 2015 at 12:54 PM, Michael Armbrust
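A one-line sketch of that setting (Spark 1.5+; assumes an existing sqlContext, and the query and partition column are hypothetical):

    // Push partition predicates down to the Hive metastore instead of fetching all partitions.
    sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
    val df = sqlContext.sql("SELECT count(*) FROM events WHERE dt = '2015-09-04'")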

Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
What format is this table in? For Parquet and other optimized formats we cache a bunch of file metadata on first access to make interactive queries faster. On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan wrote: > Hello, > > I am using SparkSQL to query some Hive tables. Most of

Re: DataFrame creation delay?

2015-09-04 Thread Isabelle Phan
Hi Michael, Thanks a lot for your reply. This table is stored as text file with tab delimited columns. You are correct, the problem is because my table has too many partitions (1825 in total). Since I am on Spark 1.4, I think I am hitting bug 6984

Spark on Yarn vs Standalone

2015-09-04 Thread Alexander Pivovarov
Hi Everyone, we are trying the latest AWS emr-4.0.0 and Spark, and my question is about YARN vs Standalone mode. Our use case is: start a 100-150 node cluster every week, run one heavy Spark job (5-6 hours), save the data to S3, and stop the cluster. Officially, AWS emr-4.0.0 comes with Spark on YARN. It's

Re: DataFrame creation delay?

2015-09-04 Thread Michael Armbrust
If you run sqlContext.table("...").registerTempTable("...") that temp table will cache the partition lookup. On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan wrote: > Hi Michael, > > Thanks a lot for your reply. > > This table is stored as text file with tab delimited
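A hedged sketch of that workaround (table names are hypothetical; sqlContext is assumed): the partition lookup is paid once when the temp table is registered, and later queries reuse it.

    // Register once; the expensive partition discovery happens here.
    sqlContext.table("my_table").registerTempTable("my_table_cached")
    // Subsequent queries go through the temp table without re-listing partitions.
    val df = sqlContext.sql("SELECT count(*) FROM my_table_cached")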

Exception in saving MatrixFactorizationModel

2015-09-04 Thread Madawa Soysa
Hi All, I'm getting an error when trying to save a MatrixFactorizationModel. I'm using following method to save the model. model.save(sc, outPath) I'm getting the following exception when saving the model. I have attached the full stack trace. Any help would be appreciated to resolve this

Failing to include multiple JDBC drivers

2015-09-04 Thread Nicholas Connor
So, I need to connect to multiple databases to do cool stuff with Spark. To do this, I need multiple database drivers: Postgres + MySQL. *Problem*: Spark fails to run both drivers. This method works for one driver at a time: spark-submit --driver-class-path="/driver.jar" These methods do

Is HDFS required for Spark streaming?

2015-09-04 Thread N B
Hello, We have a Spark Streaming program that is currently running on a single node in "local[n]" master mode. We currently give it local directories for Spark's own state management etc. The input is streaming from network/flume and output is also to network/kafka etc, so the process as such

Re: Small File to HDFS

2015-09-04 Thread Jörn Franke
Maybe you can tell us more about your use case; I somehow have the feeling that we are missing something here. On Thu, Sep 3, 2015 at 15:54, Jörn Franke wrote: > > Store them as hadoop archive (har) > > On Wed, Sep 2, 2015 at 18:07, wrote: >> Hello, >>

Help! Problem of UnsatisfiedLinkError with Spark JDBC JNI dynamic library

2015-09-04 Thread Jonathan Yue
Could anyone help me with Spark Scala using JDBC to get data? The JDBC driver is based on our home-brewed C++ JNI library. I use sbt to compile and package the app into a jar and spark-submit to submit the jar to standalone Spark. The problem is that it finally gives "Exception in thread "main"

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
Tathagata, In our logs I see the batch duration millis being set first to 10 and then to 20 seconds. I don't see the 20 being reflected later during ingestion. In the Spark UI under Streaming I see the output below; notice the *10 second* batch interval. Can you think of a reason why it's stuck at

Re: Parsing Avro from Kafka Message

2015-09-04 Thread Daniel Haviv
This will create an RDD[String], and what I want is a DF based on the Avro schema. Thank you Akhil. Sent from my iPhone > On Sep 4, 2015, at 15:05, Akhil Das wrote: > > Something like this? > > val avroStream =

Can we gracefully kill stragglers in Spark SQL

2015-09-04 Thread Jia Zhan
Hello all, I am new to Spark and have been working on a small project trying to tackle the straggler problems. I ran some SQL queries (GROUPBY) on a small cluster and observed that some tasks take several minutes while others finish in seconds. I know that Spark already has speculation mode but

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Tathagata Das
Are you sure you are not accidentally recovering from checkpoint? How are you using StreamingContext.getOrCreate() in your code? TD On Fri, Sep 4, 2015 at 4:53 PM, Dmitry Goldenberg wrote: > Tathagata, > > In our logs I see the batch duration millis being set first to

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
I'd think that we wouldn't be "accidentally recovering from checkpoint" hours or even days after consumers have been restarted, plus the content is the fresh content that I'm feeding, not some content that had been fed before the last restart. The code is basically as follows: SparkConf

Re: Batchdurationmillis seems "sticky" with direct Spark streaming

2015-09-04 Thread Dmitry Goldenberg
Sorry, more relevant code below: SparkConf sparkConf = createSparkConf(appName, kahunaEnv); JavaStreamingContext jssc = params.isCheckpointed() ? createCheckpointedContext(sparkConf, params) : createContext(sparkConf, params); jssc.start(); jssc.awaitTermination(); jssc.close(); ………..
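For comparison, a hedged Scala sketch of the standard getOrCreate pattern the thread is discussing (not the poster's actual createCheckpointedContext; sparkConf and the checkpoint path are assumed). Note that the factory function only runs on a fresh start; if a checkpoint exists, everything, including the batch duration, is restored from it, which would explain a "sticky" interval.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///user/demo/checkpoints"   // hypothetical path

    def createNewContext(): StreamingContext = {
      // Only executed when no checkpoint exists yet.
      val ssc = new StreamingContext(sparkConf, Seconds(20))
      ssc.checkpoint(checkpointDir)
      // ... set up DStreams here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createNewContext _)
    ssc.start()
    ssc.awaitTermination()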

What happens to this RDD? OutOfMemoryError

2015-09-04 Thread Kevin Mandich
Hi All, I'm using PySpark to create a corpus of labeled data points. I create an RDD called corpus, and then join to this RDD each newly-created feature RDD as I go. My code repeats something like this for each feature: feature = raw_data_rdd.map(...).reduceByKey(...).map(...) # create feature

Re: Is HDFS required for Spark streaming?

2015-09-04 Thread Tathagata Das
Shuffle spills will use local disk; HDFS is not needed. Spark and Spark Streaming checkpoint info WILL NEED HDFS for fault tolerance, so that state can be recovered even if the Spark cluster nodes go down. TD On Fri, Sep 4, 2015 at 2:45 PM, N B wrote: > Hello, > > We have a
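A tiny sketch of the checkpoint part (assumes an existing StreamingContext ssc; the path is hypothetical). Shuffle spills go to the executors' local disks automatically and need no configuration here:

    // Checkpoint data must live on fault-tolerant storage such as HDFS or S3.
    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")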

Re: Problem while loading saved data

2015-09-04 Thread Amila De Silva
Hi Ewan, To start up the cluster I simply ran ./sbin/start-master.sh from master node and ./sbin/start-slave.sh from the slave. I didn't configure hdfs explicitly. Is there something additional that has to be done? On Fri, Sep 4, 2015 at 12:42 AM, Ewan Leith wrote:

SparkContext initialization error- java.io.IOException: No space left on device

2015-09-04 Thread shenyan zhen
Has anyone seen this error? Not sure which dir the program was trying to write to. I am running Spark 1.4.1, submitting Spark job to Yarn, in yarn-client mode. 15/09/04 21:36:06 ERROR SparkContext: Error adding jar (java.io.IOException: No space left on device), was the --addJars option used?

Re: Parquet partitioning for unique identifier

2015-09-04 Thread Kohki Nishio
The stack trace is this java.lang.OutOfMemoryError: Java heap space at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65) at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57) at

Drools and Spark Integration - Need Help

2015-09-04 Thread Shiva moorthy
Hi Team, I am able to integrate Drools with Apache spark but after integration my application runs slower. Could you please give ideas about how Drools can be efficiently integrated with Spark? Appreciate your help. Thanks and Regards, Shiva

Re: How to determine the value for spark.sql.shuffle.partitions?

2015-09-04 Thread Adrien Mogenet
Not sure it would help and answer your question 100%, but the number of partitions is supposed to be at least roughly double your number of cores (surprised not to see this point in your list), and can easily grow up to 10x, until you notice too large an overhead, but that's not
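A hedged sketch of applying that rule of thumb (assumes an existing sqlContext; the core count is hypothetical):

    // Rule of thumb from above: start around 2x the total cores and tune upward if needed.
    val totalCores = 16 * 8   // e.g. 16 executors with 8 cores each (hypothetical)
    sqlContext.setConf("spark.sql.shuffle.partitions", (totalCores * 2).toString)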

Re: Parquet partitioning for unique identifier

2015-09-04 Thread Cheng Lian
Could you please provide the full stack trace of the OOM exception? Another common cause of Parquet OOM is super wide tables, say hundreds or thousands of columns. And in this case, the number of rows is mostly irrelevant. Cheng On 9/4/15 1:24 AM, Kohki Nishio wrote: let's say I have a data

Drools integration with Spark

2015-09-04 Thread Shiva Moorthy
I am able to integrate Drools with Apache Spark, but after integration my application runs slower. Could you please give ideas about how Drools can be efficiently integrated with Spark? Appreciate your help. Regards, Shiva -- View this message in context:

Re: Error using SQLContext in spark

2015-09-04 Thread Akhil Das
You need to add the spark-sql dependency to your project's build file. Thanks Best Regards On Wed, Sep 2, 2015 at 6:41 PM, rakesh sharma wrote: > Error: application failed with exception >
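A minimal build.sbt sketch (adjust the Spark version to match your cluster):

    // Spark SQL lives in a separate artifact from spark-core.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.4.1" % "provided"
    )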

Spark Streaming - Small file in HDFS

2015-09-04 Thread Pravesh Jain
Were you able to find a solution to your problem?