Re: Estimate size of Dataframe programmatically

2015-08-10 Thread Srikanth
SizeEstimator.estimate(df) will not give the size of the DataFrame, right? I think it will give the size of the df object. With an RDD, I sample() and collect() and sum the size of each row. If I do the same with a DataFrame, it will no longer reflect the size of the columnar representation. I'd also like to know how
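A rough, sampling-based sketch of that approach for a DataFrame, assuming df and sqlContext are already in scope; note that SizeEstimator measures the collected Row objects on the driver, not Spark's internal columnar representation, so treat the result as a ballpark figure only:

    import org.apache.spark.util.SizeEstimator
    import org.apache.spark.sql.DataFrame

    // Collect a small sample, measure it on the driver, and scale up by the
    // sampling fraction. A heuristic, not the in-memory columnar size.
    def approxSize(df: DataFrame, fraction: Double = 0.01): Long = {
      val sample = df.sample(withReplacement = false, fraction).collect()
      (SizeEstimator.estimate(sample) / fraction).toLong
    }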

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
That looks like it's during recovery from a checkpoint, so it'd be driver memory not executor memory. How big is the checkpoint directory that you're trying to restore from? On Mon, Aug 10, 2015 at 10:57 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: We're getting the below error.

ClosureCleaner does not work for java code

2015-08-10 Thread Hao Ren
Consider two code snippets as the following: // Java code: abstract class Ops implements Serializable { public abstract Integer apply(Integer x); public void doSomething(JavaRDD<Integer> rdd) { rdd.map(x -> x + apply(x)) .collect() .forEach(System.out::println); } } public

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Ted Yu
I wonder during recovery from a checkpoint whether we can estimate the size of the checkpoint and compare with Runtime.getRuntime().freeMemory(). If the size of checkpoint is much bigger than free memory, log warning, etc Cheers On Mon, Aug 10, 2015 at 9:34 AM, Dmitry Goldenberg

Re: ClosureCleaner does not work for java code

2015-08-10 Thread Sean Owen
The difference is really that Java and Scala work differently. In Java, your anonymous subclass of Ops defined in (a method of) AbstractTest captures a reference to it. That much is 'correct' in that it's how Java is supposed to work, and AbstractTest is indeed not serializable since you didn't

Re: multiple dependency jars using pyspark

2015-08-10 Thread Jonathan Haddad
I figured out the issue - it had to do with the Cassandra jar I had compiled. I had tested a previous version. Using --jars (comma separated) and --driver-class-path (colon separated) is working. On Mon, Aug 10, 2015 at 1:08 AM ayan guha guha.a...@gmail.com wrote: Easiest way should be to add

How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
We're getting the below error. Tried increasing spark.executor.memory e.g. from 1g to 2g but the below error still happens. Any recommendations? Something to do with specifying -Xmx in the submit job scripts? Thanks. Exception in thread main java.lang.OutOfMemoryError: GC overhead limit

Re: Estimate size of Dataframe programmatically

2015-08-10 Thread Ted Yu
From a quick glance at SparkStrategies.scala, when statistics.sizeInBytes of the LogicalPlan is <= autoBroadcastJoinThreshold, the plan's output would be used in a broadcast join as the 'build' relation. FYI On Mon, Aug 10, 2015 at 8:04 AM, Srikanth srikanth...@gmail.com wrote:
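For reference, a hedged sketch of reading that estimate programmatically and of adjusting the threshold; df and sqlContext are assumed to be in scope, and the accessor path below is from a quick look at the 1.4/1.5 code and may vary between versions:

    // Catalyst's size estimate for the DataFrame's logical plan, in bytes.
    val estimatedBytes = df.queryExecution.analyzed.statistics.sizeInBytes

    // Plans estimated at or below this many bytes become the broadcast ('build')
    // side of a join; set to -1 to disable broadcast joins entirely.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)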

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Would there be a way to chunk up/batch up the contents of the checkpointing directories as they're being processed by Spark Streaming? Is it mandatory to load the whole thing in one go? On Mon, Aug 10, 2015 at 12:42 PM, Ted Yu yuzhih...@gmail.com wrote: I wonder during recovery from a

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
You need to keep a certain number of rdds around for checkpointing, based on e.g. the window size. Those would all need to be loaded at once. On Mon, Aug 10, 2015 at 11:49 AM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Would there be a way to chunk up/batch up the contents of the

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Ted Yu
Looks like the workaround is to reduce the *window length*. Cheers On Mon, Aug 10, 2015 at 10:07 AM, Cody Koeninger c...@koeninger.org wrote: You need to keep a certain number of rdds around for checkpointing, based on e.g. the window size. Those would all need to be loaded at once. On Mon, Aug

Re: Problems getting expected results from hbase_inputformat.py

2015-08-10 Thread Eric Bless
Thank you Gen, the changes to HBaseConverters.scala look to now be returning all column qualifiers, as follows -  (u'row1', {u'qualifier': u'a', u'timestamp': u'1438716994027', u'value': u'value1', u'columnFamily': u'f1', u'type': u'Put', u'row': u'row1'}) (u'row1', {u'qualifier': u'b',

Re: subscribe

2015-08-10 Thread Brandon White
https://www.youtube.com/watch?v=H07zYvkNYL8 On Mon, Aug 10, 2015 at 10:55 AM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at the first section of https://spark.apache.org/community Cheers On Mon, Aug 10, 2015 at 10:54 AM, Phil Kallos phil.kal...@gmail.com wrote: please

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Michael Armbrust
Older versions of Spark (i.e. when it was still called SchemaRDD instead of DataFrame) did not support merging different parquet schema. However, Spark 1.4+ should. On Sat, Aug 8, 2015 at 8:58 PM, sim s...@swoop.com wrote: Adam, did you find a solution for this? -- View this message in
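If it helps, a minimal read-side sketch; the path is a placeholder, and the explicit mergeSchema option is the 1.5 spelling (1.4 merged Parquet schemas by default):

    // Read Parquet files written with evolving schemas; columns missing from
    // older files come back as nulls in the merged schema.
    val merged = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("/data/events")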

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Simeon Simeonov
Michael, is there an example anywhere that demonstrates how this works with the schema changing over time? Must the Hive tables be set up as external tables outside of saveAsTable? In my experience, in 1.4.1, writing to a table with SaveMode.Append fails if the schemas don't match. Thanks, Sim

Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread allonsy
Hi everyone, I recently started using the new Kafka direct approach. Now, as far as I understood, each Kafka partition /is/ an RDD partition that will be processed by a single core. What I don't understand is the relation between those partitions and the blocks generated every blockInterval.

Re: How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread Michael Armbrust
You can't create a DataFrame from an arbitrary object since we don't know how to figure out the schema. You can either create a JavaBean https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema or manually create a row + specify the schema
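A minimal sketch of the second option (manual Row plus explicit schema); the field names and the source RDD myRdd are placeholders for whatever the custom format produces:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    // Describe the columns yourself, then map each record to a Row.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))

    val rowRdd = myRdd.map(r => Row(r.id, r.name))   // myRdd: RDD of your custom records
    val df = sqlContext.createDataFrame(rowRdd, schema)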

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I am running as a yarn-client which probably means that the program that submitted the job is where the listening is also occurring? I thought that the yarn is only used to negotiate resources in yarn-client master mode. On Mon, Aug 10, 2015 at 11:34 AM, Tathagata Das t...@databricks.com wrote:

Re: Problem with take vs. takeSample in PySpark

2015-08-10 Thread Davies Liu
I tested this in master (1.5 release), it worked as expected (changed spark.driver.maxResultSize to 10m): len(sc.range(10).map(lambda i: '*' * (1 << 23)).take(1)) gives 1, while len(sc.range(10).map(lambda i: '*' * (1 << 24)).take(1)) fails with 15/08/10 10:45:55 ERROR TaskSetManager: Total size of serialized results of 1

Re: Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread Luca
Thank you! :) 2015-08-10 19:58 GMT+02:00 Cody Koeninger c...@koeninger.org: There's no long-running receiver pushing blocks of messages, so blockInterval isn't relevant. Batch interval is what matters. On Mon, Aug 10, 2015 at 12:52 PM, allonsy luke1...@gmail.com wrote: Hi everyone, I

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
If you are running on a cluster, the listening is occurring on one of the executors, not in the driver. On Mon, Aug 10, 2015 at 10:29 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to run this program as a yarn-client. The job seems to be submitting successfully however I don't

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
In yarn-client mode, the driver is on the machine where you ran the spark-submit. The executors are running in the YARN cluster nodes, and the socket receiver listening on port is running in one of the executors. On Mon, Aug 10, 2015 at 11:43 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

Re: How to create DataFrame from a binary file?

2015-08-10 Thread Ted Yu
Umesh: Please take a look at the classes under: sql/core/src/main/scala/org/apache/spark/sql/parquet FYI On Mon, Aug 10, 2015 at 10:35 AM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi Bo thanks much let me explain please see the following code JavaPairRDD<String, PortableDataStream> pairRdd =

Re: Pagination on big table, splitting joins

2015-08-10 Thread Michael Armbrust
I think to use the *toLocalIterator* method or something like that, but I have doubts about memory and parallelism and am sure there is a better way to do it. It will still run all earlier parts of the job in parallel. Only the actual retrieving of the final partitions will be serial. This is
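For what it's worth, a sketch of the toLocalIterator route (df assumed in scope); the upstream stages still run in parallel, and the driver only ever holds one partition of results at a time:

    // Pull result partitions to the driver one at a time instead of collect().
    df.rdd.toLocalIterator.foreach { row =>
      // page through / write out each row here
      println(row)
    }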

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
Any recommendations for gracefully shutting down a Spark Streaming application? I am running it on yarn and need a way to trigger the shutdown externally, either via the yarn application -kill command or some other way, but the current batch needs to be processed completely and the checkpoint saved before shutting

Re: Kafka direct approach: blockInterval and topic partitions

2015-08-10 Thread Cody Koeninger
There's no long-running receiver pushing blocks of messages, so blockInterval isn't relevant. Batch interval is what matters. On Mon, Aug 10, 2015 at 12:52 PM, allonsy luke1...@gmail.com wrote: Hi everyone, I recently started using the new Kafka direct approach. Now, as far as I
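For reference, a minimal direct-stream setup (the broker list and topic are placeholders, ssc is an existing StreamingContext); each batch gets exactly as many Spark partitions as the topic has Kafka partitions, and no receiver or blockInterval is involved:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder
    val topics = Set("events")                                        // placeholder

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)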

Re: How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread Umesh Kacha
Hi Michael thanks for the reply. I know that I can create a DataFrame using a JavaBean or StructType; I want to know how I can create a DataFrame from the above code, which is a custom Hadoop format. On Tue, Aug 11, 2015 at 12:04 AM, Michael Armbrust mich...@databricks.com wrote: You can't create a

Re: How to create DataFrame from a binary file?

2015-08-10 Thread Umesh Kacha
Hi Bo thanks much let me explain please see the following code JavaPairRDD<String, PortableDataStream> pairRdd = javaSparkContext.binaryFiles("/hdfs/path/to/binfile"); JavaRDD<PortableDataStream> javardd = pairRdd.values(); DataFrame binDataFrame = sqlContext.createDataFrame(javaBinRdd,

How to use custom Hadoop InputFormat in DataFrame?

2015-08-10 Thread unk1102
Hi I have my own custom Hadoop InputFormat which I want to use in a DataFrame. How do we do that? I know I can use sc.hadoopFile(..) but then how do I convert it into a DataFrame? JavaPairRDD<Void, MyRecordWritable> myFormatAsPairRdd =

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
You need to keep a certain number of rdds around for checkpointing -- that seems like a hefty expense to pay in order to achieve fault tolerance. Why does Spark persist whole RDD's of data? Shouldn't it be sufficient to just persist the offsets, to know where to resume from? Thanks. On Mon,

Re: Questions about SparkSQL join on not equality conditions

2015-08-10 Thread gen tang
Hi, I am sorry to bother again. When I do a join as follows: df = sqlContext.sql("select a.someItem, b.someItem from a full outer join b on condition1 *or* condition2") df.first() The program failed because the result size is bigger than spark.driver.maxResultSize. It is really strange, as one record is no

Re: How to programmatically create, submit and report on Spark jobs?

2015-08-10 Thread Ted Yu
I found SPARK-3733 which was marked dup of SPARK-4924 which went to 1.4.0 FYI On Mon, Aug 10, 2015 at 5:12 AM, mark manwoodv...@googlemail.com wrote: Hi All I need to be able to create, submit and report on Spark jobs programmatically in response to events arriving on a Kafka bus. I also

Re: ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver

2015-08-10 Thread Tathagata Das
I think this may be expected. When the streaming context is stopped without the SparkContext, then the receivers are stopped by the driver. The receiver sends back the message that it has been stopped. This is being (probably incorrectly) logged with ERROR level. On Sun, Aug 9, 2015 at 12:52 AM,

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
Is it receiving any data? If so, then it must be listening. Alternatively, to test these theories, you can locally run a Spark standalone cluster (a one-node standalone cluster on your local machine), and submit your app in client mode on that to see whether you are seeing the process listening on

Re: Graceful shutdown for Spark Streaming

2015-08-10 Thread Michal Čizmazia
From logs, it seems that Spark Streaming does handle *kill -SIGINT* with graceful shutdown. Please could you confirm? Thanks! On 30 July 2015 at 08:19, anshu shukla anshushuk...@gmail.com wrote: Yes I was doing same , if You mean that this is the correct way to do Then I will verify it

Fw: Your Application has been Received

2015-08-10 Thread Shing Hing Man
Bar123 On Monday, 10 August 2015, 20:20, Resourcing Team barclayscare...@invalidemail.com wrote: Dear Shing Hing, Thank you for applying to Barclays. We have received your application and are currently reviewing your details. Updates on your progress will be emailed to you and can

Java Streaming Context - File Stream use

2015-08-10 Thread Ashish Soni
Please help, as I am not sure what is incorrect with the below code; it gives me a compilation error in Eclipse SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("JavaDirectKafkaWordCount"); JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Dmitry Goldenberg
Well, RDDs also contain data, don't they? The question is, what can be so hefty in the checkpointing directory to cause Spark driver to run out of memory? It seems that it makes checkpointing expensive, in terms of I/O and memory consumption. Two network hops -- to driver, then to workers.

collect() works, take() returns ImportError: No module named iter

2015-08-10 Thread YaoPau
I'm running Spark 1.3 on CDH 5.4.4, and trying to set up Spark to run via iPython Notebook. I'm getting collect() to work just fine, but take() errors. (I'm having issues with collect() on other datasets ... but take() seems to break every time I run it.) My code is below. Any thoughts? sc

Optimal way to implement a small lookup table for identifiers in an RDD

2015-08-10 Thread Mike Trienis
Hi All, I have an RDD of case class objects. scala> case class Entity( | value: String, | identifier: String | ) defined class Entity scala> Entity("hello", "id1") res25: Entity = Entity(hello,id1) During a map operation, I'd like to return a new RDD that contains all of the
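One common pattern for a small lookup table is to broadcast a plain Map and resolve identifiers inside the map function. A sketch, assuming entities is the RDD[Entity] above and that the table comfortably fits in memory:

    // Ship the lookup table to each executor once, not with every task closure.
    val lookup = sc.broadcast(Map("id1" -> "resolved-1", "id2" -> "resolved-2"))

    val resolved = entities.map { e =>
      e.copy(identifier = lookup.value.getOrElse(e.identifier, e.identifier))
    }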

Re: avoid duplicate due to executor failure in spark stream

2015-08-10 Thread Cody Koeninger
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations https://www.youtube.com/watch?v=fXnNEq1v3VA On Mon, Aug 10, 2015 at 4:32 PM, Shushant
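The pattern from the first link, in rough outline: read the offset ranges off each batch's RDD and store them atomically (or idempotently) alongside the output. Here directStream is a stream from createDirectStream and saveOffsets is a hypothetical helper for whatever store you use:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    directStream.foreachRDD { rdd =>
      val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        // write the partition's results idempotently or inside a transaction
      }
      // commit the offsets together with the output so a restart resumes
      // from the last successfully written batch instead of reprocessing it
      saveOffsets(offsetRanges)   // hypothetical helper
    }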

When will window ....

2015-08-10 Thread Martin Senne
When will window functions be integrated into Spark (without HiveContext)? Sent with AquaMail for Android http://www.aqua-mail.com On 10 August 2015 at 23:04:22, Michael Armbrust mich...@databricks.com wrote: You will need to use a HiveContext for window functions to work. On Mon, Aug 10,

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
Thanks! On Tue, Aug 11, 2015 at 1:34 AM, Tathagata Das t...@databricks.com wrote: 1. RPC can be done in many ways, and a web service is one of many ways. A even more hacky version can be the app polling a file in a file system, if the file exists start shutting down. 2. No need to set a

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Michael Armbrust
You will need to use a HiveContext for window functions to work. On Mon, Aug 10, 2015 at 1:26 PM, Jerry jerry.c...@gmail.com wrote: Hello, Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries to a data frame and I'm trying to figure out if I just have a bad setup or if
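A minimal sketch of lag over a window once a HiveContext is in play; the table and column names are placeholders:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.lag

    val hiveCtx = new HiveContext(sc)
    val df = hiveCtx.table("events")   // placeholder table

    val w = Window.partitionBy("userId").orderBy("eventTime")
    df.select(df("userId"), df("eventTime"), lag("value", 1).over(w).as("prevValue"))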

avoid duplicate due to executor failure in spark stream

2015-08-10 Thread Shushant Arora
Hi How can I avoid duplicate processing of kafka messages in spark stream 1.3 because of executor failure? 1. Can I somehow access accumulators of the failed task in the retry task, to skip those many events which were already processed by the failed task on this partition? 2. Or I'll have to persist each

Re: collect() works, take() returns ImportError: No module named iter

2015-08-10 Thread Ruslan Dautkhanov
There was a similar problem reported before on this list. Weird python errors like this generally mean you have different versions of python on the nodes of your cluster. Can you check that? From the error stack you use 2.7.10 |Anaconda 2.3.0 while the OS/CDH version of Python is probably 2.6. --

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I am using the same exact code: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.java Submitting like this: yarn: /opt/cloudera/parcels/CDH-5.4.0-1.cdh5.4.0.p0.27/bin/spark-submit --class

Re: TFIDF Transformation

2015-08-10 Thread pradyumnad
If you want to convert the hash to word, the very thought defies the usage of hashing. You may map the words with hashing, but that wouldn't be good. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/TFIDF-Transformation-tp24086p24203.html Sent from the

Re: collect() works, take() returns ImportError: No module named iter

2015-08-10 Thread Davies Liu
Is it possible that you have Python 2.7 on the driver, but Python 2.6 on the workers? PySpark requires that you have the same minor version of Python in both driver and worker. In PySpark 1.4+, it will do this check before running any tasks. On Mon, Aug 10, 2015 at 2:53 PM, YaoPau

Re: Do I really need to build Spark for Hive/Thrift Server support?

2015-08-10 Thread roni
Hi All, Any explanation for this? As Reece said I can do operations with hive but - val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) -- gives error. I have already created spark ec2 cluster with the spark-ec2 script. How can I build it again? Thanks _Roni On Tue, Jul 28, 2015

Re: Json parsing library for Spark Streaming?

2015-08-10 Thread pradyumnad
I use Play JSON; it may be the best known one. If you would like to try it, below is the sbt dependency "com.typesafe.play" % "play-json_2.10" % "2.2.1", -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Json-parsing-library-for-Spark-Streaming-tp24016p24204.html Sent from

Random Forest and StringIndexer in pyspark ML Pipeline

2015-08-10 Thread pkphlam
Hi, If I understand the RandomForest model in the ML Pipeline implementation in the ml package correctly, I have to first run my outcome label variable through the StringIndexer, even if my labels are numeric. The StringIndexer then converts the labels into numeric indices based on frequency of
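For reference, the Scala-side shape of that pipeline (the pyspark API mirrors it); the column names and trainingDf are placeholders. StringIndexer is what attaches the label metadata the tree-based classifiers in ml expect, even when the labels are already numeric:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.StringIndexer
    import org.apache.spark.ml.classification.RandomForestClassifier

    val indexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")
    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")

    val model = new Pipeline().setStages(Array(indexer, rf)).fit(trainingDf)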

Re: Streaming of WordCount example

2015-08-10 Thread Mohit Anchlia
I do see this message: 15/08/10 19:19:12 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources On Mon, Aug 10, 2015 at 4:15 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am using the same

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
1. When you are running locally, make sure the master in the SparkConf reflects that and is not somehow set to yarn-client 2. You may not be getting any resources from YARN at all, so no executors, so no receiver running. That is why I asked the most basic question - Is it receiving data? That

Spark Kinesis Checkpointing/Processing Delay

2015-08-10 Thread Phil Kallos
Hi! Sorry if this is a repost. I'm using Spark + Kinesis ASL to process and persist stream data to ElasticSearch. For the most part it works nicely. There is a subtle issue I'm running into about how failures are handled. For example's sake, let's say I am processing a Kinesis stream that

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Michael Armbrust
What is the error you are getting? It would also be awesome if you could try with Spark 1.5 when the first preview comes out (hopefully early next week). On Mon, Aug 10, 2015 at 11:41 AM, Simeon Simeonov s...@swoop.com wrote: Michael, is there an example anywhere that demonstrates how this

Re: Spark inserting into parquet files with different schema

2015-08-10 Thread Simeon Simeonov
Michael, please, see http://apache-spark-user-list.1001560.n3.nabble.com/Schema-evolution-in-tables-tt23999.html The exception is java.lang.RuntimeException: Relation[ ... ] org.apache.spark.sql.parquet.ParquetRelation2@83a73a05 requires that the query in the SELECT clause of the INSERT

Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
Hello, Using Apache Spark 1.4.1 I'm unable to use lag or lead when making queries to a data frame and I'm trying to figure out if I just have a bad setup or if this is a bug. As for the exceptions I get: when using selectExpr() with a string as an argument, I get NoSuchElementException: key not

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
The rdd is indeed defined by mostly just the offsets / topic partitions. On Mon, Aug 10, 2015 at 3:24 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: You need to keep a certain number of rdds around for checkpointing -- that seems like a hefty expense to pay in order to achieve fault

Inquery about contributing codes

2015-08-10 Thread Hyukjin Kwon
Dear Sir / Madam, I have a plan to contribute some code for passing filters to a datasource as physical planning. In more detail, I understand when we want to build up filter operations from data like Parquet (when actually reading and filtering HDFS blocks at first, not filtering in memory

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-10 Thread canan chen
Anyone know this ? Thanks On Fri, Aug 7, 2015 at 4:20 PM, canan chen ccn...@gmail.com wrote: Is there any reason that historyserver use another property for the event log dir ? Thanks

Re: Accessing S3 files with s3n://

2015-08-10 Thread Akshat Aranya
Hi Jerry, Akhil, Thanks for your help. With s3n, the entire file is downloaded even while just creating the RDD with sqlContext.read.parquet(). It seems like even just opening and closing the InputStream causes the entire data to get fetched. As it turned out, I was able to use s3a and avoid

Differences in loading data using spark datasource api and using jdbc

2015-08-10 Thread 李铖
Hi, everyone. I have a question about loading data using the spark datasource api versus using jdbc: which way is more effective?

Re: Spark Cassandra Connector issue

2015-08-10 Thread satish chandra j
HI All, I have tried the commands as mentioned below but it still errors dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar,/home/missingmerch/dse.jar,/home/missingmerch/spark-cassandra-connector-java_2.10-1.1.1.jar

Re: Differences in loading data using spark datasource api and using jdbc

2015-08-10 Thread satish chandra j
Hi, As I understand it, JDBC is meant for moderate volumes of data, but the Datasource API is a better option if the volume of data is larger. The Datasource API is not available in lower versions of Spark such as 1.2.0. Regards, Satish On Tue, Aug 11, 2015 at 8:53 AM, 李铖 lidali...@gmail.com wrote:

Writing a DataFrame as compressed JSON

2015-08-10 Thread sim
DataFrameReader.json() can handle gzipped JSONlines files automatically but there doesn't seem to be a way to get DataFrameWriter.json() to write compressed JSONlines files. Uncompressed JSONlines is very expensive from an I/O standpoint because field names are included in every record. Is
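One workaround, hedged: in the 1.4.x API, toJSON gives back an RDD[String] of JSON lines, and saveAsTextFile accepts a compression codec; the output path is a placeholder:

    import org.apache.hadoop.io.compress.GzipCodec

    // DataFrameWriter.json() has no compression switch, but the RDD route does.
    df.toJSON.saveAsTextFile("/output/events-json", classOf[GzipCodec])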

Re: How to programmatically create, submit and report on Spark jobs?

2015-08-10 Thread Ted Yu
For monitoring, please take a look at http://spark.apache.org/docs/latest/monitoring.html Especially REST API section. Cheers On Mon, Aug 10, 2015 at 8:33 AM, Ted Yu yuzhih...@gmail.com wrote: I found SPARK-3733 which was marked dup of SPARK-4924 which went to 1.4.0 FYI On Mon, Aug 10,

Re: stopping spark stream app

2015-08-10 Thread Tathagata Das
In general, it is a little risky to put long running stuff in a shutdown hook as it may delay shutdown of the process which may delay other things. That said, you could try it out. A better way to explicitly shutdown gracefully is to use an RPC to signal the driver process to start shutting down,

Re: Graceful shutdown for Spark Streaming

2015-08-10 Thread Tathagata Das
Note that this is true only from Spark 1.4 where the shutdown hooks were added. On Mon, Aug 10, 2015 at 12:12 PM, Michal Čizmazia mici...@gmail.com wrote: From logs, it seems that Spark Streaming does handle *kill -SIGINT* with graceful shutdown. Please could you confirm? Thanks! On 30

Re: stopping spark stream app

2015-08-10 Thread Shushant Arora
By RPC you mean a web service exposed on the driver which listens and sets some flag, and the driver checks that flag at the end of each batch and, if set, gracefully stops the application? On Tue, Aug 11, 2015 at 12:43 AM, Tathagata Das t...@databricks.com wrote: In general, it is a little risky to put

Re: stopping spark stream app

2015-08-10 Thread Tathagata Das
1. RPC can be done in many ways, and a web service is one of many ways. An even more hacky version can be the app polling a file in a file system: if the file exists, start shutting down. 2. No need to set a flag. When you get the signal from RPC, you can just call context.stop(stopGracefully =
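The file-polling version above, sketched out; the marker path is a placeholder and ssc is an existing StreamingContext:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val marker = new Path("/tmp/stop-my-streaming-app")   // placeholder marker file
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)

    ssc.start()
    // Poll for the marker instead of blocking forever in awaitTermination.
    while (!fs.exists(marker)) Thread.sleep(10000)
    // Let in-flight batches finish, then stop the streaming and Spark contexts.
    ssc.stop(stopSparkContext = true, stopGracefully = true)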

Re: How to fix OutOfMemoryError: GC overhead limit exceeded when using Spark Streaming checkpointing

2015-08-10 Thread Cody Koeninger
No, it's not like a given KafkaRDD object contains an array of messages that gets serialized with the object. Its compute method generates an iterator of messages as needed, by connecting to kafka. I don't know what was so hefty in your checkpoint directory, because you deleted it. My

Re: Is there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
By the way, if Hive is present in the Spark install, does it show up in the text when you start the spark shell? Any commands I can run to check if it exists? I didn't set up the spark machine that I use, so I don't know what's present or absent. Thanks, Jerry On Mon, Aug 10, 2015 at 2:38 PM,

Re: Spark Kinesis Checkpointing/Processing Delay

2015-08-10 Thread Tathagata Das
You are correct. The earlier Kinesis receiver (as of Spark 1.4) was not saving checkpoints correctly and was in general not reliable (even with WAL enabled). We have improved this in Spark 1.5 with updated Kinesis receiver, that keeps track of the Kinesis sequence numbers as part of the Spark

Re: Package Release Annoucement: Spark SQL on HBase Astro

2015-08-10 Thread Ted Yu
Yan / Bing: Mind taking a look at HBASE-14181 https://issues.apache.org/jira/browse/HBASE-14181 'Add Spark DataFrame DataSource to HBase-Spark Module' ? Thanks On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote: We are happy to announce the availability of the Spark

Re: collect() works, take() returns ImportError: No module named iter

2015-08-10 Thread Jon Gregg
We did have 2.7 on the driver, 2.6 on the edge nodes and figured that was the issue, so we've tried many combinations since then with all three of 2.6.6, 2.7.5, and Anaconda's 2.7.10 on each node with different PATHs and PYTHONPATHs each time. Every combination has produced the same error. We

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Dean Wampler
Following Hadoop conventions, Spark won't overwrite an existing directory. You need to provide a unique output path every time you run the program, or delete or rename the target directory before you run the job. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition
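A sketch of deleting the target ahead of the save via the Hadoop FileSystem API, so it works for HDFS paths as well as local ones; the output path and rdd are placeholders:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    val out = "hdfs://namenode:54310/weblogReadResult"   // placeholder output path
    val fs = FileSystem.get(new URI(out), sc.hadoopConfiguration)
    fs.delete(new Path(out), true)    // recursive; returns false if the path does not exist
    rdd.saveAsTextFile(out)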

Re: question about spark streaming

2015-08-10 Thread Dean Wampler
Have a look at the various versions of PairDStreamFunctions.updateStateByWindow ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions). It supports updating running state in memory. (You can persist the state to a database/files
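A minimal sketch of keeping running state per key with updateStateByKey from PairDStreamFunctions, here a running total per product; sales is an assumed DStream[(String, Double)] of (productId, price), and checkpointing must be enabled for stateful operations:

    ssc.checkpoint("/tmp/streaming-checkpoint")   // placeholder checkpoint dir

    val runningTotals = sales.updateStateByKey[Double] { (newPrices: Seq[Double], total: Option[Double]) =>
      Some(total.getOrElse(0.0) + newPrices.sum)
    }
    runningTotals.print()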

Spark Cassandra Connector issue

2015-08-10 Thread satish chandra j
HI All, Please help me to fix Spark Cassandra Connector issue, find the details below *Command:* dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar ///home/missingmerch/etl-0.0.1-SNAPSHOT.jar *Error:* WARN

Re: Spark Cassandra Connector issue

2015-08-10 Thread Dean Wampler
Add the other Cassandra dependencies (dse.jar, spark-cassandra-connect-java_2.10) to your --jars argument on the command line. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler

spark-kafka directAPI vs receivers based API

2015-08-10 Thread Mohit Durgapal
Hi All, I just wanted to know how the directAPI for spark streaming compares with the earlier receiver-based API. Has anyone used the directAPI-based approach in production, or is it still being used for POCs? Also, since I'm new to spark, could anyone share a starting point from where I could find a

Re: Spark Streaming Restart at scheduled intervals

2015-08-10 Thread Dean Wampler
org.apache.spark.streaming.twitter.TwitterInputDStream is a small class. You could write your own that lets you change the filters at run time. Then provide a mechanism in your app, like periodic polling of a database table or file for the list of filters. Dean Wampler, Ph.D. Author: Programming

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Yasemin Kaya
Thanks Dean, I am giving a unique output path and every time I also delete the directory before I run the job. 2015-08-10 15:30 GMT+03:00 Dean Wampler deanwamp...@gmail.com: Following Hadoop conventions, Spark won't overwrite an existing directory. You need to provide a unique output path

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Dean Wampler
So, just before running the job, if you run the HDFS command at a shell prompt: hdfs dfs -ls hdfs://172.31.42.10:54310/./weblogReadResult. Does it say the path doesn't exist? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly)

Re: multiple dependency jars using pyspark

2015-08-10 Thread ayan guha
Easiest way should be to add both jars in SPARK_CLASSPATH as a colon separated string. On 10 Aug 2015 06:20, Jonathan Haddad j...@jonhaddad.com wrote: I'm trying to write a simple job for Pyspark 1.4 migrating data from MySQL to Cassandra. I can work with either the MySQL JDBC jar or the

Spark Streaming Restart at scheduled intervals

2015-08-10 Thread Pankaj Narang
Hi All, I am creating spark twitter streaming connection in my app over long period of time. When I have some new keywords I need to add them to the spark streaming connection. I need to stop and start the current twitter streaming connection in this case. I have tried akka actor scheduling but

Re: Possible issue for Spark SQL/DataFrame

2015-08-10 Thread Akhil Das
Isn't it space-separated data? It is not comma (,) separated nor pipe (|) separated data. Thanks Best Regards On Mon, Aug 10, 2015 at 12:06 PM, Netwaver wanglong_...@163.com wrote: Hi Spark experts, I am now using Spark 1.4.1 and trying Spark SQL/DataFrame API with text

Re: SparkR -Graphx Connected components

2015-08-10 Thread smagadi
Thanks for the response Robin. Is this the same for both directed and undirected graphs? val edges = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(3L -> 4L, 4L -> 5L, 5L -> 3L) ++ Array(6L -> 0L, 5L -> 7L) val rawEdges = sc.parallelize(edges) val graph =
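On the directed/undirected part: as far as I recall, connectedComponents ignores edge direction, while stronglyConnectedComponents respects it. A sketch on the rawEdges defined above:

    import org.apache.spark.graphx.Graph

    val graph = Graph.fromEdgeTuples(rawEdges, defaultValue = 1)
    val cc  = graph.connectedComponents().vertices                      // direction ignored
    val scc = graph.stronglyConnectedComponents(numIter = 5).vertices   // direction respected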

Possible issue for Spark SQL/DataFrame

2015-08-10 Thread Netwaver
Hi Spark experts, I am now using Spark 1.4.1 and trying Spark SQL/DataFrame API with text file in below format id gender height 1 M 180 2 F 167 ... ... But I meet

SparkR -Graphx Cliques

2015-08-10 Thread smagadi
How to find the cliques using spark graphx ? a quick code snippet is appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-Graphx-Cliques-tp24191.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark on YARN

2015-08-10 Thread Jem Tucker
Hi, I have looked at the UI scheduler tab and it appears my new user was allocated fewer cores than my other user. Is there any way I can avoid this happening? Thanks, Jem On Sat, Aug 8, 2015 at 8:32 PM Shushant Arora shushantaror...@gmail.com wrote: which is the scheduler on your cluster.

Spark with GCS Connector - Rate limit error

2015-08-10 Thread Oren Shpigel
Hi, I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended here https://cloud.google.com/hadoop/google-cloud-storage-connector#benefits ), and get a lot of rate limit errors, as added below. The errors relate to temp files

Kinesis records are merged with out obvious way of separating them

2015-08-10 Thread raam
I am using spark 1.4.1 with connector 1.4.0. When I post events slowly and they are being picked up one by one, everything runs smoothly, but when the stream starts delivering batched records there is no obvious way to separate them. Am I missing something? How do I separate the records when they are

question about spark streaming

2015-08-10 Thread sequoiadb
hi guys, i have a question about spark streaming. There's an application that keeps sending transaction records into a spark stream at about 50k tps. Each record represents sales information including customer id / product id / time / price columns. The application is required to monitor the change

Differences in loading data

2015-08-10 Thread 李铖
What is the difference between loading data using jdbc and loading data using the spark data source api? Or the difference between loading data using mongo-hadoop and loading data using the native java driver? Which way is better?

Re: Spark Cassandra Connector issue

2015-08-10 Thread Dean Wampler
I don't know if DSE changed spark-submit, but you have to use a comma-separated list of jars to --jars. It probably looked for HelloWorld in the second one, the dse.jar file. Do this: dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars /home/missingmerch/

Re: spark-kafka directAPI vs receivers based API

2015-08-10 Thread Cody Koeninger
For direct stream questions: https://github.com/koeninger/kafka-exactly-once Yes, it is used in production. For general spark streaming question: http://spark.apache.org/docs/latest/streaming-programming-guide.html On Mon, Aug 10, 2015 at 7:51 AM, Mohit Durgapal durgapalmo...@gmail.com

Re: Spark Cassandra Connector issue

2015-08-10 Thread satish chandra j
Hi, Thanks for quick input, now I am getting class not found error *Command:* dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar ///home/missingmerch/dse.jar

Re: How to connect to spark remotely from java

2015-08-10 Thread Simon Elliston Ball
You don't connect to spark exactly. The spark client (running on your remote machine) submits jobs to the YARN cluster running on HDP. What you probably need is yarn-cluster or yarn-client with the yarn client configs as downloaded from the Ambari actions menu. Simon On 10 Aug 2015, at

How to connect to spark remotely from java

2015-08-10 Thread Zsombor Egyed
Hi! I want to know how I can connect to Hortonworks Spark from another machine. So there is a HDP 2.2 and I want to connect to this remotely via the java api. Do you have any suggestion? Thanks! Regards, -- *Egyed Zsombor * Junior Big Data Engineer Mobile: +36 70 320 65 81 |
