Spark on top of YARN Compression in iPython notebook

2015-05-10 Thread Bin Wang
I started an AWS cluster (1 master + 3 core) and downloaded the prebuilt Spark binary. I downloaded the latest Anaconda Python and started an IPython notebook server by running the command below: ipython notebook --port --profile nbserver --no-browser Then, I try to develop a Spark

guardian failed, shutting down system

2015-05-10 Thread ??????
When I run a Spark application for a long time: 15/05/11 12:50:47 INFO spark.SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:839 15/05/11 12:50:47 INFO storage.BlockManagerInfo: Removed input-0-1431345102800 on dn2.vi:45688 in memory (size: 27.6 KB, free: 524.2 MB) 15/05/11

RE: [SparkSQL] cannot filter by a DateType column

2015-05-10 Thread Haopu Wang
Sorry, I was using Spark 1.3.x. I cannot reproduce it in master. But should I still open a JIRA so that I can request it to be backported to the 1.3.x branch? Thanks again! From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Saturday, May 09,

Re: Nullable is true for the schema of parquet data

2015-05-10 Thread dsgriffin
Ran into this same issue. The only solution seems to be to coerce the DataFrame's schema back into the right state. It looks like you have to convert the DF to an RDD, which has an overhead, but otherwise this worked for me: val newDF = sqlContext.createDataFrame(origDF.rdd, new
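
A minimal Scala sketch of that workaround, assuming origDF is the DataFrame read back from Parquet and that every column really should be non-nullable:

    import org.apache.spark.sql.types.StructType

    // copy the existing schema, but flip every field to nullable = false
    val nonNullableSchema = StructType(origDF.schema.map(_.copy(nullable = false)))
    // rebuild the DataFrame from the underlying RDD[Row] with the corrected schema
    val newDF = sqlContext.createDataFrame(origDF.rdd, nonNullableSchema)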

Re: Does NullWritable can not be used in Spark?

2015-05-10 Thread donhoff_h
Hi, I am using Hadoop 2.5.2. My code is listed below. Besides that, I also ran some further tests and found the following interesting result: 1. I hit those exceptions when I set the key class to NullWritable, LongWritable, or IntWritable and used the

Re: Spark Cassandra connector number of Tasks

2015-05-10 Thread vijaypawnarkar
Looking for help with this. Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Cassandra-connector-number-of-Tasks-tp22820p22839.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Is it possible to set the akka specific properties (akka.extensions) in spark

2015-05-10 Thread Terry Hole
Hi all, I'd like to monitor Akka using Kamon, which needs akka.extensions to be set to a list like this in Typesafe config format: akka { extensions = [kamon.system.SystemMetrics, kamon.statsd.StatsD] } But I cannot find a way to do this; I have tried these: 1.

Re: Loading file content based on offsets into the memory

2015-05-10 Thread in4maniac
As far as I know, that is not possible. If the file is too big to load onto one node, what I would do is use an RDD.map() function instead to load the file into distributed memory and then filter the lines that are relevant to me. I am not sure how to read just part of a single file. Sorry I'm
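
A rough sketch of that load-then-filter approach, assuming line numbers (rather than byte offsets) are enough to identify the region of interest; the path and the range are placeholders:

    // read the whole file into distributed memory, then keep only the wanted range of lines
    val lines = sc.textFile("hdfs:///data/big_input.txt")
    val slice = lines.zipWithIndex()
      .filter { case (_, idx) => idx >= 100000L && idx < 200000L }  // placeholder line range
      .map { case (line, _) => line }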

RE: Spark streaming closes with Cassandra Connector

2015-05-10 Thread Evo Eftimov
And one other suggestion in relation to the connection pool line of enquiry: check whether your Cassandra service is configured to allow only one session per e.g. user. I think the error is generated inside the connection pool when it tries to initialize a connection after the first one. Sent

RE: Spark streaming closes with Cassandra Connector

2015-05-10 Thread Evo Eftimov
And in case you are running in local mode, try giving more cores to Spark, e.g. local[5] – a low number could be interfering with the tuning params, which you can try to play with as well – all this is in the context of how those params interact with the Connection Pool and what that pool is doing
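
For reference, a minimal sketch of what "more cores in local mode" looks like when you build the streaming context yourself; the app name and batch interval are arbitrary:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // local[5] gives the local-mode driver 5 threads, so receivers and processing don't starve each other
    val conf = new SparkConf().setAppName("cassandra-streaming-test").setMaster("local[5]")
    val ssc = new StreamingContext(conf, Seconds(5))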

spark streaming and computation

2015-05-10 Thread skippi
Assuming a web server access log shall be analyzed and the target of the computation shall be CSV files per time period, e.g. one per day containing the per-minute statistics and one per month containing the per-hour statistics. Incoming statistics are computed as discretized streams using a Spark streaming context.
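
A minimal sketch of the per-minute half of such a pipeline, assuming the access-log lines arrive over a socket and start with a timestamp whose first 16 characters identify the minute (both are placeholder assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("access-log-stats")
    val ssc = new StreamingContext(conf, Seconds(60))            // one batch per minute
    val lines = ssc.socketTextStream("loghost", 9999)            // placeholder source
    val perMinute = lines.map(l => (l.take(16), 1L)).reduceByKey(_ + _)
    perMinute.map { case (minute, count) => s"$minute,$count" }  // CSV-style line per minute
      .saveAsTextFiles("hdfs:///stats/minute")                   // placeholder output prefix
    ssc.start()
    ssc.awaitTermination()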

Event generation

2015-05-10 Thread anshu shukla
http://stackoverflow.com/questions/30149868/generate-events-tuples-using-csv-file-with-timestamps -- Thanks Regards, Anshu Shukla

Re: Spark streaming closes with Cassandra Connector

2015-05-10 Thread Gerard Maas
I'm familiar with the TableWriter code and that log only appears if the write actually succeeded. (See https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/writer/TableWriter.scala ) Thinking infrastructure, we see

Re: Spark streaming closes with Cassandra Connector

2015-05-10 Thread Gerard Maas
It successfully writes some data and fails afterwards, as if the host or connection goes down. Weird. Maybe you should post this question on the Spark-Cassandra connector group: https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user -kr, Gerard. On Sun, May 10, 2015

RE: Spark streaming closes with Cassandra Connector

2015-05-10 Thread Evo Eftimov
I think the message that it has written 2 rows is misleading. If you look further down you will see that it could not initialize a connection pool for Cassandra (presumably while trying to write the previously mentioned 2 rows). Another confirmation of this hypothesis is the phrase “error

Re: custom join using complex keys

2015-05-10 Thread ayan guha
This should work: base1 = sc.parallelize(setupRow(10),1) base2 = sc.parallelize(setupRow(10),1) df1 = ssc.createDataFrame(base1) df2 = ssc.createDataFrame(base2) df1.show() df2.show() df1.registerTempTable("df1") df2.registerTempTable("df2") j = ssc.sql("select df1.k1

RE: Spark streaming closes with Cassandra Connector

2015-05-10 Thread Evo Eftimov
Hmm, there is also a Connection Pool involved, and such things (especially while still rough around the edges) may behave erratically in a distributed multithreaded environment. Can you try foreachPartition and foreach together – this will create a slightly different multithreading execution
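
A rough sketch of the foreachPartition-plus-foreach shape being suggested; openSession() and writeRow() are hypothetical stand-ins for whatever Cassandra client calls are actually used:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { rows =>
        val session = openSession()                    // hypothetical: one connection per partition
        try {
          rows.foreach(row => writeRow(session, row))  // writeRow() is likewise a placeholder
        } finally {
          session.close()
        }
      }
    }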

Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread Peter Aberline
Hi, thanks for the quick response. No, I'm not using Streaming. Each DataFrame represents tabular data read from a CSV file. They have the same schema. There is also the option of appending each DF to the Parquet file, but then I can't maintain them as separate DFs when reading back in without
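
For completeness, a sketch of the append option being referred to, assuming a Spark version that has the DataFrameWriter API (1.4+); the path is a placeholder:

    import org.apache.spark.sql.SaveMode

    // append each CSV-derived DataFrame into one Parquet data set
    df.write.mode(SaveMode.Append).parquet("hdfs:///data/all_csvs.parquet")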

Re: Does NullWritable can not be used in Spark?

2015-05-10 Thread Ted Yu
Looking at ./core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala : * Load an RDD saved as a SequenceFile containing serialized objects, with NullWritable keys and * BytesWritable values that contain a serialized partition. This is still an experimental storage ... def
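
The scaladoc quoted above belongs to objectFile; a small sketch of the round trip it describes, with a placeholder path and element type:

    // write: produces SequenceFiles with NullWritable keys and BytesWritable values
    someRdd.saveAsObjectFile("hdfs:///tmp/objects")            // someRdd is any existing RDD[String]
    // read back, supplying the element type explicitly
    val restored = sc.objectFile[String]("hdfs:///tmp/objects")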

Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread ayan guha
How did you end up with thousands of DFs? Are you using streaming? In that case you can do foreachRDD and keep merging incoming RDDs into a single RDD and then save it through your own checkpoint mechanism. If not, please share your use case. On 11 May 2015 00:38, Peter Aberline
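
A rough sketch of the merge-inside-foreachRDD idea (only relevant in the streaming case; stream and the element type are placeholders):

    import org.apache.spark.rdd.RDD

    // keep a running union of everything seen so far; save it on your own schedule
    var merged: RDD[String] = sc.emptyRDD[String]
    stream.foreachRDD { rdd =>
      merged = merged.union(rdd)
      // a periodic save / checkpoint of `merged` would go here
    }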

Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread ayan guha
Hi, in that case read the entire folder as an RDD and give it some reasonable number of partitions. Best, Ayan On 11 May 2015 01:35, Peter Aberline peter.aberl...@gmail.com wrote: Hi Thanks for the quick response. No I'm not using Streaming. Each DataFrame represents tabular data read from a CSV
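
A minimal sketch of that suggestion, assuming the CSV files sit in a single directory; the path and partition counts are placeholders:

    // read every CSV in the folder as one RDD of lines, with an explicit minimum partition count
    val allLines = sc.textFile("hdfs:///data/csv_folder/*.csv", 64)
    // or rebalance afterwards if needed
    val balanced = allLines.repartition(32)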