Re: Create multiple rows from elements in array on a single row

2015-06-08 Thread Jeetendra Gangele
Use mapToPair; at that point you can break up the key. On 8 June 2015 at 23:27, Bill Q bill.q@gmail.com wrote: Hi, I have an RDD with the following structure: row1: key: Seq[a, b]; value: value 1 row2: key: Seq[a, c, f]; value: value 2 Is there an efficient way to flatten the rows into the following? row1: key:

Re: SparkSQL nested dictionaries

2015-06-08 Thread Davies Liu
I think it works in Python:
```
df = sqlContext.createDataFrame([(1, {'a': 1})])
df.printSchema()
root
 |-- _1: long (nullable = true)
 |-- _2: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
df.select(df._2.getField('a')).show()
+-----+
|_2[a]|
```

Re: Spark Streaming Stuck After 10mins Issue...

2015-06-08 Thread EH
It turns out there is a bug in the code which makes an infinite loop some time after start. :)

Re: which database for gene alignment data ?

2015-06-08 Thread roni
Sorry for the delay. The files (called .bed files) have format like -
Chromosome  start   end     feature  score  strand
chr1        713776  714375  peak.1   599    +
chr1        752401  753000  peak.2   599    +
The mandatory fields are 1. chrom - The name of the chromosome (e.g. chr3, chrY,

Re: which database for gene alignment data ?

2015-06-08 Thread Frank Austin Nothaft
Hi Roni, We have a full suite of genomic feature parsers that can read BED, narrowPeak, GATK interval lists, and GTF/GFF into Spark RDDs in ADAM. Additionally, we have support for efficient overlap joins (query 3 in your email below). You can load the genomic features with

Create multiple rows from elements in array on a single row

2015-06-08 Thread Bill Q
Hi, I have an RDD with the following structure: row1: key: Seq[a, b]; value: value 1 row2: key: Seq[a, c, f]; value: value 2 Is there an efficient way to flatten the rows into the following? row1: key: a; value: value1 row2: key: a; value: value2 row3: key: b; value: value1 row4: key: c; value: value2 row5:
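A minimal sketch of the flattening asked about above, assuming an RDD[(Seq[String], String)] like the one described and an existing SparkContext sc; flatMap emits one (key, value) pair per element of the key sequence:
```
val rows = sc.parallelize(Seq(
  (Seq("a", "b"), "value 1"),
  (Seq("a", "c", "f"), "value 2")
))

// One output row per key element, each paired with the row's original value.
val flattened = rows.flatMap { case (keys, value) => keys.map(k => (k, value)) }

flattened.collect().foreach(println)
// (a,value 1), (b,value 1), (a,value 2), (c,value 2), (f,value 2)
```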

k-means for text mining in a streaming context

2015-06-08 Thread Ruslan Dautkhanov
Hello, would Feature Extraction and Transformation (https://spark.apache.org/docs/latest/mllib-feature-extraction.html) work in a streaming context? I want to extract text features and build k-means clusters in a streaming context to detect anomalies on a continuous text stream. Would that be possible?
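A hedged sketch of one way this could fit together: hash each batch of text into term-frequency vectors with HashingTF and feed them to StreamingKMeans. The socket source, feature size, and k are assumptions, not anything stated in the thread.
```
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.clustering.StreamingKMeans

val conf = new SparkConf().setAppName("StreamingTextKMeans")
val ssc = new StreamingContext(conf, Seconds(10))

// Hash each line of text into a fixed-size term-frequency vector.
val numFeatures = 1000
val hashingTF = new HashingTF(numFeatures)
val lines = ssc.socketTextStream("localhost", 9999)
val vectors = lines.map(line => hashingTF.transform(line.split(" ").toSeq))

// Update the cluster centers as each batch arrives; the distance to the nearest
// center can then serve as a crude anomaly score on the same stream.
val model = new StreamingKMeans()
  .setK(5)
  .setDecayFactor(1.0)
  .setRandomCenters(numFeatures, 0.0)   // dimension, initial center weight
model.trainOn(vectors)

ssc.start()
ssc.awaitTermination()
```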

Re: Spark 1.3.1 On Mesos Issues.

2015-06-08 Thread John Omernik
It appears this may be related: https://issues.apache.org/jira/browse/SPARK-1403. Granted, the NPE is in MapR's code, but having Spark switch its behavior (if that's what it is doing; I am not an expert here, just going by the comments) probably isn't good either. I guess the level

RDD of RDDs

2015-06-08 Thread ping yan
Hi, The problem I am looking at is as follows:
- I read in a log file of multiple users as an RDD
- I'd like to group the above RDD into multiple RDDs by userIds (the key)
- my processEachUser() function then takes in each RDD mapped to an individual user, and calls RDD.map or
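Nested RDDs are not supported, so a common workaround is to key the records by user and either process each user's records as a grouped collection or carve out one RDD per user with filter. A minimal sketch under assumed conditions (tab-separated "userId&lt;TAB&gt;event" lines and an existing sc; the real processEachUser logic is replaced by a simple count here):
```
val logs = sc.textFile("hdfs:///path/to/userlog")
val byUser = logs.map { line =>
  val Array(userId, event) = line.split("\t", 2)
  (userId, event)
}

// Option 1: stay distributed; each user's records arrive as an Iterable, not an RDD.
val perUserCounts = byUser.groupByKey().mapValues(events => events.size)

// Option 2: one RDD per user id (only sensible when the number of users is small).
val userIds = byUser.keys.distinct().collect()
val perUserRdds = userIds.map(id => id -> byUser.filter { case (uid, _) => uid == id }.values)
```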

Re: [Kafka-Spark-Consumer] Spark-Streaming Job Fails due to Futures timed out

2015-06-08 Thread Dibyendu Bhattacharya
Hi Snehal, are you running the latest Kafka consumer from github/spark-packages? If not, can you pull the latest changes? This low-level receiver will keep retrying if the underlying BlockManager gives an error. Do you see those retry cycles in the log? If yes, then there is an issue writing

Re: How does lineage get passed down in RDDs

2015-06-08 Thread maxdml
If I read the code correctly, in RDD.scala each RDD keeps track of its own dependencies (from Dependency.scala) and has methods to access its ancestors' dependencies, thus being able to recompute the lineage (see getNarrowAncestors() or getDependencies() in an RDD like UnionRDD). So it
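A small sketch of what this looks like from user code (assuming an existing SparkContext sc): every RDD exposes its direct parents through dependencies, and toDebugString renders the full recursive lineage.
```
val base = sc.parallelize(1 to 100)
val doubled = base.map(_ * 2)
val combined = doubled.union(base)

// Direct parent dependencies of the union (the mapped RDD and the base RDD).
combined.dependencies.foreach(dep => println(dep.rdd))

// The whole lineage tree, as Spark would recompute it after a failure.
println(combined.toDebugString)
```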

[Kafka-Spark-Consumer] Spark-Streaming Job Fails due to Futures timed out

2015-06-08 Thread Snehal Nagmote
All, I am using the Kafka Spark Consumer (https://github.com/dibbhatt/kafka-spark-consumer) in a Spark Streaming job. After the job runs for a few hours, all executors exit, yet I still see the application's status on the Spark UI as running. Does anyone know the cause of this exception and how to fix

spark eventLog and history server

2015-06-08 Thread Du Li
Event log is enabled in my spark streaming app. My code runs in standalone mode and the spark version is 1.3.1. I periodically stop and restart the streaming context by calling ssc.stop(). However, from the web UI, when clicking on a past job, it says the job is still in progress and does not

Running SparkPi ( or JavaWordCount) example fails with Job aborted due to stage failure: Task serialization failed

2015-06-08 Thread Elkhan Dadashov
Hello, Running the Spark examples fails on one machine but succeeds in a virtual machine with the exact same Spark and Java versions installed. The weird part is that it fails on one machine yet runs successfully on the VM. Did anyone face the same problem? Any solution tips? Thanks in advance. *Spark version*:

Re: Wired Problem: Task not serializable[Spark Streaming]

2015-06-08 Thread bit1...@163.com
Could someone help explain what leads to the Task not serializable issue? Thanks. bit1...@163.com From: bit1...@163.com Date: 2015-06-08 19:08 To: user Subject: Wired Problem: Task not serializable[Spark Streaming] Hi, With the following simple code, I got an exception that

Re: Wired Problem: Task not serializable[Spark Streaming]

2015-06-08 Thread Michael Albert
Note that in scala, return is a non-local return: https://tpolecat.github.io/2014/05/09/return.html So that return is *NOT* returning from the anonymous function, but attempting to return from the enclosing method, i.e., main. Which is running on the driver, not on the workers. So on the workers,
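A minimal sketch of the pitfall being described (the job itself is illustrative, not from the thread): the commented-out line shows the non-local return, and the alternative expresses the early exit without return.
```
import org.apache.spark.{SparkConf, SparkContext}

object ReturnPitfall {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReturnPitfall"))
    val rdd = sc.parallelize(1 to 10)

    // BAD: `return` inside the closure is a non-local return from main(),
    // implemented with an exception that only makes sense on the driver.
    // rdd.foreach { x => if (x > 5) return }

    // OK: express the early exit without `return`.
    rdd.filter(_ <= 5).foreach(println)

    sc.stop()
  }
}
```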

Re: Running SparkSql against Hive tables

2015-06-08 Thread James Pirz
Thanks for the help! I am actually trying to use Spark SQL to run queries against tables that I've defined in Hive. I follow these steps: - I start hiveserver2, and in Spark I start Spark's Thrift server with: $SPARK_HOME/sbin/start-thriftserver.sh --master spark://spark-master-node-ip:7077 - and I

Re: Running SparkSql against Hive tables

2015-06-08 Thread Cheng Lian
On 6/9/15 8:42 AM, James Pirz wrote: Thanks for the help! I am actually trying to use Spark SQL to run queries against tables that I've defined in Hive. I follow these steps: - I start hiveserver2, and in Spark I start Spark's Thrift server with: $SPARK_HOME/sbin/start-thriftserver.sh --master

Re: [Kafka-Spark-Consumer] Spark-Streaming Job Fails due to Futures timed out

2015-06-08 Thread Dibyendu Bhattacharya
Seems to be related to this JIRA : https://issues.apache.org/jira/browse/SPARK-3612 ? On Tue, Jun 9, 2015 at 7:39 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi Snehal Are you running the latest kafka consumer from github/spark-packages ? If not can you take the latest

Spark SQL with Thrift Server is very very slow and finally failing

2015-06-08 Thread Sourav Mazumder
Hi, I am trying to run SQL from a JDBC driver using Spark's Thrift Server. I'm doing a join between a Hive table of around 100 GB and another Hive table of about 10 KB, with a filter on a particular column. The query takes more than 45 minutes and then I get ExecutorLostFailure. That is

Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-08 Thread amit tewari
Hi Dear Spark Users, I am very new to Spark/Scala. I am using Datastax (4.7 / Spark 1.2.1) and struggling with the following error/issue. I have already tried options like import org.apache.spark.SparkContext._ or the explicit import org.apache.spark.SparkContext.rddToPairRDDFunctions, but the error is not resolved.

Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-08 Thread Ted Yu
join is an operation on DataFrame. You can call sc.createDataFrame(myRDD) to obtain a DataFrame, where sc is a sqlContext. Cheers On Mon, Jun 8, 2015 at 9:44 PM, amit tewari amittewar...@gmail.com wrote: Hi Dear Spark Users I am very new to Spark/Scala. Am using Datastax (4.7/Spark 1.2.1) and

Spark compilation issue on intellij

2015-06-08 Thread canan chen
Maybe someone has asked this question before. I have this compilation issue when compiling Spark SQL. I found a couple of posts on Stack Overflow, but they didn't work for me. Does anyone have experience with this? Thanks. http://stackoverflow.com/questions/26788367/quasiquotes-in-intellij-14

Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-08 Thread Mark Hamstra
Correct; and PairRDDFunctions#join does still exist in versions of Spark that do have DataFrame, so you don't necessarily have to use DataFrame to do this even then (although there are advantages to using the DataFrame approach.) Your basic problem is that you have an RDD of tuples, where each
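A minimal sketch of the reshaping this points toward (field names and sample data are assumptions): join needs an RDD of two-element (key, value) tuples, and a three-element tuple gets no PairRDDFunctions implicit, hence "value join is not a member". Mapping each 3-tuple into (key, (field1, field2)) makes join compile.
```
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on Spark 1.2)

val left  = sc.parallelize(Seq((("k1", "k2"), "a", "b")))
val right = sc.parallelize(Seq((("k1", "k2"), "c", "d")))

// Reshape the 3-tuples into (key, value) pairs so PairRDDFunctions applies.
val leftPairs  = left.map  { case (key, f1, f2) => (key, (f1, f2)) }
val rightPairs = right.map { case (key, f1, f2) => (key, (f1, f2)) }

// Both sides are now RDD[((String, String), (String, String))], so join compiles.
val joined = leftPairs.join(rightPairs)
joined.collect().foreach(println)
```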

Spark Python with SequenceFile containing numpy deserialized data in str form

2015-06-08 Thread Sam Stoelinga
Hi all, I'm storing an RDD as a SequenceFile with the following content: key = filename (string), value = Python str from numpy.savez (not unicode). In order to make sure the whole numpy array gets stored, I have to first serialize it with: def serialize_numpy_array(numpy_array): output = io.BytesIO()

ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-08 Thread Dong Lei
Hi spark-users, I'm using spark-submit to submit multiple jars and files (all in HDFS) to run a job, with the following command:
spark-submit --class myClass --master spark://localhost:7077/ --deploy-mode cluster --jars hdfs://localhost/1.jar, hdfs://localhost/2.jar --files

Re: Spark Python with SequenceFile containing numpy deserialized data in str form

2015-06-08 Thread Sam Stoelinga
Update: Using bytearray before storing to RDD is not a solution either. This happens when trying to read the RDD when the value was stored as python bytearray:
Traceback (most recent call last):
  File "/vagrant/python/kmeans.py", line 24, in <module>
    features =

Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-08 Thread amit tewari
Thanks, but Spark 1.2 doesn't yet have DataFrame I guess? Regards Amit On Tue, Jun 9, 2015 at 10:25 AM, Ted Yu yuzhih...@gmail.com wrote: join is operation of DataFrame You can call sc.createDataFrame(myRDD) to obtain DataFrame where sc is sqlContext Cheers On Mon, Jun 8, 2015 at 9:44

Re: Cassandra Submit

2015-06-08 Thread Yasemin Kaya
Thanks a lot Mohammed, Gerard and Yana. I can write to the table, but I still get an exception. It says: Exception in thread main java.io.IOException: Failed to open thrift connection to Cassandra at 127.0.0.1:9160. In the yaml file: rpc_address: localhost, rpc_port: 9160. And at project

Re: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-08 Thread Ted Yu
Which Spark release are you using ? Can you pastebin the stack trace w.r.t. ExecutorLostFailure ? Thanks On Mon, Jun 8, 2015 at 8:52 PM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi, I am trying to run a SQL form a JDBC driver using Spark's Thrift Server. I'm doing a join

Re: Spark Python with SequenceFile containing numpy deserialized data in str form

2015-06-08 Thread Sam Stoelinga
Update: I've worked around it by using saveAsPickleFile instead, which handles everything correctly; it stays in byte format. I noticed Python got messy with str and bytes being the same in Python 2.7, and I wonder whether Python 3 would have the same problem. I would still like to use a cross

RE: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-08 Thread Haopu Wang
Cheng, thanks for the response. Yes, I was using HiveContext.setConf() to set dfs.replication. However, I cannot change the value in Hadoop core-site.xml because that will change every HDFS file. I only want to change the replication factor of some specific files. -Original Message-

Good Spark consultants?

2015-06-08 Thread jakeheller
I was wondering if there were any consultants in high standing in the community. We are considering using Spark, and we'd love to have someone with a lot of experience help us get up to speed and implement a preexisting data pipeline to use Spark (and perhaps first help answer the question of

How to obtain ActorSystem and/or ActorFlowMaterializer in updateStateByKey

2015-06-08 Thread algermissen1971
Hi, I am writing some code inside an update function for updateStateByKey that flushes data to a remote system using akka-http. For the akka-http request I need an ActorSystem and an ActorFlowMaterializer. Can anyone share a pattern or insights that address the following questions: - Where and

Re: Optimisation advice for Avro-Parquet merge job

2015-06-08 Thread kiran lonikar
James, as I can see, there are three distinct parts to your program:
- for loop
- synchronized block
- final outputFrame.save statement
Can you do a separate timing measurement by putting a simple System.currentTimeMillis() around these blocks to know how much time they are taking, and then
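A rough sketch of that timing measurement (the labels and the wrapped blocks are placeholders, not the poster's actual code):
```
def timed[T](label: String)(block: => T): T = {
  val start = System.currentTimeMillis()
  val result = block
  println(s"$label took ${System.currentTimeMillis() - start} ms")
  result
}

// timed("for loop")           { /* build the input DataFrames */ }
// timed("synchronized block") { /* merge them */ }
// timed("outputFrame.save")   { /* outputFrame.save(outputPath) */ }
```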

Re: Driver crash at the end with InvocationTargetException when running SparkPi

2015-06-08 Thread Akhil Das
Can you look in your worker logs for a more detailed stack trace? If it's about winutils.exe, you can look at these links to get it resolved. - http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7 - https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Mon, Jun 8,

RE: Driver crash at the end with InvocationTargetException when running SparkPi

2015-06-08 Thread Dong Lei
Thanks Akhil so much! It turns out to be HADOOP_HOME not being set. Dong Lei From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Monday, June 8, 2015 3:12 PM To: Dong Lei Cc: user@spark.apache.org Subject: Re: Driver crash at the end with InvocationTargetException when running SparkPi Can you

Re: Cassandra Submit

2015-06-08 Thread Gerard Maas
? = ip address of your cassandra host On Mon, Jun 8, 2015 at 10:12 AM, Yasemin Kaya godo...@gmail.com wrote: Hi , How can I find spark.cassandra.connection.host? And what should I change ? Should I change cassandra.yaml ? Error says me *Exception in thread main java.io.IOException:

Re: hiveContext.sql NullPointerException

2015-06-08 Thread patcharee
Hi, Thanks for your guidelines. I will try it out. Btw, how do you know HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on the driver side? Where can I find the documentation? BR, Patcharee On 07. juni 2015 16:40, Cheng Lian wrote: Spark SQL supports Hive dynamic

Re: Cassandra Submit

2015-06-08 Thread Yasemin Kaya
Hi, how can I find spark.cassandra.connection.host? And what should I change? Should I change cassandra.yaml? The error says: Exception in thread main java.io.IOException: Failed to open native connection to Cassandra at {127.0.1.1}:9042. What should I add? SparkConf sparkConf = new
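A minimal sketch of where the connector setting goes (in Scala, while the question's snippet is Java; the host value is whatever your Cassandra node listens on, here assumed local):
```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraSubmit")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // assumed local Cassandra node
val sc = new SparkContext(conf)
```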

Re: Examples of flatMap in dataFrame

2015-06-08 Thread Ram Sriharsha
Hi, you are looking for the explode method (in the DataFrame API starting in 1.3, I believe): https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1002 Ram On Sun, Jun 7, 2015 at 9:22 PM, Dimp Bhat dimp201...@gmail.com wrote: Hi, I'm trying to write
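A small sketch of explode playing the flatMap role on a DataFrame (column names and sample data are assumptions; sc is an existing SparkContext):
```
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(Seq(("doc1", "a b c"), ("doc2", "d e"))).toDF("id", "text")

// One output row per word: the String => Seq[String] function is the flatMap step.
val words = df.explode("text", "word") { text: String => text.split(" ").toSeq }
words.show()
```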

Re: spark ssh to slave

2015-06-08 Thread Akhil Das
Strange. You can manually start it by logging in to the Worker machine and then issuing this command: sbin/start-slave.sh 1 spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT Thanks Best Regards On Mon, Jun 8, 2015 at 3:44 PM, James King jakwebin...@gmail.com wrote: Thanks Akhil, yes that works fine

Re: Column operation on Spark RDDs.

2015-06-08 Thread lonikar
Two simple suggestions:
1. No need to call zipWithIndex twice. Use the earlier RDD dt.
2. Replace zipWithIndex with zipWithUniqueId, which does not trigger a Spark job.
Below is your code with the above changes:
var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt =

path to hdfs

2015-06-08 Thread Pa Rö
hello, i submit my spark job with the following parameters:
./spark-1.1.0-bin-hadoop2.4/bin/spark-submit \
  --class mgm.tp.bigdata.ma_spark.SparkMain \
  --master spark://quickstart.cloudera:7077 \
  ma-spark.jar \
  1000
and get the following exception: java.io.IOException: Mkdirs failed to

Re: path to hdfs

2015-06-08 Thread Nirmal Fernando
The HDFS path should be something like: hdfs://127.0.0.1:8020/user/cloudera/inputs/ On Mon, Jun 8, 2015 at 4:15 PM, Pa Rö paul.roewer1...@googlemail.com wrote: hello, i submit my spark job with the following parameters: ./spark-1.1.0-bin-hadoop2.4/bin/spark-submit \ --class

Re: Error in using saveAsParquetFile

2015-06-08 Thread Jeetendra Gangele
When are you loading these Parquet files? Can you please share the code where you are passing the Parquet file to Spark? On 8 June 2015 at 16:39, Cheng Lian lian.cs@gmail.com wrote: Are you appending the joined DataFrame whose PolicyType is string to an existing Parquet file whose

Official Mllib API does not correspond to auto completion

2015-06-08 Thread Jean-Charles RISCH
Hi, I am playing with MLlib (Spark 1.3.1) and my auto-completion suggestions don't correspond to the official API. Here are my dependencies:
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "1.0.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" excludeAll(

Re: spark ssh to slave

2015-06-08 Thread Akhil Das
Can you do *ssh -v 192.168.1.16* from the Master machine and make sure it's able to log in without a password? Thanks Best Regards On Mon, Jun 8, 2015 at 2:51 PM, James King jakwebin...@gmail.com wrote: I have two hosts 192.168.1.15 (Master) and 192.168.1.16 (Worker) These two hosts have

Re: hiveContext.sql NullPointerException

2015-06-08 Thread Cheng Lian
On 6/8/15 4:02 PM, patcharee wrote: Hi, Thanks for your guidelines. I will try it out. Btw how do you know HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on driver side? Where can I find document? I'm afraid we don't state this explicitly on the SQL

Re: spark ssh to slave

2015-06-08 Thread James King
Thanks Akhil, yes that works fine; it just lets me straight in. On Mon, Jun 8, 2015 at 11:58 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you do *ssh -v 192.168.1.16* from the Master machine and make sure its able to login without password? Thanks Best Regards On Mon, Jun 8, 2015 at

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread Cheng Lian
For DataFrame, there are also transformations and actions. And transformations are also lazily evaluated. However, DataFrame transformations like filter(), select(), agg() return a DataFrame rather than an RDD. Other methods like show() and collect() are actions. Cheng On 6/8/15 1:33 PM,

Re: Error in using saveAsParquetFile

2015-06-08 Thread Cheng Lian
Are you appending the joined DataFrame whose PolicyType is string to an existing Parquet file whose PolicyType is int? The exception indicates that Parquet found a column with conflicting data types. Cheng On 6/8/15 5:29 PM, bipin wrote: Hi I get this error message when saving a table:

Re: Cassandra Submit

2015-06-08 Thread Yasemin Kaya
Hi, I run my project locally. How can I find the IP address of my Cassandra host? From cassandra.yaml, or somewhere else? yasemin 2015-06-08 11:27 GMT+03:00 Gerard Maas gerard.m...@gmail.com: ? = ip address of your cassandra host On Mon, Jun 8, 2015 at 10:12 AM, Yasemin Kaya godo...@gmail.com wrote: Hi

Jobs aborted due to EventLoggingListener Filesystem closed

2015-06-08 Thread igor.berman
I'm sometimes getting errors like the one below (Spark 1.3.1, history enabled to HDFS). I've found a few JIRAs but they seem to be resolved, e.g. https://issues.apache.org/jira/browse/SPARK-1475. Any ideas? 2015-06-08 08:33:06.426 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception

spark ssh to slave

2015-06-08 Thread James King
I have two hosts, 192.168.1.15 (Master) and 192.168.1.16 (Worker). These two hosts have exchanged public keys so they have free access to each other. But when I run spark home/sbin/start-all.sh from 192.168.1.15, I still get: 192.168.1.16: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

Error in using saveAsParquetFile

2015-06-08 Thread bipin
Hi I get this error message when saving a table:
parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: optional binary PolicyType (UTF8) != optional int32 PolicyType at

Re: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-08 Thread Cheng Lian
Then one possible workaround is to set dfs.replication in sc.hadoopConfiguration. However, this configuration is shared by all Spark jobs issued within the same application. Since different Spark jobs can be issued from different threads, you need to pay attention to synchronization. Cheng
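A short sketch of that workaround (the replication value, output path, and the DataFrame df being persisted are illustrative): set dfs.replication on sc.hadoopConfiguration just before the write and restore it afterwards, keeping in mind that the configuration is shared with any jobs other threads may be launching at the same time.
```
val hadoopConf = sc.hadoopConfiguration
val previous = hadoopConf.get("dfs.replication")

hadoopConf.set("dfs.replication", "2")
df.saveAsParquetFile("hdfs:///path/to/output")

// Restore the previous value so later jobs are not silently affected.
if (previous != null) hadoopConf.set("dfs.replication", previous)
```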

Re: path to hdfs

2015-06-08 Thread Jeetendra Gangele
Your HDFS path passed to the Spark job is incorrect. On 8 June 2015 at 16:24, Nirmal Fernando nir...@wso2.com wrote: HDFS path should be something like; hdfs:// 127.0.0.1:8020/user/cloudera/inputs/ On Mon, Jun 8, 2015 at 4:15 PM, Pa Rö paul.roewer1...@googlemail.com wrote: hello, i submit my spark

Re: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
Thanks, did that and now I am getting an out of memory error. But I am not sure where this occurs. It can't be on the Spark executor, as I have 28GB allocated to it. It is not the driver, because I run this locally and monitor it via jvisualvm. Unfortunately I can't JMX-monitor Hadoop. From the

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread ayan guha
I would think DF = RDD + schema + some additional methods. In fact, a DF object has a DF.rdd in it, so you can (if needed) convert DF to RDD really easily. On Mon, Jun 8, 2015 at 5:41 PM, kiran lonikar loni...@gmail.com wrote: Thanks. Can you point me to a place in the documentation of SQL programming

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
Thanks. Can you point me to a place in the SQL programming guide or the DataFrame scaladoc where these transformations and actions are grouped, as in the case of RDD? Also, could you tell me whether sqlContext.load and unionAll are transformations or actions... I answered a question on

How to decrease the time of storing block in memory

2015-06-08 Thread luohui20001
hi there, I am trying to decrease my app's running time on the worker node. I checked the log and found the most time-consuming part is below:
15/06/08 16:14:23 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.1 KB, free 353.3 MB)
15/06/08 16:14:42 INFO

SparkSQL nested dictionaries

2015-06-08 Thread mrm
Hi, is it possible to query a data structure that is a dictionary within a dictionary? I have a parquet file with a structure:
test
 | key1: {key_string: val_int}
 | key2: {key_string: val_int}
If I try to do: parquetFile.test -- Column&lt;test&gt;; parquetFile.test.key2 -- AttributeError:

Re: coGroup on RDDPojos

2015-06-08 Thread Daniel Darabos
I suggest you include your code and the error message! It's not even immediately clear what programming language you mean to ask about. On Mon, Jun 8, 2015 at 2:50 PM, elbehery elbeherymust...@gmail.com wrote: Hi, I have two datasets of customer types, and I would like to apply coGrouping on

spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Kostas Kougios
I am reading millions of XML files via val xmls = sc.binaryFiles(xmlDir). The operation runs fine locally, but on YARN it fails with:
client token: N/A
diagnostics: Application application_1433491939773_0012 failed 2 times due to ApplicationMaster for attempt

RE: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Ewan Leith
Try putting a * on the end of xmlDir, i.e. xmlDir = hdfs:///abc/def/* rather than xmlDir = hdfs://abc/def, and see what happens. I don't know why, but that appears to be more reliable for me with S3 as the filesystem. I'm also using binaryFiles, but I've tried running the same command while

Saving compressed textFiles from a DStream in Scala

2015-06-08 Thread Bob Corsaro
It looks like saveAsTextFiles doesn't support the compression parameter of RDD.saveAsTextFile. Is there a way to add the functionality in my client code without patching Spark? I tried making my own saveFunc function and calling DStream.foreachRDD but ran into trouble with invoking rddToFileName
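One hedged sketch of a client-side workaround (the streaming source, output prefix, and codec are assumptions; rddToFileName itself is private to Spark, so the per-batch path is built by hand inside foreachRDD):
```
import org.apache.hadoop.io.compress.GzipCodec

// assuming an existing StreamingContext `ssc`
val dstream = ssc.socketTextStream("localhost", 9999)

dstream.foreachRDD { (rdd, time) =>
  // Mirror the "prefix-<batch time>" naming that saveAsTextFiles produces.
  val path = s"hdfs:///output/words-${time.milliseconds}"
  rdd.saveAsTextFile(path, classOf[GzipCodec])
}
```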

Re: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
No luck I am afraid. After giving the namenode 16GB of RAM, I am still getting an out of memory exception, though a somewhat different one:
15/06/08 15:35:52 ERROR yarn.ApplicationMaster: User class threw exception: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded at

Re: Error in using saveAsParquetFile

2015-06-08 Thread Bipin Nag
Hi Jeetendra, Cheng, I am using the following code for joining:
val Bookings = sqlContext.load("/home/administrator/stageddata/Bookings")
val Customerdetails = sqlContext.load("/home/administrator/stageddata/Customerdetails")
val CD = Customerdetails.
  where($CreatedOn 2015-04-01 00:00:00.0).

Re: Jobs aborted due to EventLoggingListener Filesystem closed

2015-06-08 Thread igor.berman
For the sake of history: DON'T call System.exit within Spark code.

coGroup on RDDPojos

2015-06-08 Thread elbehery
Hi, I have two datasets of a customer type, and I would like to apply coGroup on them. I could not find an example of this on the website, and I have tried to apply it in my code, but the compiler complained. Any suggestions?

Transform Functions and Python Modules

2015-06-08 Thread John Omernik
I am learning more about Spark (and in this case Spark Streaming) and am getting that functions like dstream.map() take a function and apply it to each element of the RDD, which in turn returns a new RDD based on the original. That's cool for the simple map functions in the

unsubscribe

2015-06-08 Thread Ricardo Goncalves da Silva
unsubscribe Ricardo Goncalves da Silva Lead Data Scientist | Business Intelligence Systems Development Section - Innovation Projects | IDPB02 Av. Eng. Luis Carlos Berrini, 1.376 - 7º - 04571-000 - SP

FileOutputCommitter deadlock 1.3.1

2015-06-08 Thread Richard Marscher
Hi, we've been seeing occasional issues in production with the FileOutputCommitter reaching a deadlock situation. We are writing our data to S3 and currently have speculation enabled. What we see is that Spark gets a file-not-found error trying to access a temporary part file that it wrote

RE: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Ewan Leith
Can you do a simple sc.binaryFiles("hdfs:///path/to/files/*").count() in the spark-shell and verify that part works? Ewan -Original Message- From: Konstantinos Kougios [mailto:kostas.koug...@googlemail.com] Sent: 08 June 2015 15:40 To: Ewan Leith; user@spark.apache.org Subject: Re:

Re: unsubscribe

2015-06-08 Thread Ted Yu
Send email to user-unsubscr...@spark.apache.org Cheers 2015-06-08 7:50 GMT-07:00 Ricardo Goncalves da Silva ricardog.si...@telefonica.com: unsubscribe Ricardo Goncalves da Silva Lead Data Scientist |

Re: Cassandra Submit

2015-06-08 Thread Yana Kadiyska
Yes, whatever you put for listen_address in cassandra.yaml. Also, you should try to connect to your Cassandra cluster via bin/cqlsh to make sure you have connectivity before you try to make a connection via Spark. On Mon, Jun 8, 2015 at 4:43 AM, Yasemin Kaya godo...@gmail.com wrote: Hi, I

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread Cheng Lian
You may refer to DataFrame Scaladoc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame Methods listed in Language Integrated Queries and RDD Options can be viewed as transformations, and those listed in Actions are, of course, actions. As for

Re: spark timesout maybe due to binaryFiles() with more than 1 million files in HDFS

2015-06-08 Thread Konstantinos Kougios
It was giving the same error, which made me figure out it is the driver but the driver running on hadoop - not the local one. So I did --conf spark.driver.memory=8g and now it is processing the files! Cheers On 08/06/15 15:52, Ewan Leith wrote: Can you do a simple

Re: Error in using saveAsParquetFile

2015-06-08 Thread Cheng Lian
I suspect that Bookings and Customerdetails both have a PolicyType field, one is string and the other is an int. Cheng On 6/8/15 9:15 PM, Bipin Nag wrote: Hi Jeetendra, Cheng I am using following code for joining val Bookings = sqlContext.load(/home/administrator/stageddata/Bookings) val
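If that is the diagnosis, one sketch of a fix (column and DataFrame names follow the thread; the target type, and the assumption that withColumn replaces the existing column on the Spark version in use, are mine) is to cast one side before the join and write the result to a fresh directory, since appending to a Parquet directory whose footer already records a different type for the column will keep raising the ParquetDecodingException:
```
// Cast the string PolicyType to int (or cast the other side to string instead).
val BookingsFixed = Bookings.withColumn("PolicyType", Bookings("PolicyType").cast("int"))
```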

Re: columnar structure of RDDs from Parquet or ORC files

2015-06-08 Thread kiran lonikar
Hi Cheng, Ayan, Thanks for the answers. I like the rule of thumb. I cursorily went through the DataFrame, SQLContext and sql.execution.basicOperators.scala code. It is apparent that these functions are lazily evaluated. The SQLContext.load functions are similar to SparkContext.textFile kind of

Re: Optimisation advice for Avro-Parquet merge job

2015-06-08 Thread kiran lonikar
It turns out my assumption that load and unionAll are blocking is not correct. They are transformations. So instead of running only the load and unionAll in the run() methods, I think you will have to save the intermediate dfInput[i] to temp (parquet) files (possibly to an in-memory DFS like