Re: How to create Spark DataFrame using custom Hadoop InputFormat?

2015-07-31 Thread Umesh Kacha
Hi, thanks. Void works; I use the same custom format in Hive and it works with Void as the key. Please share an example, if you have one, of creating a DataFrame using a custom Hadoop format. On Aug 1, 2015 2:07 AM, Ted Yu yuzhih...@gmail.com wrote: I don't think using Void class is the right choice - it is not even a
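A minimal Scala sketch (Spark 1.x) of going from a custom Hadoop mapred InputFormat to a DataFrame, since the thread asks for one. MyInputFormat and MyRecordWritable are the classes from this thread; the assumption that the record's payload is reachable via toString is mine.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Load with the custom mapred InputFormat; Void as the key class, per this thread.
    val rdd = sc.hadoopFile("hdfs://tmp/data/myformat.xyz",
      classOf[MyInputFormat], classOf[Void], classOf[MyRecordWritable])

    // Copy each record's payload out of the (reused) Writable, then build a
    // one-column DataFrame.
    val df = rdd.map { case (_, record) => Tuple1(record.toString) }.toDF("value")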

RandomForest in Pyspark (version 1.4.1)

2015-07-31 Thread SK
Hi, I tried to develop a RandomForest model for my data in PySpark as follows: rf_model = RandomForest.trainClassifier(train_idf, 2, {}, numTrees=15, seed=144) print "RF: Num trees = %d, Num nodes = %d\n" % (rf_model.numTrees(), rf_model.totalNumNodes()) pred_label =
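A rough Scala (MLlib 1.4) sketch of the same call, assuming train_idf is an RDD[LabeledPoint]; the extra positional arguments are the defaults the Python call relies on.

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.rdd.RDD

    def train(trainIdf: RDD[LabeledPoint]) = {
      val model = RandomForest.trainClassifier(
        trainIdf,
        2,               // numClasses
        Map[Int, Int](), // categoricalFeaturesInfo
        15,              // numTrees
        "auto",          // featureSubsetStrategy
        "gini",          // impurity
        4,               // maxDepth
        32,              // maxBins
        144)             // seed
      println(s"RF: Num trees = ${model.numTrees}, Num nodes = ${model.totalNumNodes}")
      model
    }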

Re: Big Integer number in Spark

2015-07-31 Thread ssingal05
Do you notice how you are making a list of *Int*s? input: org.apache.spark.rdd.RDD[*Int*] = ParallelCollectionRDD[0] at parallelize at <console>:21 And these are also being mapped to more *Int*s: result: org.apache.spark.rdd.RDD[*Int*] = MapPartitionsRDD[1] at map at <console>:23 Generally,
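If the original poster genuinely needs big integers, a minimal sketch (values are placeholders) that lifts the RDD to BigInt before the arithmetic, so nothing wraps around:

    // Int arithmetic would silently overflow; BigInt grows as needed.
    val input = sc.parallelize(1 to 100)
    val result = input.map(i => BigInt(i) * BigInt("12345678901234567890"))
    result.take(3).foreach(println)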

[POWERED BY] Please add Typesafe to the list of organizations

2015-07-31 Thread Dean Wampler
Typesafe (http://typesafe.com). We provide commercial support for Spark on Mesos and Mesosphere DCOS. We contribute to Spark's Mesos integration and Spark Streaming enhancements. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly)

Re: setting fs.umask in pyspark

2015-07-31 Thread aesilberstein
Found an approach: sc._jsc.hadoopConfiguration().set(key, value)
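The Scala equivalent, for reference; the property name is an assumption about what key stands for here (fs.permissions.umask-mode is the Hadoop 2.x umask setting):

    // Same effect without going through the Py4J gateway object (_jsc).
    sc.hadoopConfiguration.set("fs.permissions.umask-mode", "022")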

Re: Issues with JavaRDD.subtract(JavaRDD) method in local vs. cluster mode

2015-07-31 Thread Sebastian Kalix
Thanks for the quick reply. I will be unable to collect more data until Monday, but I will update the thread accordingly then. I am using Spark 1.4.0. Were there any related issues reported? I wasn't able to find any, but I may have overlooked something. I have also updated the original

Re: Issues with JavaRDD.subtract(JavaRDD) method in local vs. cluster mode

2015-07-31 Thread Ted Yu
Can you call collect() and log the output to get more of a clue about what is left? Which Spark release are you using? Cheers On Fri, Jul 31, 2015 at 9:01 AM, Warfish sebastian.ka...@gmail.com wrote: Hi everyone, I have been working with Spark for a little while now and have encountered a strange problem that

RE: How to register array class with Kryo in spark-defaults.conf

2015-07-31 Thread Wang, Ningjun (LNG-NPV)
Does anybody have any idea how to solve this problem? Ningjun From: Wang, Ningjun (LNG-NPV) Sent: Thursday, July 30, 2015 11:06 AM To: user@spark.apache.org Subject: How to register array class with Kryo in spark-defaults.conf I register my class with Kryo in spark-defaults.conf as follows

Issues with JavaRDD.subtract(JavaRDD) method in local vs. cluster mode

2015-07-31 Thread Warfish
Hi everyone, I have been working with Spark for a little while now and have encountered a strange problem that gives me headaches, which has to do with the JavaRDD.subtract method. Consider the following piece of code: public static void main(String[] args) { //context is of type
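For readers hitting the same symptom: subtract matches elements by equals/hashCode, and a frequent cause of local-versus-cluster differences is an element class that defines neither, so the deserialized copies on the cluster never compare equal and nothing gets subtracted. A small Scala sketch with a hypothetical Record class:

    // A case class gets equals/hashCode for free; a hand-rolled POJO must
    // override both, or subtract silently removes nothing in cluster mode.
    case class Record(id: Long, name: String)

    val a = sc.parallelize(Seq(Record(1, "x"), Record(2, "y")))
    val b = sc.parallelize(Seq(Record(2, "y")))
    a.subtract(b).collect().foreach(println) // Record(1,x) in both modes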

Setting a stage timeout

2015-07-31 Thread William Kinney
Hi, I had a job that got stuck on yarn due to https://issues.apache.org/jira/browse/SPARK-6954 It never exited properly. Is there a way to set a timeout for a stage or all stages?

Re: SparkLauncher not notified about finished job - hangs infinitely.

2015-07-31 Thread Elkhan Dadashov
Nope, the output stream of that subprocess is read via spark.getInputStream(). According to the Oracle doc (https://docs.oracle.com/javase/8/docs/api/java/lang/Process.html#getOutputStream--): public abstract InputStream getInputStream()

Re: How to control Spark Executors from getting Lost when using YARN client mode?

2015-07-31 Thread Umesh Kacha
Hi, thanks for the response. It looks like the YARN container is getting killed, but I don't know why; I see a shuffle metafetchexception as mentioned in the following SO link. I have enough memory: 8 nodes, 8 cores, 30 GB memory each. And because of this metafetchexception YARN is killing the container running
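Not a diagnosis, but the usual first knob when YARN kills containers for exceeding memory limits is the off-heap overhead allowance; a sketch with illustrative values:

    import org.apache.spark.SparkConf

    // Spark 1.x property names; leave headroom between the executor heap and
    // the 30 GB available per node.
    val conf = new SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "2048") // in MB
      .set("spark.executor.memory", "24g")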

Re: How to add multiple sequence files from HDFS to a Spark Context to do Batch processing?

2015-07-31 Thread Marcelo Vanzin
file can be a directory (look at all children) or even a glob (/path/*.ext, for example). On Fri, Jul 31, 2015 at 11:35 AM, swetha swethakasire...@gmail.com wrote: Hi, How to add multiple sequence files from HDFS to a Spark Context to do Batch processing? I have something like the following
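A Scala sketch of both variants; the paths and the LongWritable/Text key/value classes are placeholders for whatever the sequence files actually contain:

    import org.apache.hadoop.io.{LongWritable, Text}

    // A glob over sibling directories...
    val globbed = sc.sequenceFile("/data/2015-07-*/part-*",
      classOf[LongWritable], classOf[Text])

    // ...or a comma-separated list of paths; Hadoop's FileInputFormat splits on commas.
    val listed = sc.sequenceFile("/data/a,/data/b,/data/c",
      classOf[LongWritable], classOf[Text])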

Re: How to create Spark DataFrame using custom Hadoop InputFormat?

2015-07-31 Thread Ted Yu
Can you pastebin the complete stack trace? If you can show the skeleton of MyInputFormat and MyRecordWritable, that would provide additional information as well. Cheers On Fri, Jul 31, 2015 at 11:24 AM, unk1102 umesh.ka...@gmail.com wrote: Hi, I have my own custom Hadoop InputFormat which I

Re: SparkLauncher not notified about finished job - hangs infinitely.

2015-07-31 Thread Elkhan Dadashov
Hi Tomasz, *Answer to your 1st question*: Clear/read the error (spark.getErrorStream()) and output (spark.getInputStream()) stream buffers before you call spark.waitFor(); it would be better to clear/read them with two different threads. Then it should work fine. As the Spark job is launched as
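A Scala sketch of that advice; the app resource, main class, and master are placeholders:

    import org.apache.spark.launcher.SparkLauncher

    val spark = new SparkLauncher()
      .setAppResource("/path/to/app.jar")
      .setMainClass("com.example.Main")
      .setMaster("yarn-cluster")
      .launch()

    // Drain stdout and stderr in separate threads so the child process can
    // never block on a full pipe buffer before waitFor() returns.
    def drain(in: java.io.InputStream, tag: String): Thread = {
      val t = new Thread(new Runnable {
        def run(): Unit =
          scala.io.Source.fromInputStream(in).getLines().foreach(l => println(s"[$tag] $l"))
      })
      t.setDaemon(true); t.start(); t
    }

    drain(spark.getInputStream, "stdout") // the child's stdout
    drain(spark.getErrorStream, "stderr") // the child's stderr
    val exitCode = spark.waitFor()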

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Sean Owen
If you've set the checkpoint dir, it seems like indeed the intent is to use a default checkpoint interval in DStream: private[streaming] def initialize(time: Time) { ... // Set the checkpoint interval to be slideDuration or 10 seconds, which ever is larger if (mustCheckpoint
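If the default of max(slideDuration, 10 seconds) is not what's wanted, the interval can also be set explicitly per DStream; a sketch, where the interval must be a multiple of the stream's slide duration:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // A common rule of thumb is 5-10x the batch interval.
    def checkpointEvery50s[T](stream: DStream[T]): DStream[T] =
      stream.checkpoint(Seconds(50))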

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
It looks like there's an issue with the 'Parameters' POJO I'm using within my driver program. For some reason it needs to be serializable, which is odd. java.io.NotSerializableException: com.kona.consumer.kafka.spark.Parameters Giving it another whirl, though having to make it serializable
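The two usual ways out when a driver-side holder like this gets pulled into a checkpointed closure, sketched with hypothetical fields:

    // Option 1: make the holder genuinely serializable; a case class of plain
    // values already is. Option 2: mark members that cannot be serialized as
    // @transient and rebuild them lazily on the executor side.
    case class Parameters(topics: String, batchSeconds: Int, checkpointDir: String) {
      @transient lazy val logger = org.slf4j.LoggerFactory.getLogger(getClass)
    }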

Re: How to create Spark DataFrame using custom Hadoop InputFormat?

2015-07-31 Thread Umesh Kacha
Hi Ted, thanks much for the reply. I can't share code on a public forum. I have created a custom format by extending the Hadoop mapred InputFormat class, and likewise the RecordReader class. If you can help me with how to use it to build a DataFrame, that would be very helpful. On Sat, Aug 1, 2015 at 12:12 AM, Ted Yu

Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being created in those directories nor am I seeing the effects I'd expect from checkpointing. I'd expect any data that comes into
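For comparison, the getOrCreate pattern from the programming guide, sketched in Scala. The easy-to-miss part is that every dstream and output operation must be defined inside the factory function, otherwise the context recovered from the checkpoint will not have them; the checkpoint path is a placeholder.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/myapp"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpoint-demo")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... define all dstreams and output operations here ...
      ssc
    }

    // Recovers from the checkpoint if one exists, otherwise calls the factory.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()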

RE: How to register array class with Kryo in spark-defaults.conf

2015-07-31 Thread Wang, Ningjun (LNG-NPV)
Here is the definition of EsDoc: case class EsDoc(id: Long, isExample: Boolean, docSetIds: Array[String], randomId: Double, vector: String) extends Serializable Note that it is not EsDoc having a problem with registration. It is EsDoc[] (the array class of EsDoc) that has a problem with
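For what it's worth, a sketch of registering the array class both ways; the package name is hypothetical:

    import org.apache.spark.SparkConf

    // In code, registering the array class alongside EsDoc is straightforward:
    val conf = new SparkConf()
    conf.registerKryoClasses(Array(classOf[EsDoc], classOf[Array[EsDoc]]))

    // In spark-defaults.conf, an array class must appear under its JVM binary name:
    //   spark.kryo.classesToRegister=com.mycompany.EsDoc,[Lcom.mycompany.EsDoc;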

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
I'll check the log info message. Meanwhile, the code is basically: public class KafkaSparkStreamingDriver implements Serializable { .. SparkConf sparkConf = createSparkConf(appName, kahunaEnv); JavaStreamingContext jssc = params.isCheckpointed() ?

How to create Spark DataFrame using custom Hadoop InputFormat?

2015-07-31 Thread unk1102
Hi, I have my own custom Hadoop InputFormat which I need to use to create a DataFrame. I tried the following: JavaPairRDD<Void,MyRecordWritable> myFormatAsPairRdd = jsc.hadoopFile("hdfs://tmp/data/myformat.xyz", MyInputFormat.class, Void.class, MyRecordWritable.class); JavaRDD<MyRecordWritable>

How to add multiple sequence files from HDFS to a Spark Context to do Batch processing?

2015-07-31 Thread swetha
Hi, how do I add multiple sequence files from HDFS to a Spark Context to do batch processing? I have something like the following in my code. Do I have to add a comma-separated list of sequence file paths to the Spark Context? val data = if (args.length > 0 && args(0) != null) sc.sequenceFile(file,

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Cody Koeninger
Show us the relevant code On Fri, Jul 31, 2015 at 12:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being created in

Re: Spark SQL DataFrame: Nullable column and filtering

2015-07-31 Thread Martin Senne
Dear Michael, dear all, a minimal example is listed below. After some further analysis I could figure out that the problem is related to the *leftOuterJoinWithRemovalOfEqualColumn* method, as I use columns of both the left and right dataframes when doing the select on the joined table. /** *
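For readers without the helper, a sketch of the underlying pattern in plain DataFrame code; the frame and column names are placeholders:

    import org.apache.spark.sql.DataFrame

    // Qualify ambiguous columns through their parent frame, then keep only the
    // left side's copy of the join key.
    def joinKeepingLeftKey(left: DataFrame, right: DataFrame): DataFrame =
      left.join(right, left("id") === right("id"), "left_outer")
        .select(left("id"), left("name"), right("value"))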

How to increase parallelism of a Spark cluster?

2015-07-31 Thread Sujit Pal
Hello, I am trying to run a Spark job that hits an external webservice to get back some information. The cluster is 1 master + 4 workers, each worker has 60GB RAM and 4 CPUs. The external webservice is a standalone Solr server, and is accessed using code similar to that shown below. def
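A sketch of the two knobs that usually matter for a web-service-bound job like this: enough partitions that slow HTTP calls overlap, and one client per partition rather than per record. SolrClient and its query method are hypothetical stand-ins for the code in the post.

    import org.apache.spark.rdd.RDD

    def enrich(docs: RDD[String], solrUrl: String): RDD[(String, String)] =
      docs
        .repartition(64) // ~4x the 16 total cores, so network-blocked tasks overlap
        .mapPartitions { iter =>
          val client = new SolrClient(solrUrl) // hypothetical; one client per partition
          iter.map(doc => (doc, client.query(doc)))
        }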

Re: How to create Spark DataFrame using custom Hadoop InputFormat?

2015-07-31 Thread Ted Yu
I don't think using Void class is the right choice - it is not even a Writable. BTW, in the future, capture text output instead of an image. Thanks On Fri, Jul 31, 2015 at 12:35 PM, Umesh Kacha umesh.ka...@gmail.com wrote: Hi Ted thanks My key is always Void because my custom format file is non

What happens when you create more DStreams than nodes in the cluster?

2015-07-31 Thread Brandon White
Since one input DStream creates one receiver, and one receiver uses one executor / node, what happens if you create more DStreams than nodes in the cluster? Say I have 30 DStreams on a 15-node cluster. Will ~10 streams get assigned to ~10 executors / nodes while the other ~20 streams will be

Re: How to create Spark DataFrame using custom Hadoop InputFormat?

2015-07-31 Thread Ted Yu
Looking closer at the code you posted, the error was likely caused by the 3rd parameter: Void.class. It is supposed to be the class of the key. FYI On Fri, Jul 31, 2015 at 11:24 AM, unk1102 umesh.ka...@gmail.com wrote: Hi, I have my own custom Hadoop InputFormat which I need to use to

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-31 Thread Brandon White
Tathagata, could the bottleneck possibly be the number of executor nodes in our cluster? Since we are creating 500 DStreams based on 500 textfile directories, do we need at least 500 executors / nodes to be receivers for each one of the streams? On Tue, Jul 28, 2015 at 6:09 PM, Tathagata Das

how to convert a sequence of TimeStamp to a dataframe

2015-07-31 Thread Joanne Contact
Hi guys, I have struggled for a while on this seemingly simple thing: I have a sequence of timestamps and want to create a dataframe with 1 column of type Seq[java.sql.Timestamp]. //import collection.breakOut var seqTimestamp = scala.collection.Seq(listTs:_*) seqTimestamp: Seq[java.sql.Timestamp] =
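A minimal sketch for Spark 1.4: java.sql.Timestamp is a supported column type, so wrapping each value in a Tuple1 is enough for toDF.

    import java.sql.Timestamp
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val seqTimestamp: Seq[Timestamp] = Seq(new Timestamp(System.currentTimeMillis()))
    val df = sc.parallelize(seqTimestamp).map(Tuple1(_)).toDF("ts")
    df.printSchema() // root |-- ts: timestamp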

Re: Setting a stage timeout

2015-07-31 Thread Ted Yu
The referenced bug has been fixed in 1.4.0; are you able to upgrade? Cheers On Fri, Jul 31, 2015 at 10:01 AM, William Kinney william.kin...@gmail.com wrote: Hi, I had a job that got stuck on yarn due to https://issues.apache.org/jira/browse/SPARK-6954 It never exited properly. Is there

Re: SparkLauncher not notified about finished job - hangs infinitely.

2015-07-31 Thread Ted Yu
Tomasz: Please take a look at the Redirector class inside: ./launcher/src/test/java/org/apache/spark/launcher/SparkLauncherSuite.java FYI On Fri, Jul 31, 2015 at 10:02 AM, Elkhan Dadashov elkhan8...@gmail.com wrote: Hi Tomasz, *Answer to your 1st question*: Clear/read the error

Record Linkage in Spark

2015-07-31 Thread dihash
Hi folks, I would like to do record linkage (RL) in Spark, as the legacy approach using blocking queries is not working as the data gets huge. Please provide relevant links and information for doing so. Thanks

Re: What happens when you create more DStreams than nodes in the cluster?

2015-07-31 Thread Ashwin Giridharan
@Brandon Each node can host multiple executors. For example, in a 15-node cluster, if your NodeManager (in YARN) or equivalent (Mesos or Standalone) runs on each node, and each node has enough resources to host, say, 5 executors, then in total you can have 15*5 executors, and each of these

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-31 Thread Ashwin Giridharan
Thanks a lot @Das @Cody. I moved from receiver-based to direct stream and I can get the topics from the offset!! On Fri, Jul 31, 2015 at 4:41 PM, Brandon White bwwintheho...@gmail.com wrote: Tathagata, could the bottleneck possibly be the number of executor nodes in our cluster? Since we

Re: Problems with JobScheduler

2015-07-31 Thread Guillermo Ortiz
It doesn't make sense to me, because the other cluster processes all the data in less than a second. Anyway, I'm going to set that parameter. 2015-07-31 0:36 GMT+02:00 Tathagata Das t...@databricks.com: Yes, and that is indeed the problem. It is trying to process all the data in Kafka, and
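Assuming the parameter in question is the direct-stream rate limit (available since 1.3), a one-line sketch with an illustrative value:

    import org.apache.spark.SparkConf

    // Caps how much of the Kafka backlog each batch will attempt to process.
    val conf = new SparkConf()
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")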

Spark-Submit error

2015-07-31 Thread satish chandra j
Hi, I have submitted a Spark job with the options jars, class, and master as *local*, but I am getting the error below: *dse spark-submit spark error exception in thread main java.io.IOException: Invalid Request Exception(Why you have not logged in)* *Note: submitting from a DataStax Spark node* Please let me

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-31 Thread Tathagata Das
@Brandon, the file streams do not use receivers, so the bottleneck is not about executors per se. But there could be a couple of bottlenecks: 1. Every batch interval, the 500 dstreams are going to get directory listings from 500 directories, SEQUENTIALLY. So preparing the batch's RDDs and jobs can
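One mitigation for the per-batch overhead (it does not remove the sequential directory listing) is to union the file streams into a single DStream, so each batch schedules one job instead of 500; a sketch with placeholder paths, where ssc is the StreamingContext:

    val dirs = (1 to 500).map(i => s"hdfs:///streams/dir$i")
    val streams = dirs.map(dir => ssc.textFileStream(dir))
    val unioned = ssc.union(streams)
    unioned.foreachRDD(rdd => println(s"records this batch: ${rdd.count()}"))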

Re: how to convert a sequence of TimeStamp to a dataframe

2015-07-31 Thread Ted Yu
Please take a look at stringToTimestamp() in ./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala Representing timestamp with long should work. Cheers On Fri, Jul 31, 2015 at 2:50 PM, Joanne Contact joannenetw...@gmail.com wrote: Hi Guys, I have struggled for

Encryption on RDDs or in-memory/cache on Apache Spark

2015-07-31 Thread Matthew O'Reilly
Hi, I am currently working with the latest version of Apache Spark (1.4.1), pre-built package for Hadoop 2.6+. Is there any feature in Spark/Hadoop to encrypt RDDs or the in-memory cache (something similar to Altibase's HDB: http://altibase.com/in-memory-database-computing-solutions/security/)

Re: Problems with JobScheduler

2015-07-31 Thread Guillermo Ortiz
I detected the error. The final step is to index data in Elasticsearch, and the Elasticsearch in one of the clusters is overwhelmed and doesn't work correctly. I pointed the cluster that wasn't working at another ES and don't get any delay. Sorry, it wasn't related to Spark! 2015-07-31

Buffer Overflow exception

2015-07-31 Thread vinod kumar
Hi, I am getting a buffer overflow exception while using Spark via the Thrift server. May I know how to overcome this? Code: HqlConnection con = new HqlConnection("localhost", 10001, HiveServer.HiveServer2); con.Open(); HqlCommand createCommand = new HqlCommand(tablequery,

looking for help in using graphx aggregateMessages

2015-07-31 Thread man june
Dear list, I am new to Spark and GraphX, and I have some experience using Scala. I want to use GraphX to calculate some basic statistics on linked open data, which is basically a graph. Suppose the graph only contains one type of edge, directed from individuals to concepts, and the edge
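A minimal aggregateMessages sketch for the stated shape (edges directed from individuals to concepts), counting how many individuals point at each concept; the tiny example graph is made up:

    import org.apache.spark.graphx._

    val edges = sc.parallelize(Seq(
      Edge(1L, 100L, "typeOf"), // individual 1 -> concept 100
      Edge(2L, 100L, "typeOf"),
      Edge(3L, 101L, "typeOf")))
    val graph = Graph.fromEdges(edges, 0)

    val individualsPerConcept: VertexRDD[Int] =
      graph.aggregateMessages[Int](
        ctx => ctx.sendToDst(1), // each edge contributes one individual
        _ + _)                   // sum the counts at the concept vertex

    individualsPerConcept.collect().foreach { case (conceptId, n) =>
      println(s"concept $conceptId has $n individuals")
    }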