Streaming problems running 24x7

2015-04-16 Thread Miquel
Hello, I'm having problems running a spark streaming job for more than a few hours (3 or 4). It begins working OK, but it degrades until failure. Some of the symptoms: - Consumed memory and CPU keep getting higher and higher, and finally some error is being thrown (java.lang.Exception: Could not

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I was running the spark shell and sql with the --jars option containing the paths when I got my error. I am not sure what the correct way to add jars is. I tried placing the jar inside the directory you said but still get the error. I will give the code you posted a try. Thanks. -- View this

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
And yet another way is to demultiplex at one point which will yield separate DStreams for each message type which you can then process in independent DAG pipelines in the following way: MessageType1DStream = MainDStream.filter(message type1) MessageType2DStream = MainDStream.filter(message
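
A minimal Scala sketch of the filtering approach described above (the Message type and field names are illustrative, not from the original thread):

    import org.apache.spark.streaming.dstream.DStream

    case class Message(msgType: String, payload: String)

    // mainDStream is assumed to be a DStream[Message] built elsewhere.
    def demultiplex(mainDStream: DStream[Message]): Unit = {
      val type1Stream = mainDStream.filter(_.msgType == "type1")
      val type2Stream = mainDStream.filter(_.msgType == "type2")

      // Each filtered DStream now drives its own independent DAG pipeline.
      type1Stream.foreachRDD { rdd => rdd.foreach(m => println(s"type1: ${m.payload}")) }
      type2Stream.foreachRDD { rdd => rdd.foreach(m => println(s"type2: ${m.payload}")) }
    }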

Re: How to do dispatching in Streaming?

2015-04-16 Thread Gerard Maas
From experience, I'd recommend using the dstream.foreachRDD method and doing the filtering within that context. Extending the example of TD, something like this: dstream.foreachRDD { rdd => rdd.cache() messageType.foreach (msgTyp => val selection = rdd.filter(msgTyp.match(_))
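
A slightly fleshed-out Scala sketch of the pattern Gerard describes, assuming a hypothetical Message type with a msgType field:

    import org.apache.spark.streaming.dstream.DStream

    case class Message(msgType: String, payload: String)

    // messageTypes and the equality check are illustrative assumptions.
    def dispatch(dstream: DStream[Message], messageTypes: Seq[String]): Unit = {
      dstream.foreachRDD { rdd =>
        rdd.cache()                       // reuse the batch across all filters
        messageTypes.foreach { msgType =>
          val selection = rdd.filter(_.msgType == msgType)
          selection.foreach(m => println(s"$msgType -> ${m.payload}"))
        }
        rdd.unpersist()                   // release the cached batch
      }
    }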

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
I am running the queries from spark-sql. I don't think it can communicate with the thrift server. Can you tell me how I should run the queries to make it work? -- View this message in context:

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread bipin
Looks like a good option. BTW v3.0 is around the corner. http://slick.typesafe.com/news/2015/04/02/slick-3.0.0-RC3-released.html Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22521.html Sent from the

Spark SQL query key/value in Map

2015-04-16 Thread jc.francisco
Hi, I'm new with both Cassandra and Spark and am experimenting with what Spark SQL can do as it will affect my Cassandra data model. What I need is a model that can accept arbitrary fields, similar to Postgres's Hstore. Right now, I'm trying out the map type in Cassandra but I'm getting the

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
Ooops – what does “more work” mean in a Parallel Programming paradigm and does it always translate into “inefficiency”? Here are a few laws of physics in this space: 1. More Work if done AT THE SAME time AND fully utilizes the cluster resources is a GOOD thing 2. More Work

Re: Passing Elastic Search Mappings in Spark Conf

2015-04-16 Thread Deepak Subhramanian
Thanks Nick. I understand that we can configure the index by creating the index with the mapping first. I thought it would be a good feature to add to es-hadoop / es-spark, as we could have the full mapping and code in a single place, especially for simple mappings on a particular field. It

Re: Streaming problems running 24x7

2015-04-16 Thread Akhil Das
I used to hit this issue when my processing time exceeds the batch duration. Here are a few workarounds: - Use storage level MEMORY_AND_DISK - Enable WAL and checkpointing The above two will slow things down a little bit. If you want low latency, what you can try is: - Use storage level as
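
For reference, a Scala sketch of how those knobs are typically wired up; the source, paths and batch interval below are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-24x7")
      // Write-ahead log for receiver-based sources (available since Spark 1.2)
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

    // Receiver storage level that can spill to disk instead of holding everything in memory.
    val lines = ssc.socketTextStream("host", 9999, StorageLevel.MEMORY_AND_DISK_SER)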

RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
Also you can have each message type in a different topic (needs to be arranged upstream from your Spark Streaming app ie in the publishing systems and the messaging brokers) and then for each topic you can have a dedicated instance of InputReceiverDStream which will be the start of a dedicated

Re: executor failed, cannot find compute-classpath.sh

2015-04-16 Thread TimMalt
Hi, has this issue been resolved? I am currently running into similar problems. I am using spark-1.3.0-bin-hadoop2.4 on Windows and Ubuntu. I have set up all paths on my Windows machine in an identical manner as on my Ubuntu server (using cygwin, so everything is somewhere under

Re: Save org.apache.spark.mllib.linalg.Matri to a file

2015-04-16 Thread Spico Florin
Thank you very much for your suggestions, Ignacio! I have posted my solution here: http://stackoverflow.com/questions/29649904/save-spark-org-apache-spark-mllib-linalg-matrix-to-a-file/29671193#29671193 Best regards, Florin On Wed, Apr 15, 2015 at 5:28 PM, Ignacio Blasco elnopin...@gmail.com

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-04-16 Thread Nathan McCarthy
Its JTDS 1.3.1; http://sourceforge.net/projects/jtds/files/jtds/1.3.1/ I put that jar in /tmp on the driver/machine I’m running spark shell from. Then I ran with ./bin/spark-shell --jars /tmp/jtds-1.3.1.jar --master yarn-client So I’m guessing that --jars doesn’t set the class path for the

[SQL] DROP TABLE should also uncache table

2015-04-16 Thread Arush Kharbanda
Hi, As per JIRA this issue is resolved, but I am still facing it. SPARK-2734 - DROP TABLE should also uncache table -- [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com ||

custom input format in spark

2015-04-16 Thread Shushant Arora
Hi, how do I specify a custom input format in spark and control isSplitable within a file? I need to read a file from HDFS; the file format is custom and the requirement is that a file should not be split when an executor node gets that partition of the input dir. Can anyone share a sample in java? Thanks

ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread rkrist
Hello guys, after upgrading spark to 1.3.0 (and performing the necessary code changes) an issue appeared making me unable to handle Date fields (java.sql.Date) with the Spark SQL module. An exception appears in the console when I try to execute an SQL query on a DataFrame (see below). When I tried to

Re: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread rkrist
...one additional note: implementation of org.apache.spark.sql.columnar.IntColumnStats is IMHO wrong. Small hint - what will be the resulting upper and lower values for column containing no data (empty RDD or null values in Int column across the whole RDD)? Shouldn't they be null? -- View

MLLib SVMWithSGD : java.lang.OutOfMemoryError: Java heap space

2015-04-16 Thread sarath
Hi, I'm trying to train an SVM on the KDD2010 dataset (available from libsvm). But I'm getting a java.lang.OutOfMemoryError: Java heap space error. The dataset is really sparse and has around 8 million data points and 20 million features. I'm using a cluster of 8 nodes (each with 8 cores and 64G RAM).

Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread mas
I have a big data file and I aim to create an index on the data. I want to partition the data based on a user defined function in Spark-GraphX (Scala). Further, I want to keep track of the node to which a particular data partition is sent and processed, so I could fetch the required data by accessing

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
How do you intend to fetch the required data - from within Spark or using an app / code / module outside Spark -Original Message- From: mas [mailto:mas.ha...@gmail.com] Sent: Thursday, April 16, 2015 4:08 PM To: user@spark.apache.org Subject: Data partitioning and node tracking in

Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Dear all, Here is an issue that gets me mad. I wrote a UserDefinedType in order to be able to store a custom type in a parquet file. In my code I just create a DataFrame with my custom data type and write it into a parquet file. When I run my code directly inside IDEA everything works like a

Re: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread MUHAMMAD AAMIR
Thanks a lot for the reply. Indeed it is useful, but to be more precise I have 3D data and want to index it using an octree. Thus I aim to build a two level indexing mechanism, i.e. first at the global level I want to partition and send the data to the nodes, then at node level I again want to use an octree to

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Sean Owen
This would be much, much faster if your set of IDs was simply a Set, and you passed that to a filter() call that just filtered in the docs that matched an ID in the set. On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: Does anybody have a solution for
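
A hedged Scala sketch of the Set-plus-filter idea; the Document type is illustrative, and broadcasting the Set is an optional extra for large ID sets rather than part of Sean's suggestion:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    case class Document(id: String, text: String)

    // Keep only the documents whose id appears in the given Set.
    def selectDocs(sc: SparkContext, docs: RDD[Document], ids: Set[String]): RDD[Document] = {
      val idSet = sc.broadcast(ids)                       // ship the Set once per executor
      docs.filter(doc => idSet.value.contains(doc.id))    // no shuffle, unlike a join
    }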

Spark on Windows

2015-04-16 Thread Arun Lists
We run Spark on Mac and Linux but also need to run it on Windows 8.1 and Windows Server. We ran into problems with the Scala 2.10 binary bundle for Spark 1.3.0 but managed to get it working. However, on Mac/Linux, we are on Scala 2.11.6 (we built Spark from the sources). On Windows, however

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
Does anybody have a solution for this? From: Wang, Ningjun (LNG-NPV) Sent: Tuesday, April 14, 2015 10:41 AM To: user@spark.apache.org Subject: How to join RDD keyValuePairs efficiently I have an RDD that contains millions of Document objects. Each document has an unique Id that is a string. I

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Ningjun, to speed up your current design you can do the following: 1. Partition the large doc RDD based on a hash function on the key, i.e. the docid 2. Persist the large dataset in memory to be available for subsequent queries without reloading and repartitioning for every search query 3.
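
A brief Scala sketch of steps 1 and 2, with an illustrative Document type and partition count:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    case class Document(id: String, text: String)

    def prepare(docs: RDD[(String, Document)]): RDD[(String, Document)] = {
      docs
        .partitionBy(new HashPartitioner(64))      // 1. partition on the docid key
        .persist(StorageLevel.MEMORY_ONLY)         // 2. keep it resident for repeated lookups
    }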

Re: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread MUHAMMAD AAMIR
I want to use Spark functions/APIs to do this task. My basic purpose is to index the data and divide and send it to multiple nodes. Then at the time of access I want to reach the right node and data partition. I don't have any clue how to do this. Thanks, On Thu, Apr 16, 2015 at 5:13 PM, Evo

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
Well you can use a [Key, Value] RDD and partition it based on hash function on the Key and even a specific number of partitions (and hence cluster nodes). This will a) index the data, b) divide it and send it to multiple nodes. Re your last requirement - in a cluster programming

RE: Data partitioning and node tracking in Spark-GraphX

2015-04-16 Thread Evo Eftimov
Well you can have a two level index structure, still without any need for physical cluster node awareness Level 1 Index is the previously described partitioned [K,V] RDD – this gets you to the value (RDD element) you need on the respective cluster node Level 2 Index – it will be built
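
A hedged Scala sketch of the two-level idea, using a plain Map as the per-partition (level 2) index; an octree or any other local structure could be built in the same mapPartitions call:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // Level 1: hash-partition the pair RDD so a key always lands on the same partition.
    // Level 2: build a local lookup structure inside each partition.
    def buildLocalIndexes(data: RDD[(String, Array[Double])]): RDD[Map[String, Array[Double]]] = {
      data
        .partitionBy(new HashPartitioner(32))
        .mapPartitions(iter => Iterator(iter.toMap), preservesPartitioning = true)
    }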

dataframe can not find fields after loading from hive

2015-04-16 Thread Cesar Flores
I have a data frame in which I load data from a hive table. And my issue is that the data frame is missing the columns that I need to query. For example: val newdataset = dataset.where(dataset("label") === 1) gives me an error like the following: ERROR yarn.ApplicationMaster: User class threw

Re: MLLib SVMWithSGD : java.lang.OutOfMemoryError: Java heap space

2015-04-16 Thread Akhil Das
Try increasing your driver memory. Thanks Best Regards On Thu, Apr 16, 2015 at 6:09 PM, sarath sarathkrishn...@gmail.com wrote: Hi, I'm trying to train an SVM on KDD2010 dataset (available from libsvm). But I'm getting java.lang.OutOfMemoryError: Java heap space error. The dataset is

Custom partioner

2015-04-16 Thread Jeetendra Gangele
Hi All, I have an RDD which has 1 million keys and each key is repeated for around 7000 values, so in total there will be around 1M*7K records in the RDD. Each key is created from zipWithIndex, so keys start from 0 to M-1. The problem with zipWithIndex is that it uses a Long for the key, which is 8 bytes. Can I

Re: custom input format in spark

2015-04-16 Thread Shushant Arora
Is it for spark? On Thu, Apr 16, 2015 at 10:05 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can simply override the isSplitable method in your custom inputformat class and make it return false. Here's a sample code snippet:

Re: Distinct is very slow

2015-04-16 Thread Akhil Das
Can you paste your complete code? Did you try repartitioning/increasing the level of parallelism to speed up the processing? Since you have 16 cores, and I'm assuming your 400k records isn't bigger than a 10G dataset. Thanks Best Regards On Thu, Apr 16, 2015 at 10:00 PM, Jeetendra Gangele

General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Hi, Is there a document/link that describes the general configuration settings to achieve maximum Spark performance while running on CDH5? In our environment, we made a lot of changes (and are still doing it) to get decent performance; otherwise our 6 node dev cluster with default configurations lags

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Akhil Das
You could try repartitioning your RDD using a custom partitioner (HashPartitioner etc) and caching the dataset into memory to speedup the joins. Thanks Best Regards On Tue, Apr 14, 2015 at 8:10 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I have an RDD that contains

Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
Hi All, I have the below code where distinct is running for a long time. blockingRdd is a combination of (Long, String) and it will have 400K records JavaPairRDD<Long,Integer> completeDataToprocess = blockingRdd.flatMapValues( new Function<String, Iterable<Integer>>(){ @Override public Iterable<Integer>

Re: Distinct is very slow

2015-04-16 Thread Akhil Das
Open the driver UI and see which stage is taking time; you can look at whether it's adding any GC time etc. Thanks Best Regards On Thu, Apr 16, 2015 at 9:56 PM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All I have below code whether distinct is running for more time. blockingRdd is the

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
I already checked and GC is taking 1 sec for each task. Is this too much? If yes, how do I avoid it? On 16 April 2015 at 21:58, Akhil Das ak...@sigmoidanalytics.com wrote: Open the driver ui and see which stage is taking time, you can look whether its adding any GC time etc. Thanks Best

Re: Super slow caching in 1.3?

2015-04-16 Thread Christian Perez
Hi Michael, Good question! We checked 1.2 and found that it is also slow caching the same flat parquet file. Caching other file formats of the same data was faster by up to a factor of ~2. Note that the parquet file was created in Impala but the other formats were written by Spark SQL. Cheers,

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
Well there are a number of performance tuning guidelines in dedicated sections of the spark documentation - have you read and applied them? Secondly, any performance problem within a distributed cluster environment has two aspects: 1. Infrastructure 2. App Algorithms You

Re: custom input format in spark

2015-04-16 Thread Akhil Das
You can simply override the isSplitable method in your custom inputformat class and make it return false. Here's a sample code snippet: http://stackoverflow.com/questions/17875277/reading-file-as-single-record-in-hadoop#answers-header Thanks Best Regards On Thu, Apr 16, 2015 at 4:18 PM,
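
The same idea sketched in Scala (the thread asked for Java, but the override has the same shape in either language):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.JobContext
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Extend an existing input format and refuse to split files, so each
    // file is read by a single task on a single executor.
    class NonSplittableTextInputFormat extends TextInputFormat {
      override protected def isSplitable(context: JobContext, file: Path): Boolean = false
    }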

Re: custom input format in spark

2015-04-16 Thread Akhil Das
You can plug in the native hadoop input formats with Spark's sc.newAPIHadoopFile etc. which takes in the inputformat. Thanks Best Regards On Thu, Apr 16, 2015 at 10:15 PM, Shushant Arora shushantaror...@gmail.com wrote: Is it for spark? On Thu, Apr 16, 2015 at 10:05 PM, Akhil Das
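
A sketch of plugging such a format into Spark; NonSplittableTextInputFormat refers to the class sketched in the earlier reply and the path is a placeholder:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.SparkContext

    def readWholeFiles(sc: SparkContext) = {
      sc.newAPIHadoopFile[LongWritable, Text, NonSplittableTextInputFormat](
        "hdfs:///data/custom-format/")
        .map { case (_, text) => text.toString }   // copy out of the reused Writable
    }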

RE: Super slow caching in 1.3?

2015-04-16 Thread Evo Eftimov
Michael what exactly do you mean by flattened version/structure here e.g.: 1. An Object with only primitive data types as attributes 2. An Object with no more than one level of other Objects as attributes 3. An Array/List of primitive types 4. An Array/List of Objects This question is in

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Manish Gupta 8
Thanks Evo. Yes, my concern is only regarding the infrastructure configurations. Basically, configuring Yarn (Node Manager) + Spark is a must and the default settings never work. And what really happens is that we make changes as and when an issue is faced because of one of the numerous default

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Can your code that can reproduce the problem? On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Hi As per JIRA this issue is resolved, but i am still facing this issue. SPARK-2734 - DROP TABLE should also uncache table -- [image: Sigmoid Analytics]

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread Denny Lee
Bummer - out of curiosity, if you were to use the classpath.first or perhaps copy the jar to the slaves could that actually do the trick? The latter isn't really all that efficient but just curious if that could do the trick. On Thu, Apr 16, 2015 at 7:14 AM ARose ashley.r...@telarix.com wrote:

Re: Microsoft SQL jdbc support from spark sql

2015-04-16 Thread ARose
I take it back. My solution only works when you set the master to local. I get the same error when I try to run it on the cluster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399p22525.html Sent from the

[ThriftServer] Urgent -- very slow Metastore query from Spark

2015-04-16 Thread Yana Kadiyska
Hi Sparkers, hoping for insight here: running a simple describe mytable here where mytable is a partitioned Hive table. Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 Whereas Hive over the

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Evo Eftimov
Essentially, to change the performance yield of a software cluster infrastructure platform like Spark, you play with different permutations of: - Number of CPU cores used by Spark Executors on every cluster node - Amount of RAM allocated for each executor How disks and
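
A sketch of those knobs expressed as Spark configuration; the values are placeholders to be tuned per cluster, and spark.executor.instances applies to YARN deployments:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.cores", "4")      // CPU cores per executor
      .set("spark.executor.memory", "8g")    // RAM per executor
      .set("spark.executor.instances", "6")  // number of executors (YARN)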

saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
I am using Spark Streaming where during each micro-batch I output data to S3 using saveAsTextFile. Right now each batch of data is put into its own directory containing 2 objects, _SUCCESS and part-0. How do I output each batch into a common directory? Thanks, Vadim ᐧ

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Thanks Sean. I want to load each batch into Redshift. What's the best/most efficient way to do that? Vadim On Apr 16, 2015, at 1:35 PM, Sean Owen so...@cloudera.com wrote: You can't, since that's how it's designed to work. Batches are saved in different files, which are really directories

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
The reason for this is as follows: 1. You are saving data on HDFS 2. HDFS as a cluster/server side Service has a Single Writer / Multiple Reader multithreading model 3. Hence each thread of execution in Spark has to write to a separate file in HDFS 4. Moreover the

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Nop Sir, it is possible - check my reply earlier -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, April 16, 2015 6:35 PM To: Vadim Bichutskiy Cc: user@spark.apache.org Subject: Re: saveAsTextFile You can't, since that's how it's designed to work. Batches

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Basically you need to unbundle the elements of the RDD and then store them wherever you want - use foreachPartition and then foreach -Original Message- From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Thursday, April 16, 2015 6:39 PM To: Sean Owen Cc:
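
A sketch of that unbundling pattern; the actual write call is a placeholder:

    import org.apache.spark.streaming.dstream.DStream

    def sink(dstream: DStream[String]): Unit = {
      dstream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // open one connection per partition here (e.g. to Redshift or a DB)
          records.foreach(record => println(record))   // replace with the real write
          // close the connection here
        }
      }
    }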

Re: saveAsTextFile

2015-04-16 Thread Sean Owen
Just copy the files? it shouldn't matter that much where they are as you can find them easily. Or consider somehow sending the batches of data straight into Redshift? no idea how that is done but I imagine it's doable. On Thu, Apr 16, 2015 at 6:38 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com

RE: saveAsTextFile

2015-04-16 Thread Evo Eftimov
Also to juggle even further the multithreading model of both spark and HDFS you can even publish the data from spark first to a message broker e.g. kafka from where a predetermined number (from 1 to infinity) of parallel consumers will retrieve and store in HDFS in one or more finely controlled

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Copy should be doable but I'm not sure how to specify a prefix for the directory while keeping the filename (ie part-0) fixed in copy command. On Apr 16, 2015, at 1:51 PM, Sean Owen so...@cloudera.com wrote: Just copy the files? it shouldn't matter that much where they are as you can

Random pairs / RDD order

2015-04-16 Thread abellet
Hi everyone, I have a large RDD and I am trying to create an RDD of a random sample of pairs of elements from this RDD. The elements composing a pair should come from the same partition for efficiency. The idea I've come up with is to take two random samples and then use zipPartitions to pair each
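
A hedged sketch of the idea described above, relying on both samples keeping the parent RDD's partitioning so that paired elements never cross partitions:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def randomPairs[T: ClassTag](rdd: RDD[T], fraction: Double,
                                 seed1: Long, seed2: Long): RDD[(T, T)] = {
      val s1 = rdd.sample(withReplacement = false, fraction, seed1)
      val s2 = rdd.sample(withReplacement = false, fraction, seed2)
      // zip partition-by-partition; the shorter iterator in each partition wins
      s1.zipPartitions(s2) { (it1, it2) => it1.zip(it2) }
    }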

Re: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-16 Thread Sean Owen
I don't think there's anything specific to CDH that you need to know, other than it ought to set things up sanely for you. Sandy did a couple posts about tuning: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
No, I did not try the partitioning; below is the full code public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd, JavaSparkContext jsc) throws IOException{ long start = System.currentTimeMillis(); JavaPairRDD<Long, MatcherReleventData> RddForMarch = matchRdd.zipWithIndex().mapToPair(new

Re: saveAsTextFile

2015-04-16 Thread Sean Owen
You can't, since that's how it's designed to work. Batches are saved in different files, which are really directories containing partitions, as is common in Hadoop. You can move them later, or just read them where they are. On Thu, Apr 16, 2015 at 6:32 PM, Vadim Bichutskiy

Re: Spark on Windows

2015-04-16 Thread Matei Zaharia
You could build Spark with Scala 2.11 on Mac / Linux and transfer it over to Windows. AFAIK it should build on Windows too, the only problem is that Maven might take a long time to download dependencies. What errors are you seeing? Matei On Apr 16, 2015, at 9:23 AM, Arun Lists

Re: saveAsTextFile

2015-04-16 Thread Vadim Bichutskiy
Thanks Evo for your detailed explanation. On Apr 16, 2015, at 1:38 PM, Evo Eftimov evo.efti...@isecc.com wrote: The reason for this is as follows: 1. You are saving data on HDFS 2. HDFS as a cluster/server side Service has a Single Writer / Multiple Reader multithreading

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Michael Stone
On Thu, Apr 16, 2015 at 12:16:13PM -0700, Marcelo Vanzin wrote: I think Michael is referring to this: Exception in thread main java.lang.IllegalArgumentException: You must specify at least 1 executor! Usage: org.apache.spark.deploy.yarn.Client [options] Yes, sorry, there were too many mins

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Yin Huai
Oh, just noticed that I missed attach... Yeah, your scripts will be helpful. Thanks! On Thu, Apr 16, 2015 at 12:03 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Yes, i am able to reproduce the problem. Do you need the scripts to create the tables? On Thu, Apr 16, 2015 at 10:50 PM,

dataframe call, how to control number of tasks for a stage

2015-04-16 Thread Neal Yin
I have some trouble controlling the number of spark tasks for a stage. This is on the latest spark 1.3.x source code build. val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) sc.getConf.get("spark.default.parallelism") - set to 10 val t1 = hiveContext.sql("FROM SalesJan2009 select * ") val t2 =

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Marcelo Vanzin
I think Michael is referring to this: Exception in thread main java.lang.IllegalArgumentException: You must specify at least 1 executor! Usage: org.apache.spark.deploy.yarn.Client [options] spark-submit --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.minExecutors=0

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Michael Stone
On Thu, Apr 16, 2015 at 07:47:51PM +0100, Sean Owen wrote: IIRC that was fixed already in 1.3 https://github.com/apache/spark/commit/b2047b55c5fc85de6b63276d8ab9610d2496e08b From that commit: + private val minNumExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", 0) ... + if

Re: Spark on Windows

2015-04-16 Thread Arun Lists
Thanks, Matei! We'll try that and let you know if it works. You are correct in inferring that some of the problems we had were with dependencies. We also had problems with the spark-submit scripts. I will get the details from the engineer who worked on the Windows builds and provide them to you.

Re: Re: spark streaming printing no output

2015-04-16 Thread jay vyas
Empty folders generally mean that you need to just increase the window intervals; i.e. spark streaming saveAsTextFiles will save folders for each interval regardless. On Wed, Apr 15, 2015 at 5:03 AM, Shushant Arora shushantaror...@gmail.com wrote: Its printing on console but on HDFS all

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Richard Marscher
If it fails with sbt-assembly but not without it, then there's always the likelihood of a classpath issue. What dependencies are you rolling up into your assembly jar? On Thu, Apr 16, 2015 at 4:46 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Any ideas ? On Thu, Apr 16, 2015 at 5:04 PM,

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Michael Stone
On Thu, Apr 16, 2015 at 08:10:54PM +0100, Sean Owen wrote: Yes, look what it was before -- would also reject a minimum of 0. That's the case you are hitting. 0 is a fine minimum. How can 0 be a fine minimum if it's rejected? Changing the value is easy enough, but in general it's nice for

Re: Distinct is very slow

2015-04-16 Thread Jeetendra Gangele
Akhil, any thought on this? On 16 April 2015 at 23:07, Jeetendra Gangele gangele...@gmail.com wrote: No I did not tried the partitioning below is the full code public static void matchAndMerge(JavaRDDVendorRecord matchRdd,JavaSparkContext jsc) throws IOException{ long start =

Re: dataframe can not find fields after loading from hive

2015-04-16 Thread Cesar Flores
Never mind. I found the solution: val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, hiveLoadedDataFrame.schema) which translates to converting the data frame to an RDD and back again to a data frame. Not the prettiest solution, but at least it solves my problems. Thanks, Cesar Flores On

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Any ideas ? On Thu, Apr 16, 2015 at 5:04 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, Here is an issue that gets me mad. I wrote a UserDefineType in order to be able to store a custom type in a parquet file. In my code I just create a DataFrame with my custom data type and write

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Jeetendra Gangele
Does this same functionality exist with Java? On 17 April 2015 at 02:23, Evo Eftimov evo.efti...@isecc.com wrote: You can use def partitionBy(partitioner: Partitioner): RDD[(K, V)] Return a copy of the RDD partitioned using the specified partitioner The

Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Koert Kuipers
I believe it is a generalization of some classes inside GraphX, where there was/is a need to keep stuff indexed for random access within each RDD partition On Thu, Apr 16, 2015 at 5:00 PM, Evo Eftimov evo.efti...@isecc.com wrote: Can somebody from Data Briks sched more light on this Indexed RDD

Re: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread Michael Armbrust
Filed: https://issues.apache.org/jira/browse/SPARK-6967 Shouldn't they be null? Statistics are only used to eliminate partitions that can't possibly hold matching values. So while you are right this might result in a false positive, that will not result in a wrong answer.

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Sean Owen
Yes, look what it was before -- would also reject a minimum of 0. That's the case you are hitting. 0 is a fine minimum. On Thu, Apr 16, 2015 at 8:09 PM, Michael Stone mst...@mathom.us wrote: On Thu, Apr 16, 2015 at 07:47:51PM +0100, Sean Owen wrote: IIRC that was fixed already in 1.3

Re: spark.dynamicAllocation.minExecutors

2015-04-16 Thread Sean Owen
Looks like that message would be triggered if spark.dynamicAllocation.initialExecutors was not set, or 0, if I read this right. Yeah, that might have to be positive. This requires you set initial executors to 1 if you want 0 min executors. Hm, maybe that shouldn't be an error condition in the args
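
In other words, something along these lines should get past the check while still allowing the pool to shrink to zero (a sketch extending the command quoted earlier in the thread):

    spark-submit \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=0 \
      --conf spark.dynamicAllocation.initialExecutors=1 \
      ...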

Re: Spark SQL query key/value in Map

2015-04-16 Thread Yin Huai
For Map type column, fields['driver'] is the syntax to retrieve the map value (in the schema, you can see fields: map). The syntax of fields.driver is used for struct type. On Thu, Apr 16, 2015 at 12:37 AM, jc.francisco jc.francisc...@gmail.com wrote: Hi, I'm new with both Cassandra and Spark
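
A small sketch of the two syntaxes side by side, with made-up table and column names and an assumed existing sqlContext:

    // Map-typed column: bracket lookup by key.
    sqlContext.sql("SELECT fields['driver'] FROM cars")
    // Struct-typed column: dot syntax addresses a named field instead.
    sqlContext.sql("SELECT address.city FROM people")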

Re: Timeout errors from Akka in Spark 1.2.1

2015-04-16 Thread N B
Hi Guillaume, Interesting that you brought up Shuffle. In fact we are experiencing this issue of shuffle files being left behind and not being cleaned up. Since this is a Spark streaming application, it is expected to stay up indefinitely, so shuffle files being left is a big problem right now.

Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Arush Kharbanda
Yes, i am able to reproduce the problem. Do you need the scripts to create the tables? On Thu, Apr 16, 2015 at 10:50 PM, Yin Huai yh...@databricks.com wrote: Can your code that can reproduce the problem? On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Hi

When querying ElasticSearch, score is 0

2015-04-16 Thread Andrejs Abele
Hi, I have data in my ElasticSearch server, when I query it using rest interface, I get results and score for each result, but when I run the same query in spark using ElasticSearch API, I get results and meta data, but the score is shown 0 for each record. My configuration is ... val conf = new

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
You can use def partitionBy(partitioner: Partitioner): RDD[(K, V)] Return a copy of the RDD partitioned using the specified partitioner The https://github.com/amplab/spark-indexedrdd stuff looks pretty cool and is something which adds valuable functionality to spark e.g. the point lookups

Re: regarding ZipWithIndex

2015-04-16 Thread Jeetendra Gangele
Can you please guide me on how I can extend RDD and convert it in the way you are suggesting. On 16 April 2015 at 23:46, Jeetendra Gangele gangele...@gmail.com wrote: In type T I already have Object ... I have RDD<Object> and then I am calling zipWithIndex on this RDD and getting RDD<Object,Long> on

AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Evo Eftimov
Can somebody from Data Bricks shed more light on this Indexed RDD library https://github.com/amplab/spark-indexedrdd It seems to come from AMP Labs and most of the Data Bricks guys are from there What is especially interesting is whether the Point Lookup (and the other primitives) can work

RE: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Evo Eftimov
Thanks, but we need a firm statement, preferably from somebody from the spark vendor Data Bricks, including an answer to the specific question posed by me and an assessment/confirmation of whether this is a production ready / quality library which can be used for general purpose RDDs, not just inside

MLlib - Naive Bayes Problem

2015-04-16 Thread riginos
I have a big dataset of categories of cars and descriptions of cars. I want to give a description of a car and have the program classify the category of that car. So I decided to use multinomial naive Bayes. I created a unique id for each word and replaced my whole category,description data.

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
Evo: "partition the large doc RDD based on the hash function on the key ie the docid" - what API do I use to do this? By the way, loading the entire dataset into memory causes an OutOfMemory problem because it is too large (I only have one machine with 16GB and 4 cores). I found something called

Re: Problem with Spark SQL UserDefinedType and sbt assembly

2015-04-16 Thread Jaonary Rabarisoa
Here is the list of my dependencies: libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % sparkVersion % "provided", "org.apache.spark" %% "spark-sql" % sparkVersion, "org.apache.spark" %% "spark-mllib" % sparkVersion, "org.iq80.leveldb" % "leveldb" % "0.7", "com.github.fommil.netlib"
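
For comparison, a common assembly setup is to mark all Spark modules as provided so they stay out of the fat jar; whether that addresses the UDT issue in this thread is not established:

    // Sketch only: keeping Spark itself out of the assembly avoids one common
    // source of classpath clashes when running with spark-submit.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % sparkVersion % "provided",
      "org.apache.spark" %% "spark-sql"   % sparkVersion % "provided",
      "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
    )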

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Yes, simply look for partitionBy in the Javadoc for e.g. JavaPairRDD From: Jeetendra Gangele [mailto:gangele...@gmail.com] Sent: Thursday, April 16, 2015 9:57 PM To: Evo Eftimov Cc: Wang, Ningjun (LNG-NPV); user Subject: Re: How to join RDD keyValuePairs efficiently Does this same

mapPartitions() in Java 8

2015-04-16 Thread samirissa00
Hi, how do I convert this script to Java 8 with lambdas? My problem is that the function getactivations() returns a scala.collection.Iterator<Node> to mapPartitions(), which needs a java.util.Iterator<String> ... Thks ! === // Step 1 - Stub code to copy

Re: Random pairs / RDD order

2015-04-16 Thread Guillaume Pitel
Hi Aurelien, Sean's solution is nice, but maybe not completely order-free, since pairs will come from the same partition. The easiest / fastest way to do it in my opinion is to use a random key instead of a zipWithIndex. Of course you'll not be able to ensure uniqueness of each element of

Re: AMP Lab Indexed RDD - question for Data Bricks AMP Labs

2015-04-16 Thread Ankur Dave
I'm the primary author of IndexedRDD. To answer your questions: 1. Operations on an IndexedRDD partition can only be performed from a task operating on that partition, since doing otherwise would require decentralized coordination between workers, which is difficult in Spark. If you want to

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-16 Thread Wang, Daoyuan
Can you tell us how you created the dataframe? From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, April 17, 2015 2:52 AM To: rkrist Cc: user Subject: Re: ClassCastException processing date fields using spark SQL since 1.3.0 Filed:

Base metrics for Spark Benchmarking.

2015-04-16 Thread Bijay Pathak
Hello, We wanted to tune Spark running on a YARN cluster. The Spark History Server UI shows lots of parameters like: - GC time - Task Duration - Shuffle R/W - Shuffle Spill (Memory/Disk) - Serialization Time (Task/Result) - Scheduler Delay Among the above metrics, which are

Re: Spark on Windows

2015-04-16 Thread Stephen Boesch
The hadoop support from HortonWorks only actually works with Windows Server - well at least as of Spark Summit last year : and AFAIK that has not changed since 2015-04-16 15:18 GMT-07:00 Dean Wampler deanwamp...@gmail.com: If you're running Hadoop, too, now that Hortonworks supports Spark,
