Building RDD for a Custom MPP Database

2015-10-05 Thread VJ
Hi, I have to build an RDD for a custom MPP database, which is sharded across several nodes. I would like to do this using Java; can I extend JavaRDD and override the specific methods? Also, can I override the getPreferredLocations method as well? Are there any other alternatives, where I can

Pyspark 1.5.1: Error when using findSynonyms after loading Word2VecModel

2015-10-05 Thread evg952
After loading a word2vec model that I have trained and saved I get the following error when trying to use the findSynonyms function: synonyms = model.findSynonyms("pope", 20) File "/Users/edgarvelasco/minhash/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/feature.py", line 450,

Re: RDD of ImmutableList

2015-10-05 Thread Igor Berman
Kryo doesn't support Guava's collections by default. I remember encountering a project on GitHub that fixes this (not sure though). I've ended up not using Guava collections as far as Spark RDDs are concerned. On 5 October 2015 at 21:04, Jakub Dubovsky wrote: > Hi
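A workaround that is sometimes used is to register a Guava serializer through a custom KryoRegistrator. A minimal sketch, assuming the third-party de.javakaffee kryo-serializers jar is on the classpath (the registrator class name is illustrative):

    import com.esotericsoftware.kryo.Kryo
    import de.javakaffee.kryoserializers.guava.ImmutableListSerializer
    import org.apache.spark.serializer.KryoRegistrator

    // Teach Kryo how to (de)serialize Guava's ImmutableList.
    class GuavaKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        ImmutableListSerializer.registerSerializers(kryo)
      }
    }

    // Wire it up on the SparkConf before creating the context:
    //   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    //   conf.set("spark.kryo.registrator", "GuavaKryoRegistrator")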

RE: Building RDD for a Custom MPP Database

2015-10-05 Thread java8964
You want to implement a custom InputFormat for your MPP, which can provide the location preference information to Spark. Yong > Date: Mon, 5 Oct 2015 10:53:27 -0700 > From: vjan...@sankia.com > To: user@spark.apache.org > Subject: Building RDD for a Custom MPP Database > > Hi > I have to build
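Staying with the custom-RDD route from the original question, a minimal Scala sketch of an RDD that exposes one partition per shard and reports location preferences might look like this (the shard descriptor and readShard call are placeholders, not a real MPP client):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical descriptor of one shard of the MPP database.
    case class MppShard(idx: Int, host: String) extends Partition {
      override def index: Int = idx
    }

    class MppRDD(sc: SparkContext, shards: Seq[MppShard]) extends RDD[String](sc, Nil) {

      // One Spark partition per database shard.
      override protected def getPartitions: Array[Partition] = shards.toArray[Partition]

      // Tell the scheduler which host holds each shard.
      override protected def getPreferredLocations(split: Partition): Seq[String] =
        Seq(split.asInstanceOf[MppShard].host)

      override def compute(split: Partition, context: TaskContext): Iterator[String] = {
        val shard = split.asInstanceOf[MppShard]
        readShard(shard.host) // placeholder for the database-specific read
      }

      private def readShard(host: String): Iterator[String] = Iterator.empty // placeholder
    }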

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-10-05 Thread Fernando Paladini
Thank you for the replies and sorry about the delay, my e-mail client sent this conversation to Spam (??). I'll take a look at your tips and come back later to post my questions / progress. Again, thank you so much! 2015-09-30 18:37 GMT-03:00 Michael Armbrust : > I think

Please help: Processes with HiveContext slower in cluster

2015-10-05 Thread Saif.A.Ellafi
Hi, I have a HiveContext job which takes less than 1 minute to complete in local mode with 16 cores. However, when I launch it over a stand-alone cluster, it takes forever and probably can't even finish, even when the only node running is the same one on which I execute it locally. How could I

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-05 Thread Umesh Kacha
Hi, thanks. I usually see the following errors in the Spark logs, and because of that I think the executor gets lost. All of this happens because of a huge data shuffle, which I can't avoid. I don't know what to do, please guide. 15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no

Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Dino Fancellu
Ah thanks, got it working with that. e.g. val (_, smap) = shortest.vertices.filter(_._1 == src).first; smap.contains(dest) Is there anything a little less eager? i.e. that doesn't compute all the distances from all source nodes, where I can supply the source vertex id, dest vertex id, and just get

Re: Exception: "You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly"

2015-10-05 Thread Ted Yu
In the tarball, do you see any class(es) from the spark-hive module? From the error message, I don't think so. Cheers On Mon, Oct 5, 2015 at 11:16 AM, Ahmed Cheriat wrote: > Thanks Ted for your reply. > Well, it's a standalone spark version "spark-1.5.0-bin-hadoop2.6"

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-10-05 Thread Fernando Paladini
Update: I've updated my code and now I have the following JSON: https://gist.github.com/paladini/27bb5636d91dec79bd56 In the same link you can check the output from "spark-submit myPythonScript.py", where I call "myDataframe.show()". The following is printed by Spark (among other useless debug

Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Anwar Rizal
Maybe connected components is what you need? On Oct 5, 2015 19:02, "Robineast" wrote: > GraphX has a Shortest Paths algorithm implementation which will tell you, > for > all vertices in the graph, the shortest distance to a specific ('landmark') > vertex. The returned

RDD of ImmutableList

2015-10-05 Thread Jakub Dubovsky
Hi all, I would like some advice on how to use ImmutableList with an RDD. A small presentation of the essence of my problem in spark-shell, with the guava jar added: scala> import com.google.common.collect.ImmutableList import com.google.common.collect.ImmutableList scala> val arr =

Re: RDD of ImmutableList

2015-10-05 Thread Jakub Dubovsky
Thank you for the quick reaction. I have to say this is very surprising to me. I have never received advice to stop using an immutable approach. The whole RDD concept is designed to be immutable (which is sort of sabotaged by not being able to (de)serialize immutable classes properly). I will ask on dev

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-05 Thread Davies Liu
Could you create a JIRA to track this bug? On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan wrote: > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1. > > I'm trying to read in a large quantity of json data in a couple of files and > I receive a

Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Robineast
GraphX doesn't implement Tinkerpop functionality but there is an external effort to provide an implementation. See https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4279 - Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co.

Spark SQL "SELECT ... LIMIT" scans the entire Hive table?

2015-10-05 Thread YaoPau
I'm using SqlCtx connected to Hive in CDH 5.4.4. When I run "SELECT * FROM my_db.my_tbl LIMIT 5", it scans the entire table like Hive would instead of doing a .take(5) on it and returning results immediately. Is there a way to get Spark SQL to use .take(5) instead of the Hive logic of scanning

Re: String operation in filter with a special character

2015-10-05 Thread Michael Armbrust
Double quotes (") are used to create string literals in HiveQL / Spark SQL. So you are asking if the string A+B equals the number 2.0. You should use backticks (`) to escape weird characters in column names. On Mon, Oct 5, 2015 at 12:59 AM, Hemminger Jeff wrote: > I have a
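For example (a tiny sketch; df is assumed to already have a column literally named A+B):

    // Backticks make Spark SQL treat A+B as a column name rather than an arithmetic expression.
    df.filter("`A+B` = 2.0").show()

    // Or skip string parsing entirely and reference the Column object directly:
    df.filter(df("A+B") === 2.0).show()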

Re: Secondary Sorting in Spark

2015-10-05 Thread Adrian Tanase
Great article, especially the use of a custom partitioner. Also, sorting by multiple fields by creating a tuple out of them is an awesome, easy to miss, Scala feature. Sent from my iPhone On 04 Oct 2015, at 21:41, Bill Bejeck > wrote: I've written

Re: Broadcast var is null

2015-10-05 Thread Adrian Tanase
FYI the same happens with accumulators when recovering from checkpoint. I'd love to see this fixed somehow as the workaround (using a singleton factory in foreachRdd to make sure the accumulators are initialized instead of null) is really intrusive... Sent from my iPhone On 05 Oct 2015, at

Re: Spark context on thrift server

2015-10-05 Thread Michael Armbrust
Isolation for different sessions will hopefully be fixed by https://github.com/apache/spark/pull/8909 On Mon, Oct 5, 2015 at 8:38 AM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi, > > > > We’re using a spark thrift server and we connect using jdbc to run queries. > > Every time

Re: "Method json([class java.util.HashMap]) does not exist" when reading JSON on PySpark

2015-10-05 Thread Michael Armbrust
Looks correct to me. Try for example: from pyspark.sql.functions import * df.withColumn("value", explode(df['values'])).show() On Mon, Oct 5, 2015 at 2:15 PM, Fernando Paladini wrote: > Update: > > I've updated my code and now I have the following JSON: >

Re: Broadcast var is null

2015-10-05 Thread Tathagata Das
Make sure the broadcast variable works independently of the streaming application. Then make sure it works without StreamingContext.getOrCreate(). That will disambiguate whether that error is thrown when starting a new context, or when recovering a context from checkpoint (as getOrCreate is

Re: Usage of transform for code reuse between Streaming and Batch job affects the performance ?

2015-10-05 Thread Adrian Tanase
It shouldn't, as lots of the streaming operations delegate to transform under the hood. Easiest way to make sure is to look at the source code - with a decent IDE navigating around should be a breeze. As a matter of fact, for more advanced operations where you may want to control the

OutOfMemoryError

2015-10-05 Thread Ramkumar V
Hi, When I submit a Java Spark job in cluster mode, I'm getting the following exception. *LOG TRACE :* INFO yarn.ExecutorRunnable: Setting up executor with commands: List({{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill %p', -Xms1024m, -Xmx1024m, -Djava.io.tmpdir={{PWD}}/tmp,

looking for HDP users

2015-10-05 Thread Tamas Szuromi
Hello, I'm looking for someone who is using Hortonworks Data Platform, especially 2.3, and also using Spark 1.5.x. I have the following issue with HDP and I wanted to know whether it is a general bug with HDP or just a local issue. https://issues.apache.org/jira/browse/SPARK-10896 Thanks in advance! *Tamas*

K-Means seems biased to one center

2015-10-05 Thread Justin Pihony
(Cross post with http://stackoverflow.com/questions/32936380/k-means-clustering-is-biased-to-one-center) I have a corpus of wiki pages (baseball, hockey, music, football) which I'm running through tfidf and then through kmeans. After a couple issues to start (you can see my previous questions),

String operation in filter with a special character

2015-10-05 Thread Hemminger Jeff
I have a rather odd use case. I have a DataFrame column name with a + value in it. The app performs some processing steps before determining the column name, and it would be much easier to code if I could use the DataFrame filter operations with a String. This demonstrates the issue I am having:

Re: performance difference between Thrift server and SparkSQL?

2015-10-05 Thread Jeff Thompson
Thanks for the suggestion. The output from EXPLAIN is indeed equivalent in both sparkSQL and via the Thrift server. I did some more testing. The source of the performance difference is in the way I was triggering the sparkSQL query. I was using .count() instead of .collect(). When I use

Re: OutOfMemoryError

2015-10-05 Thread Jean-Baptiste Onofré
Hi Ramkumar, did you try to increase Xmx of the workers ? Regards JB On 10/05/2015 08:56 AM, Ramkumar V wrote: Hi, When i submit java spark job in cluster mode, i'm getting following exception. *LOG TRACE :* INFO yarn.ExecutorRunnable: Setting up executor with commands:
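If it is the executor heap that is running out, the usual first step is to raise the memory sizes at submit time, e.g. (class name, jar and sizes are illustrative):

    spark-submit --class com.example.MyJob \
      --master yarn-cluster \
      --driver-memory 2g \
      --executor-memory 4g \
      myjob.jar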

Graphx hangs and crashes on EdgeRDD creation

2015-10-05 Thread William Saar
Hi, I am trying to run a GraphX job on 20 million edges with Spark 1.5.1, but the job seems to hang for 30 minutes on a single executor when creating the graph and eventually crashes with "IllegalArgumentException: Size exceeds Integer.MAX_VALUE" I suspect this is because of partitioning

Re: Store DStreams into Hive using Hive Streaming

2015-10-05 Thread Krzysztof Zarzycki
I'm also interested in this feature. Did you guys find any information about how to use Hive Streaming with Spark Streaming? Thanks, Krzysiek 2015-07-17 20:16 GMT+02:00 unk1102 : > Hi I have a similar use case, did you find a solution for this problem of > loading >

Re: spark.streaming.kafka.maxRatePerPartition for direct stream

2015-10-05 Thread Tathagata Das
Also, the backpressure configuration only applies to Spark 1.5 and above. Just making that clear. On Fri, Oct 2, 2015 at 6:55 AM, Cody Koeninger wrote: > But turning backpressure on won't stop you from choking on the first batch > if you're doing e.g. some kind of in-memory
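For reference, the settings involved look roughly like this (a sketch; the rate cap value is arbitrary):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Backpressure: only available in Spark 1.5+.
      .set("spark.streaming.backpressure.enabled", "true")
      // Hard cap per Kafka partition for the direct stream, useful for the first batches.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")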

Re: question on make multiple external calls within each partition

2015-10-05 Thread Ashish Soni
Need more details, but you might want to filter the data first (create multiple RDDs) and then process them. > On Oct 5, 2015, at 8:35 PM, Chen Song wrote: > > We have a use case with the following design in Spark Streaming. > > Within each batch, > * data is read and

Re: Spark streaming job filling a lot of data in local spark nodes

2015-10-05 Thread Tathagata Das
You could use it. But do remember that it is a brute-force blunt hammer that forcefully deletes everything older than the TTL. So if you are using some broadcast variable across streaming batches, that broadcasted data will get deleted as well, and jobs will start failing. You could get around that by

Re: question on make multiple external calls within each partition

2015-10-05 Thread Tathagata Das
You could create a threadpool on demand within the foreachPartition function, then hand off the REST calls to that threadpool, get back the futures and wait for them to finish. Should be pretty straightforward. Make sure that your foreachPartition function cleans up the threadpool before finishing.
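A rough sketch of that pattern (dstream and callRestService are placeholders for the application's own stream and REST client call):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration._

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Pool is created lazily on the executor, inside the task.
        val pool = Executors.newFixedThreadPool(8)
        implicit val ec = ExecutionContext.fromExecutorService(pool)
        try {
          // toList forces all futures to be submitted before we start waiting.
          val futures = records.map(rec => Future { callRestService(rec) }).toList
          futures.foreach(f => Await.result(f, 30.seconds))
        } finally {
          pool.shutdown() // clean up the pool before the task finishes
        }
      }
    }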

RE: laziness in textFile reading from HDFS?

2015-10-05 Thread Mohammed Guller
Is there any specific reason for caching the RDD? How many passes do you make over the dataset? Mohammed -Original Message- From: Matt Narrell [mailto:matt.narr...@gmail.com] Sent: Saturday, October 3, 2015 9:50 PM To: Mohammed Guller Cc: davidkl; user@spark.apache.org Subject: Re:

Writing UDF with variable number of arguments

2015-10-05 Thread tridib
Hi Friends, I want to write a UDF which takes a variable number of arguments with varying types: myudf(String key1, String value1, String key2, int value2, ...) What is the best way to do it in Spark? Thanks Tridib -- View this message in context:
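One workaround that is sometimes suggested, sketched below: register a UDF that takes a single Seq argument and wrap the actual arguments with array(), casting them to a common type (the column names and the combining logic are illustrative):

    import org.apache.spark.sql.functions.{array, col, udf}

    // The UDF sees the array() column as a Seq; grouped(2) re-pairs the key/value arguments.
    val myudf = udf { (args: Seq[String]) =>
      args.grouped(2).map { case Seq(k, v) => s"$k=$v" }.mkString("&")
    }

    val result = df.select(
      myudf(array(col("key1"), col("value1"),
                  col("key2"), col("value2").cast("string"))).as("combined"))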

Streaming Performance w/ UpdateStateByKey

2015-10-05 Thread Jeff Nadler
While investigating performance challenges in a Streaming application using UpdateStateByKey, I found that serialization of state was a meaningful (not dominant) portion of our execution time. In StateDStream.scala, serialized persistence is required:

Re: Lookup / Access of master data in spark streaming

2015-10-05 Thread Tathagata Das
Yes, when old broadcast objects are not referenced any more in the driver, then the associated data in the driver AND the executors will get cleared. On Mon, Oct 5, 2015 at 1:40 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > @td does that mean that the "old" broadcasted data will in

Re: Broadcast var is null

2015-10-05 Thread Dmitry Pristin
Hi guys, thanks a lot for responding so quickly! I've reduced the code to the code below - no streaming, no Kafka, no checkpoint. Unfortunately the end result is the same. Any suggestion to where I'm messing up would be very much appreciated ! object BroadcastTest extends App { val logger =

save DF to JDBC

2015-10-05 Thread Ruslan Dautkhanov
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases Spark JDBC can read data from JDBC, but can it save back to JDBC? Like to an Oracle database through its jdbc driver. Also looked at SQL Context documentation

Re: Spark thrift service and Hive impersonation.

2015-10-05 Thread Jagat Singh
Hello Steve, Thanks for the confirmation. Is there any work planned on this? Thanks, Jagat Singh On Wed, Sep 30, 2015 at 9:37 PM, Vinay Shukla wrote: > Steve is right, > The Spark thrift server does not propagate end user identity downstream > yet. > > > > On

ERROR: "Size exceeds Integer.MAX_VALUE" Spark 1.5

2015-10-05 Thread Muhammad Ahsan
Hello Everyone ! I am working with spark 1.5 over YARN. I am trying something like val results = sqlContext.sql("SELECT guid FROM clickstream group by guid") results.take(10).foreach(println) But I am getting the following error. I am using data frames and unable to resolve this error, please

Re: Spark SQL "SELECT ... LIMIT" scans the entire Hive table?

2015-10-05 Thread Michael Armbrust
It does do a take. Run explain to make sure that is the case. Why do you think its reading the whole table? On Mon, Oct 5, 2015 at 1:53 PM, YaoPau wrote: > I'm using SqlCtx connected to Hive in CDH 5.4.4. When I run "SELECT * FROM > my_db.my_tbl LIMIT 5", it scans the
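e.g. a quick check (sketch):

    // explain(true) prints the parsed/analyzed/physical plans, so you can see whether a Limit shows up.
    sqlCtx.sql("SELECT * FROM my_db.my_tbl LIMIT 5").explain(true)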

Re: Streaming Performance w/ UpdateStateByKey

2015-10-05 Thread Tathagata Das
You could call DStream.persist(StorageLevel.MEMORY_ONLY) on the stateDStream returned by updateStateByKey to achieve the same. As you have seen, the downside is greater memory usage, and also higher GC overheads (that's the main one usually). So I suggest you run your benchmarks for a long enough
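In code that looks roughly like this (sketch; keyedStream and updateFunc stand in for the application's own stream and update function):

    import org.apache.spark.storage.StorageLevel

    val stateDStream = keyedStream.updateStateByKey(updateFunc)
    // Override the default serialized persistence; trades memory/GC for less serialization work.
    stateDStream.persist(StorageLevel.MEMORY_ONLY)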

RE: save DF to JDBC

2015-10-05 Thread Young, Matthew T
I’ve gotten it to work with SQL Server (with limitations; it’s buggy and doesn’t work with some types/operations). https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html is the Java API you are looking for; the JDBC method lets you write to JDBC databases. I
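The write path added in 1.4 looks roughly like this (a sketch; the URL, table and credentials are placeholders):

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "dbpass")

    // Requires the vendor's JDBC driver jar on the driver and executor classpaths.
    df.write
      .mode("append") // or "overwrite"
      .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "MY_SCHEMA.MY_TABLE", props)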

Re: save DF to JDBC

2015-10-05 Thread Richard Hillegas
Hi Ruslan, Here is some sample code which writes a DataFrame to a table in a Derby database: import org.apache.spark.sql._ import org.apache.spark.sql.types._ val binaryVal = Array[Byte] ( 1, 2, 3, 4 ) val timestampVal = java.sql.Timestamp.valueOf("1996-01-01 03:30:36") val dateVal =

Re: spark-ec2 config files.

2015-10-05 Thread Hemminger Jeff
The spark-ec2 script generates spark config files from templates. Those are located here: https://github.com/amplab/spark-ec2/tree/branch-1.5/templates/root/spark/conf Note the link is referring to the 1.5 branch. Is this what you are looking for? Jeff On Mon, Oct 5, 2015 at 8:56 AM, Renato

Re: Store DStreams into Hive using Hive Streaming

2015-10-05 Thread Tathagata Das
Hive is not designed for OLTP workloads like data insertions and updates you want to do with Spark Streaming. Hive is mainly for OLAP workloads where you already have data and you want to run bulk queries on the data. Other systems like HBase and Cassandra are more designed for OLTP. Please think

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Andrew Or
Hi all, Both the history server and the shuffle service are backward compatible, but not forward compatible. This means as long as you have the latest version of history server / shuffle service running in your cluster then you're fine (you don't need multiple of them). That said, an old shuffle

question on make multiple external calls within each partition

2015-10-05 Thread Chen Song
We have a use case with the following design in Spark Streaming. Within each batch, * data is read and partitioned by some key * forEachPartition is used to process the entire partition * within each partition, there are several REST clients created to connect to different REST services * for the

Re: RDD of ImmutableList

2015-10-05 Thread Adrian Tanase
If you don't need to write data back using that library I'd say go for #2. Convert to a scala class and standard lists, should be easier down the line. That being said, you may end up writing custom code if you stick with kryo anyway... Sent from my iPhone On 05 Oct 2015, at 21:42, Jakub

Re: StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Davies Liu
Could you tell us a way to reproduce this failure? Reading from JSON or Parquet? On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov wrote: > Hi, > > We're building our own framework on top of spark and we give users pretty > complex schema to work with. That requires from

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Alex Rovner
We are running CDH 5.4 with Spark 1.3 as our main version and that version is configured to use the external shuffling service. We have also installed Spark 1.5 and have configured it not to use the external shuffling service and that works well for us so far. I would be interested myself how to

RE: No space left on device when running graphx job

2015-10-05 Thread Jack Yang
Just something usual as below: 1. Check the physical disk volume (particularly /tmp folder) 2. Use spark.local.dir to check the size of the temp files 3. Add more workers 4. Decrease partitions (in code) From: Robin East [mailto:robin.e...@xense.co.uk] Sent: Saturday, 26

Re: spark-ec2 config files.

2015-10-05 Thread Renato Perini
Yes. Thank you Jeff, really appreciated the help. Renato. On 06/10/2015 00:06, Hemminger Jeff wrote: The spark-ec2 script generates spark config files from templates. Those are located here: https://github.com/amplab/spark-ec2/tree/branch-1.5/templates/root/spark/conf Note the link is

Re: String operation in filter with a special character

2015-10-05 Thread Hemminger Jeff
Thank you! On Tue, Oct 6, 2015 at 4:50 AM, Michael Armbrust wrote: > Double quotes (") are used to create string literals in HiveQL / Spark > SQL. So you are asking if the string A+B equals the number 2.0. > > You should use backticks (`) to escape weird characters in

How can I disable logging when running local[*]?

2015-10-05 Thread Jeff Jones
I’ve written an application that hosts the Spark driver in-process using “local[*]”. I’ve turned off logging in my conf/log4j.properties file. I’ve also tried putting the following code prior to creating my SparkContext. These were cobbled together from various posts I’ve read. None of these steps

Re: ERROR: "Size exceeds Integer.MAX_VALUE" Spark 1.5

2015-10-05 Thread Anuj Kumar
You may be hitting the 2GB limit. See- https://issues.apache.org/jira/browse/SPARK-5928 https://issues.apache.org/jira/browse/SPARK-6190 https://issues.apache.org/jira/browse/SPARK-6235 Increasing the number of partitions might help-
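For a group-by like the one in the original post, increasing the shuffle parallelism is usually the first thing to try (a sketch; 400 is an arbitrary value):

    // More shuffle partitions -> smaller blocks, which helps stay under the 2GB-per-block limit.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")
    val results = sqlContext.sql("SELECT guid FROM clickstream GROUP BY guid")
    results.take(10).foreach(println)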

Re: How can I disable logging when running local[*]?

2015-10-05 Thread Alex Kozlov
Did you try “--driver-java-options '-Dlog4j.configuration=file:/'” and setting the log4j.rootLogger=FATAL,console? On Mon, Oct 5, 2015 at 8:19 PM, Jeff Jones wrote: > I’ve written an application that hosts the Spark driver in-process using > “local[*]”. I’ve turned
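Another approach that sometimes works for local[*] runs is to silence log4j programmatically before the SparkContext is created (a sketch, assuming the default log4j 1.x backend):

    import org.apache.log4j.{Level, Logger}

    // Mute the noisiest loggers before new SparkContext(...).
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)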

Re: save DF to JDBC

2015-10-05 Thread Ruslan Dautkhanov
Thank you Richard and Matthew. DataFrameWriter first appeared in Spark 1.4. Sorry, I should have mentioned earlier, we're on CDH 5.4 / Spark 1.3. No options for this version? Best regards, Ruslan Dautkhanov On Mon, Oct 5, 2015 at 4:00 PM, Richard Hillegas wrote: > Hi

Re: Store DStreams into Hive using Hive Streaming

2015-10-05 Thread Umesh Kacha
Hi, no, I didn't find any solution. I still need that feature of Hive streaming using Spark, so please let me know if you find something. An alternative solution is to use Storm for the Hive processing. I would like to stick to Spark, so I'm still searching. On Oct 5, 2015 2:51 PM, "Krzysztof Zarzycki"

Re: Spark 1.5.0 Error on startup

2015-10-05 Thread Steve Loughran
this is windows, isn't it? you're missing a winutils.exe in the right place, and hadoop core isn't providing anything meaningful. Hadoop 2.8 will provide more diagnostics, including a link to this wiki page: https://wiki.apache.org/hadoop/WindowsProblems On 5 Oct 2015, at 05:26, Julius

StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Eugene Morozov
Hi, We're building our own framework on top of Spark and we give users a pretty complex schema to work with. That requires us to build dataframes ourselves: we transform business objects to rows and struct types and use these two to create a dataframe. Everything was fine until I started to

Spark handling parallel requests

2015-10-05 Thread tarek.abouzeid91
Hi, I am using Scala, writing a socket program to catch multiple requests at the same time and then call a function which uses Spark to handle each request. I have a multi-threaded server to handle the multiple requests and pass each to Spark, but there's a bottleneck as Spark doesn't

Re: Spark 1.5.0 Error on startup

2015-10-05 Thread Julius Fernandes
You are right this is related to Windows. I am using Windows8. Placed the WinUtils.exe in HADOOP_HOME/bin. Issue does not occur anymore. On Mon, Oct 5, 2015 at 3:30 PM, Steve Loughran wrote: > this is windows, isn't it? you're missing a winutils.exe in the right >

[Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Andreas Fritzler
Hi, I was just wondering, if it is possible to register multiple versions of the aux-services with YARN as described in the documentation: 1. In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set

Job on Yarn not using all given capacity ends up failing

2015-10-05 Thread Cesar Berezowski
Hi, I recently upgraded from 1.2.1 to 1.3.1 (through HDP). I have a job that does a cartesian product on two datasets (2K and 500K lines minimum) to do string matching. I updated it to use Dataframes because the old code wouldn’t run anymore (deprecated RDD functions). It used to run very

Re: OutOfMemoryError

2015-10-05 Thread Ramkumar V
No. I didn't try to increase xmx. *Thanks*, On Mon, Oct 5, 2015 at 1:36 PM, Jean-Baptiste Onofré wrote: > Hi Ramkumar, > > did you try to increase Xmx of the workers ? > > Regards > JB > > On 10/05/2015 08:56 AM, Ramkumar V wrote: >

Error: could not find function "includePackage"

2015-10-05 Thread jayendra.par...@yahoo.in
As mentioned on the website that “includePackage” command can be used to include existing R packages, but when I am using this command R is giving this error :- Error: could not find function "includePackage" And there is no function called includePackage in sparkR package version 1.5.0, so how

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Steve Loughran
> On 5 Oct 2015, at 15:59, Alex Rovner wrote: > > I have the same question about the history server. We are trying to run > multiple versions of Spark and are wondering if the history server is > backwards compatible. yes, it supports the pre-1.4 "Single attempt"

Re: DStream Transformation to save JSON in Cassandra 2.1

2015-10-05 Thread Ashish Soni
Try this: you can use dstream.map to convert it to a JavaDStream with only the data you are interested in, probably returning a POJO of your JSON, and then call foreachRDD and inside that call the line below: javaFunctions(rdd).writerBuilder("table", "keyspace", mapToRow(Class.class)).saveToCassandra(); On

Re: DStream Transformation to save JSON in Cassandra 2.1

2015-10-05 Thread Jean-Baptiste Onofré
Hi Prateek, I see two ways: - using Cassandra CQL to adapt the RDD in the DStream to Cassandra - using a Cassandra converter You have a couple of code snippet in the examples. Let me know if you need a code sample. Regards JB On 10/05/2015 04:14 PM, Prateek . wrote: Hi, I am beginner in

Re: Error: could not find function "includePackage"

2015-10-05 Thread Ted Yu
includePackage is defined in R/pkg/R/context.R FYI On Mon, Oct 5, 2015 at 6:46 AM, jayendra.par...@yahoo.in < jayendra.par...@yahoo.in> wrote: > As mentioned on the website that “includePackage” command can be used to > include existing R packages, but when I am using this command R is giving >

save checkpoint during dataframe row iteration

2015-10-05 Thread Justin Permar
Good morning, I have a typical iterator loop on a DataFrame loaded from a parquet data source: val conf = new SparkConf().setAppName("Simple Application").setMaster("local") val sc = new JavaSparkContext(conf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) val parquetDataFrame =

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Andreas Fritzler
Hi Steve, Alex, how do you handle the distribution and configuration of the spark-*-yarn-shuffle.jar on your NodeManagers if you want to use 2 different Spark versions? Regards, Andreas On Mon, Oct 5, 2015 at 4:54 PM, Steve Loughran wrote: > > > On 5 Oct 2015, at

RE: Error: could not find function "includePackage"

2015-10-05 Thread Koen Vantomme
Sent from my Sony Xperia™ smartphone jayendra.par...@yahoo.in wrote: >As mentioned on the website that “includePackage” command can be used to >include existing R packages, but when I am using this command R is giving >this error :- > >Error: could not find function

DStream Transformation to save JSON in Cassandra 2.1

2015-10-05 Thread Prateek .
Hi, I am beginner in Spark , this is sample data I get from Kafka stream: {"id": "9f5ccb3d5f4f421392fb98978a6b368f","coordinate":{"ax":"1.20","ay":"3.80","az":"9.90","oa":"8.03","ob":"8.8","og":"9.97"}} val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2) val jsonf

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Alex Rovner
I have the same question about the history server. We are trying to run multiple versions of Spark and are wondering if the history server is backwards compatible. *Alex Rovner* *Director, Data Engineering * *o:* 646.759.0052 * * On Mon, Oct 5, 2015 at 9:22 AM, Andreas

RE: How to install a Spark Package?

2015-10-05 Thread jeff saremi
Yes, those. Here's Avro's, for example: spark-shell --packages com.databricks:spark-avro_2.10:2.0.1 The way I read this is that this line would instruct Spark to go and get the package. But does that mean installation? Is this permanent? Do I need to specify it once? Or each time? Will this be

Spark context on thrift server

2015-10-05 Thread Younes Naguib
Hi, We're using a spark thrift server and we connect using jdbc to run queries. Every time we run a set query, like "set schema", it seems to affect the server, and not the session only. Is that an expected behavior? Or am I missing something. Younes Naguib Triton Digital | 1440 Ste-Catherine

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Alex Rovner
Hey Steve, Are you referring to the 1.5 version of the history server? *Alex Rovner* *Director, Data Engineering * *o:* 646.759.0052 * * On Mon, Oct 5, 2015 at 10:18 AM, Steve Loughran wrote: > > > On 5 Oct 2015, at 15:59, Alex Rovner

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Steve Loughran
> On 5 Oct 2015, at 16:48, Alex Rovner wrote: > > Hey Steve, > > Are you referring to the 1.5 version of the history server? > Yes. I should warn, however, that there's no guarantee that a history server running the 1.4 code will handle the histories of a 1.5+

Re: "java.io.IOException: Filesystem closed" on executors

2015-10-05 Thread Lan Jiang
I am still facing this issue. Executor dies due to org.apache.avro.AvroRuntimeException: java.io.IOException: Filesystem closed at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:278) at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197) at

Custom RDD for Proprietary MPP database

2015-10-05 Thread VJ Anand
Hi, I need to build an RDD that supports a custom-built database (which is sharded) across several nodes. I need to build an RDD that can support and provide the partitions specific to this database. I would like to do this in Java - I see there are JavaRDD, and other specific RDDs available - my

Exception: "You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly"

2015-10-05 Thread cherah30
I work with Spark 1.5 on Windows 7, with Anaconda and PySpark. Everything worked fine until I wanted to test the connection to my MySQL database. So I started following https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases

Re: GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Robineast
GraphX has a Shortest Paths algorithm implementation which will tell you, for all vertices in the graph, the shortest distance to a specific ('landmark') vertex. The returned value is '/a graph where each vertex attribute is a map containing the shortest-path distance to each reachable landmark
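A minimal sketch of that usage for the connectivity question (graph, src and dest are assumed to exist; GraphX 1.5-era API):

    import org.apache.spark.graphx.{Graph, VertexId}
    import org.apache.spark.graphx.lib.ShortestPaths

    def connected(graph: Graph[Int, Int], src: VertexId, dest: VertexId): Boolean = {
      // Run with `dest` as the only landmark, then look up the map attached to `src`.
      val result = ShortestPaths.run(graph, Seq(dest))
      result.vertices.filter { case (id, _) => id == src }.first()._2.contains(dest)
    }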

Spark metrics cpu/memory

2015-10-05 Thread gtanguy
I would like to monitor cpu/memory usage. I read the section Metrics on : http://spark.apache.org/docs/1.3.1/monitoring.html. Here my $SPARK_HOME/conf/metrics.properties # Enable CsvSink for all instances *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink # Polling period for CsvSink

Utility for PySpark DataFrames - smartframes

2015-10-05 Thread Don Drake
I would like to announce a Python package that makes creating rows in DataFrames in PySpark as easy as creating an object. Code is available on GitHub, PyPi, and soon to be on spark-packages.org. https://github.com/dondrake/smartframes Motivation Spark DataFrames provide a nice interface to

Re: Spark on YARN using Java 1.8 fails

2015-10-05 Thread Ted Yu
YARN 2.7.1 (running on the cluster) was built with Java 1.8, I assume. Have you used the following command to retrieve / inspect logs ? yarn logs -applicationId Cheers On Mon, Oct 5, 2015 at 8:41 AM, mvle wrote: > Hi, > > I have successfully run pyspark on Spark 1.5.1 on YARN

GraphX: How can I tell if 2 nodes are connected?

2015-10-05 Thread Dino Fancellu
Is there an existing api to see if 2 nodes in a graph are connected? e.g. a->b, b->c, c->d can I get to d, starting from a? (yes I hope!) I'm not asking the route, just want to know if there is a route. Thanks. -- View this message in context:

Re: Exception: "You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly"

2015-10-05 Thread Ted Yu
What command did you use to build Spark 1.5.0? bq. Export 'SPARK_HIVE=true' and run build/sbt assembly Please follow the above. BTW 1.5.1 has been released, which is more stable. Please use 1.5.1. Cheers On Mon, Oct 5, 2015 at 9:25 AM, cherah30 wrote: > I work

Where to put import sqlContext.implicits._ to be able to work on DataFrames in another file?

2015-10-05 Thread Kristina Rogale Plazonic
Hi all, I have a Scala project with multiple files: a main file and a file with utility functions on DataFrames. However, using $"colname" to refer to a column of the DataFrame in the utils file (see code below) produces a compile-time error as follows: "value $ is not a member of StringContext"
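Two common arrangements for the utility file, sketched below (names are illustrative):

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.functions.col

    object DfUtils {
      // Option 1: import the implicits from a context passed in, making $"..." available locally.
      def filterWithDollar(df: DataFrame, sqlContext: SQLContext): DataFrame = {
        import sqlContext.implicits._
        df.filter($"colname" > 1)
      }

      // Option 2: avoid the implicits entirely and use functions.col.
      def filterWithCol(df: DataFrame): DataFrame =
        df.filter(col("colname") > 1)
    }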

How to change verbosity level and redirect verbosity to file?

2015-10-05 Thread Saif.A.Ellafi
Hi, I would like to read the full spark-submit log once a job is completed, since the output is not stdout, how could I redirect spark output to a file in linux? Thanks, Saif

Spark Survey Results 2015 are now available

2015-10-05 Thread Denny Lee
Thanks to all of you who provided valuable feedback in our Spark Survey 2015. Because of the survey, we have a better picture of who’s using Spark, how they’re using it, and what they’re using it to build–insights that will guide major updates to the Spark platform as we move into Spark’s next

Spark on YARN using Java 1.8 fails

2015-10-05 Thread mvle
Hi, I have successfully run pyspark on Spark 1.5.1 on YARN 2.7.1 with Java OpenJDK 1.7. However, when I run the same test on Java OpenJDK 1.8 (or Oracle Java 1.8), I cannot start up pyspark. Has anyone been able to run Spark on YARN with Java 1.8? I get ApplicationMaster disassociated