Spark does not delete temporary directories

2015-05-07 Thread Taeyun Kim
Hi, After a Spark program completes, there are 3 temporary directories remaining in the temp directory. The file names are like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7 And when the Spark program runs on Windows, a snappy DLL file also remains in the temp directory. The file name is like

Re: How can I force operations to complete and spool to disk

2015-05-07 Thread ayan guha
2*2 cents 1. You can try repartition with a large number to achieve smaller partitions. 2. OOM errors can be avoided by increasing executor memory or using off-heap storage 3. How are you persisting? You can try persist with the DISK_ONLY_SER storage level 4. You may take a look in the
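
A minimal Scala sketch of suggestions 1 and 3 above, assuming an existing RDD named rdd and a hypothetical partition count; note that RDD blocks persisted to disk are stored serialized anyway, so StorageLevel.DISK_ONLY is used here in place of the DISK_ONLY_SER name mentioned:

    import org.apache.spark.storage.StorageLevel

    // A larger partition count gives smaller individual partitions
    val repartitioned = rdd.repartition(1000)
    // Spill the data to disk instead of holding it in executor memory
    val persisted = repartitioned.persist(StorageLevel.DISK_ONLY)
    persisted.count()   // force materialization so the blocks are written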

Re: Parquet number of partitions

2015-05-07 Thread Archit Thakur
Hi. The number of partitions is determined by the RDD it uses in the plan it creates. It uses NewHadoopRDD, which gives partitions by getSplits of the input format it is using. It uses FilteringParquetRowInputFormat, a subclass of ParquetInputFormat. To change the number of partitions, write a new input format and

Spark Job triggers second attempt

2015-05-07 Thread ๏̯͡๏
How can I stop Spark from triggering a second attempt when the first fails? I do not want to wait for the second attempt to fail again so that I can debug faster. .set(spark.yarn.maxAppAttempts, 0) OR .set(spark.yarn.maxAppAttempts, 1) is not helping. -- Deepak

Re: Error in SparkSQL/Scala IDE

2015-05-07 Thread Iulian Dragoș
On Thu, May 7, 2015 at 10:18 AM, Iulian Dragoș iulian.dra...@typesafe.com wrote: Got it! I'll open a Jira ticket and PR when I have a working solution. Scratch that, I found SPARK-5281 https://issues.apache.org/jira/browse/SPARK-5281.. On Wed, May 6, 2015 at 11:53 PM, Michael Armbrust

Re: Error in SparkSQL/Scala IDE

2015-05-07 Thread Iulian Dragoș
Got it! I'll open a Jira ticket and PR when I have a working solution. On Wed, May 6, 2015 at 11:53 PM, Michael Armbrust mich...@databricks.com wrote: Hi Iulian, The relevant code is in ScalaReflection

User Defined Type (UDT)

2015-05-07 Thread Wojtek Jurczyk
Hi all! I'm using Spark 1.3.0 and I'm struggling with a definition of a new type for a project I'm working on. I've created a case class Person(name: String) and now I'm trying to make Spark able to serialize and deserialize the defined type. I made a couple of attempts but none of them did

Re: Map one RDD into two RDD

2015-05-07 Thread Bill Q
The multi-threading code in Scala is quite simple and you can google it pretty easily. We used the Future framework. You can use Akka also. @Evo My concerns for the filtering solution are: 1. Will rdd2.filter run before rdd1.filter finishes? 2. We have to traverse the rdd twice. Any comments? On
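
A minimal sketch of the Future-based approach, assuming sc is the SparkContext and the source RDD is cached so the two concurrently submitted jobs do not each re-read the input (which addresses concern 2); only the actions inside the futures run in parallel, since filter itself is lazy:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val source = sc.parallelize(1 to 1000000).cache()   // hypothetical source RDD

    // Each Future submits its own Spark job, so the two pipelines run concurrently
    val evensF = Future { source.filter(_ % 2 == 0).count() }
    val oddsF  = Future { source.filter(_ % 2 != 0).count() }

    val evens = Await.result(evensF, 10.minutes)
    val odds  = Await.result(oddsF, 10.minutes)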

history server

2015-05-07 Thread Koert Kuipers
i am trying to launch the spark 1.3.1 history server on a secure cluster. i can see in the logs that it successfully logs into kerberos, and it is replaying all the logs, but i never see the log message that indicate the web server is started (i should see something like Successfully started

Re: saveAsTable fails on Python with Unresolved plan found

2015-05-07 Thread Michael Armbrust
Sorry for the confusion. SQLContext doesn't have a persistent metastore so it's not possible to save data as a table. If anyone wants to contribute, I'd welcome a new query planner strategy for SQLContext that gave a better error message. On Thu, May 7, 2015 at 8:41 AM, Judy Nash
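
A minimal Scala sketch of the workaround implied here, assuming a Spark build with Hive support and a hypothetical input path; HiveContext is backed by a persistent metastore, so saveAsTable works there:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Hypothetical input file; any DataFrame source works the same way
    val df = hiveContext.jsonFile("examples/src/main/resources/people.json")
    df.saveAsTable("people")   // persisted in the Hive metastore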

Re: Avro to Parquet ?

2015-05-07 Thread Michael Armbrust
Spark SQL using the Data Source API can also do this with much less code https://twitter.com/michaelarmbrust/status/579346328636891136. https://github.com/databricks/spark-avro On Thu, May 7, 2015 at 8:41 AM, Jonathan Coveney jcove...@gmail.com wrote: A helpful example of how to convert:

Re: AvroFiles

2015-05-07 Thread Michael Armbrust
I would suggest also looking at: https://github.com/databricks/spark-avro On Wed, May 6, 2015 at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Hello, This is how i read Avro data. import org.apache.avro.generic.GenericData import org.apache.avro.generic.GenericRecord import

Re: history server

2015-05-07 Thread Shixiong Zhu
The history server may need several hours to start if you have a lot of event logs. Is it stuck, or still replaying logs? Best Regards, Shixiong Zhu 2015-05-07 11:03 GMT-07:00 Marcelo Vanzin van...@cloudera.com: Can you get a jstack for the process? Maybe it's stuck somewhere. On Thu, May 7,

Re: history server

2015-05-07 Thread Shixiong Zhu
SPARK-5522 is really cool. Didn't notice it. Best Regards, Shixiong Zhu 2015-05-07 11:36 GMT-07:00 Marcelo Vanzin van...@cloudera.com: That shouldn't be true in 1.3 (see SPARK-5522). On Thu, May 7, 2015 at 11:33 AM, Shixiong Zhu zsxw...@gmail.com wrote: The history server may need several

Re: Map one RDD into two RDD

2015-05-07 Thread Gerard Maas
Hi Bill, I just find it weird that one would use parallel threads to 'filter', as filter is lazy in Spark, and multithreading wouldn't have any effect unless the action triggering the execution of the lineage containing such a filter is executed on a separate thread. One must have very specific

AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-07 Thread in4maniac
Hi Guys, I think this problem is related to : http://apache-spark-user-list.1001560.n3.nabble.com/AWS-Credentials-for-private-S3-reads-td8689.html I am running pyspark 1.2.1 in AWS with my AWS credentials exported to master node as Environmental Variables. Halfway through my application, I

Re: history server

2015-05-07 Thread Marcelo Vanzin
That shouldn't be true in 1.3 (see SPARK-5522). On Thu, May 7, 2015 at 11:33 AM, Shixiong Zhu zsxw...@gmail.com wrote: The history server may need several hours to start if you have a lot of event logs. Is it stuck, or still replaying logs? Best Regards, Shixiong Zhu 2015-05-07 11:03

Re: history server

2015-05-07 Thread Koert Kuipers
seems i got one thread spinning 100% for a while now, in FsHistoryProvider.initialize(). maybe something wrong with my logs on hdfs that its reading? or could it simply really take 30 mins to read all the history on hdfs? jstack: Deadlock Detection: No deadlocks found. Thread 2272: (state =

Re: Spark unit test fails

2015-05-07 Thread NoWisdom
I'm also getting the same error. Any ideas? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-unit-test-fails-tp22368p22798.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: history server

2015-05-07 Thread Koert Kuipers
got it. thanks! On Thu, May 7, 2015 at 2:52 PM, Marcelo Vanzin van...@cloudera.com wrote: Ah, sorry, that's definitely what Shixiong mentioned. The patch I mentioned did not make it into 1.3... On Thu, May 7, 2015 at 11:48 AM, Koert Kuipers ko...@tresata.com wrote: seems i got one thread

Re: history server

2015-05-07 Thread Ankur Chauhan
Hi, Sorry this may be a little off topic but I tried searching for docs on history server but couldn't really find much. Can someone point me to a doc or give me a point of reference for the use and intent of a history server? -- Ankur On 7 May 2015, at 12:06, Koert Kuipers

Blocking DStream.forEachRDD()

2015-05-07 Thread Corey Nolet
Is this something I can do? I am using a FileOutputFormat inside of the foreachRDD call. After the output format runs, I want to do some directory cleanup and I want to block while I'm doing that. Is that something I can do inside of this function? If not, where would I accomplish this on every

Predict.scala using model for clustering In reference

2015-05-07 Thread anshu shukla
Can anyone please explain - println(Initalizaing the the KMeans model...) val model = new KMeansModel(ssc.sparkContext.objectFile[Vector](modelFile.toString).collect()) where modelFile is the *directory to persist the model while training* REF-

Re: history server

2015-05-07 Thread Marcelo Vanzin
Ah, sorry, that's definitely what Shixiong mentioned. The patch I mentioned did not make it into 1.3... On Thu, May 7, 2015 at 11:48 AM, Koert Kuipers ko...@tresata.com wrote: seems i got one thread spinning 100% for a while now, in FsHistoryProvider.initialize(). maybe something wrong with my

Re: Spark Job triggers second attempt

2015-05-07 Thread Doug Balog
I bet you are running on YARN in cluster mode. If you are running on yarn in client mode, .set("spark.yarn.maxAppAttempts", "1") works as you expect, because YARN doesn't start your app on the cluster until you call SparkContext(). But if you are running on yarn in cluster mode, the driver
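
A minimal sketch of the distinction described above. In yarn-client mode the setting can be applied in code, because the application is not submitted to YARN until the SparkContext is constructed (the app name is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")                       // hypothetical app name
      .set("spark.yarn.maxAppAttempts", "1")      // allow only one application attempt
    val sc = new SparkContext(conf)

In yarn-cluster mode the driver conf is read only after YARN has already launched the attempt, so the property has to be supplied at submission time instead, e.g. via --conf spark.yarn.maxAppAttempts=1 on the spark-submit command line or in spark-defaults.conf.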

Re: history server

2015-05-07 Thread Marcelo Vanzin
Can you get a jstack for the process? Maybe it's stuck somewhere. On Thu, May 7, 2015 at 11:00 AM, Koert Kuipers ko...@tresata.com wrote: i am trying to launch the spark 1.3.1 history server on a secure cluster. i can see in the logs that it successfully logs into kerberos, and it is

Re: Map one RDD into two RDD

2015-05-07 Thread anshu shukla
One of the best discussions in the mailing list :-) ...Please help me in concluding -- The whole discussion concludes that - 1- The framework does not support increasing the parallelism of any task just by an inbuilt function. 2- Users have to manually write logic to filter the output of the upstream node

Virtualenv pyspark

2015-05-07 Thread alemagnani
I am currently using pyspark with a virtualenv. Unfortunately I don't have access to the nodes file system and therefore I cannot manually copy the virtual env over there. I have been using this technique: I first add a tar ball with the venv sc.addFile(virtual_env_tarball_file) Then in

Re: RE: Re: Re: RE: Re: Re:_sparksql_running_slow_while_joining_2_tables.

2015-05-07 Thread luohui20001
I checked the data again, no skewed data in it, it is just txt files with several string and int fields. that's it. I also followed the suggestions in the tuning guide page, refer to http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning I will keep on inspecting why those left

Re: Stop Cluster Mode Running App

2015-05-07 Thread Silvio Fiorito
Hi James, If you’re on Spark 1.3 you can use the kill command in spark-submit to shut it down. You’ll need the driver id from the Spark UI or from when you submitted the app. spark-submit --master spark://master:7077 --kill driver-id Thanks, Silvio From: James King Date: Wednesday, May 6,

RE: Spark does not delete temporary directories

2015-05-07 Thread Taeyun Kim
Thanks, but it seems that the option is for Spark standalone mode only. I’ve (lightly) tested the options with local mode and yarn-client mode, the ‘temp’ directories were not deleted. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Thursday, May 07, 2015 10:47 PM To: Todd Nist Cc: Taeyun

Re: Predict.scala using model for clustering In reference

2015-05-07 Thread Joseph Bradley
A KMeansModel was trained in the previous step, and it was saved to modelFile as a Java object file. This step is loading the model back and reconstructing the KMeansModel, which can then be used to classify new tweets into different clusters. Joseph On Thu, May 7, 2015 at 12:40 PM, anshu shukla
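
A minimal Scala sketch of that load-and-predict step, reusing ssc and modelFile from the quoted snippet; the new feature vector is hypothetical:

    import org.apache.spark.mllib.clustering.KMeansModel
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Load the centroids saved during training and rebuild the model
    val centers = ssc.sparkContext.objectFile[Vector](modelFile.toString).collect()
    val model = new KMeansModel(centers)

    // Assign a new point to its nearest cluster
    val clusterId = model.predict(Vectors.dense(0.1, 0.4, 0.2))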

Getting data into Spark Streaming

2015-05-07 Thread Sathaye
Hi I am pretty new to spark and I am trying to implement a simple spark streaming application using Meetup's RSVP stream: stream.meetup.com/2/rsvps Any idea how to connect the stream to Spark Streaming? I am trying out rawSocketStream but not sure what the parameters are (viz. port) Thank you

Duplicate entries in output of mllib column similarities

2015-05-07 Thread rbolkey
Hi, I have a question regarding one of the oddities we encountered while running mllib's column similarities operation. When we examine the output, we find duplicate matrix entries (the same i,j). Sometimes the entries have the same value/similarity score, but they're frequently different too.

Re: Partition Case Class RDD without ParRDDFunctions

2015-05-07 Thread Night Wolf
MyClass is a basic Scala case class (using Spark 1.3.1): case class Result(crn: Long, pid: Int, promoWk: Int, windowKey: Int, ipi: Double) { override def hashCode(): Int = crn.hashCode() } On Wed, May 6, 2015 at 8:09 PM, ayan guha guha.a...@gmail.com wrote: How does your MyClass look like?

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread yana
I believe this is a regression. Does not work for me either. There is a Jira on parquet wildcards which is resolved, I'll see about getting it reopened. Sent on the new Sprint Network from my Samsung Galaxy S®4. Original message From: Vaxuki vax...@gmail.com

update resource when running spark

2015-05-07 Thread Hoai-Thu Vuong
Hi all, I use a function to create or return context for spark application, in this function I load some resources from text file to a list. My question is how to update a list?

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Vaxuki
Olivier, nope. Wildcard extensions don't work. I am debugging the code to figure out what's wrong. I know I am using 1.3.1 for sure. Pardon typos... On May 7, 2015, at 7:06 AM, Olivier Girardot ssab...@gmail.com wrote: hdfs://some ip:8029/dataset/*/*.parquet doesn't work for you ? Le jeu. 7

Re: How can I force operations to complete and spool to disk

2015-05-07 Thread Steve Lewis
I give the executor 14gb and would like to cut it. I expect the critical operations to run hundreds of millions of times which is why we run on a cluster. I will try DISK_ONLY_SER Thanks Steven Lewis sent from my phone On May 7, 2015 10:59 AM, ayan guha guha.a...@gmail.com wrote: 2*2 cents 1.

Re: Parquet number of partitions

2015-05-07 Thread Eric Eijkelenboom
Funny enough, I observe different behaviour on EC2 vs EMR (Spark on EMR installed with https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark). Both with Spark 1.3.1/Hadoop 2. Reading a folder with 12 Parquet gives

Re: Duplicate entries in output of mllib column similarities

2015-05-07 Thread Reza Zadeh
This shouldn't be happening, do you have an example to reproduce it? On Thu, May 7, 2015 at 4:17 PM, rbolkey rbol...@gmail.com wrote: Hi, I have a question regarding one of the oddities we encountered while running mllib's column similarities operation. When we examine the output, we find

Is it possible to set the akka specify properties (akka.extensions) in spark

2015-05-07 Thread Terry Hole
Hi all, I'd like to monitor akka using kamon, which needs the akka.extensions setting to be a list like this in Typesafe config format: akka { extensions = [kamon.system.SystemMetrics, kamon.statsd.StatsD] } But I cannot find a way to do this, I have tried these: 1.

Re: SparkSQL issue: Spark 1.3.1 + hadoop 2.6 on CDH5.3 with parquet

2015-05-07 Thread felicia
Hi all, Thanks for the help on this case! we finally settle this by adding a jar named: parquet-hive-bundle-1.5.0.jar when submitting jobs through spark-submit, where this jar file does not exist in our CDH5.3 anyway (we've downloaded it from

Re: SparkSQL issue: Spark 1.3.1 + hadoop 2.6 on CDH5.3 with parquet

2015-05-07 Thread Marcelo Vanzin
On Thu, May 7, 2015 at 7:39 PM, felicia shsh...@tsmc.com wrote: we tried to add /usr/lib/parquet/lib /usr/lib/parquet to SPARK_CLASSPATH and it doesn't seems to work, To add the jars to the classpath you need to use /usr/lib/parquet/lib/*, otherwise you're just adding the directory (and not

Re: Spark does not delete temporary directories

2015-05-07 Thread Sean Owen
You're referring to a comment in the generic utility method, not the specific calls to it. The comment just says that the generic method doesn't mark the directory for deletion. Individual uses of it might need to. One or more of these might be delete-able on exit, but in any event it's just a

Discretization

2015-05-07 Thread spark_user_2015
The Spark documentation shows the following example code: // Discretize data in 16 equal bins since ChiSqSelector requires categorical features val discretizedData = data.map { lp => LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 } ) ) } I'm sort of missing why x / 16
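
A hedged sketch of what the binning presumably intends, assuming data is an RDD[LabeledPoint] whose raw feature values lie roughly in [0, 255]: dividing by 16 only rescales the values unless the result is floored to an integer bin index, which may be exactly what the poster finds missing in the quoted snippet:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Map each raw feature value to one of 16 integer bin labels
    val discretizedData = data.map { lp =>
      LabeledPoint(lp.label,
        Vectors.dense(lp.features.toArray.map { x => math.floor(x / 16) }))
    }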

(no subject)

2015-05-07 Thread luohui20001
Hi guys, I got a PhoenixParserException: ERROR 601 (42P00): Syntax error. Encountered FORMAT at line 21, column 141. when creating a table by using ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'. As I remember, previous versions of Phoenix supported this grammar,

SparkSQL issue: Spark 1.3.1 + hadoop 2.6 on CDH5.3 with parquet

2015-05-07 Thread felicia
Hi all, I'm able to run SparkSQL through python/java and retrieve data from an ordinary table, but when trying to fetch data from a parquet table, the following error shows up, which is pretty straightforward, indicating that a parquet-related class was not found; we tried to add /usr/lib/parquet/lib

Possible long lineage issue when using DStream to update a normal RDD

2015-05-07 Thread yaochunnan
Hi all, Recently in our project, we need to update an RDD using data regularly received from a DStream, and I plan to use the foreachRDD API to achieve this: var MyRDD = ... dstream.foreachRDD { rdd => MyRDD = MyRDD.join(rdd)... ... } Is this usage correct? My concern is, as I am repeatedly and
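
A hedged sketch of one common way to keep such a loop from building an unbounded lineage: checkpoint and materialize the accumulated RDD every few batches. The key/value types, checkpoint path and batch interval below are hypothetical, not the poster's actual code:

    import org.apache.spark.rdd.RDD

    sc.setCheckpointDir("/tmp/spark-checkpoints")          // hypothetical location

    var stateRDD: RDD[(String, Long)] = sc.emptyRDD[(String, Long)]
    var batches = 0L

    dstream.foreachRDD { batchRDD =>
      stateRDD = stateRDD.union(batchRDD).reduceByKey(_ + _)
      batches += 1
      if (batches % 10 == 0) {     // every N batches, truncate the lineage
        stateRDD.checkpoint()
        stateRDD.count()           // materialize so the checkpoint is actually written
      }
    }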

RE: Spark does not delete temporary directories

2015-05-07 Thread Haopu Wang
I think the temporary folders are used to store blocks and shuffles. That doesn't depend on the cluster manager. Ideally they should be removed after the application has been terminated. Can you check if there are contents under those folders? From: Taeyun

Re: Re: Re: RE: Re: Re: RE: Re: Re:_sparksql_running_slow_while_joining_2_tables.

2015-05-07 Thread luohui20001
I tried a small spark.sql.shuffle.partitions = 16, so that every task will fetch a generally equal size of data; however every task still runs slow. Thanks & Best regards! San.Luo - Original Message - From: luohui20...@sina.com To: luohui20001 luohui20...@sina.com,

RE: YARN mode startup takes too long (10+ secs)

2015-05-07 Thread Taeyun Kim
I think I’ve found the (maybe partial, but major) reason. It’s between the following lines, (it’s newly captured, but essentially the same place that Zoltán Zvara picked: 15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager 15/05/08 11:36:38 INFO YarnClientSchedulerBackend:

saveAsTable fails on Python with Unresolved plan found

2015-05-07 Thread Judy Nash
Hello, I am following the tutorial code on the sql programming guide https://spark.apache.org/docs/1.2.1/sql-programming-guide.html#inferring-the-schema-using-reflection to try out Python on spark 1.2.1. The SaveAsTable function works in Scala but fails in Python with Unresolved plan found. Broken

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-05-07 Thread allonsy
Has there been any follow up on this topic? Here http://search-hadoop.com/m/q3RTtm4EtI1hoD8d2 there were suggestions that someone was going to publish some code, but no news since (TD himself looked pretty interested). Did anybody come up with something in the last months? -- View this

Re: Spark does not delete temporary directories

2015-05-07 Thread Ted Yu
Default value for spark.worker.cleanup.enabled is false: private val CLEANUP_ENABLED = conf.getBoolean("spark.worker.cleanup.enabled", false) I wonder if the default should be set as true. Cheers On Thu, May 7, 2015 at 6:19 AM, Todd Nist tsind...@gmail.com wrote: Have you tried to set the

Re: Spark does not delete temporary directories

2015-05-07 Thread Todd Nist
Have you tried to set the following? spark.worker.cleanup.enabled=true spark.worker.cleanup.appDataTtl="seconds" On Thu, May 7, 2015 at 2:39 AM, Taeyun Kim taeyun@innowireless.com wrote: Hi, After a spark program completes, there are 3 temporary directories remain in the temp
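
A hedged sketch of how these properties are typically supplied in standalone mode: they are read by the Worker daemon rather than by the application, so they usually go into conf/spark-env.sh on each worker (the TTL value below is hypothetical):

    # conf/spark-env.sh on each standalone worker
    export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
      -Dspark.worker.cleanup.appDataTtl=604800"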

Re: Sort (order by) of the big dataset

2015-05-07 Thread Ted Yu
Where can I find your blog ? Thanks On Apr 29, 2015, at 7:14 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: After day of debugging (actually, more), I can answer my question: The problem is that the default value 200 of “spark.sql.shuffle.partitions” is too small for sorting 2B

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread in4maniac
Hi V, I am assuming that each of the three .parquet paths you mentioned have multiple partitions in them. For eg: [/dataset/city=London/data.parquet/part-r-0.parquet, /dataset/city=London/data.parquet/part-r-1.parquet] I haven't personally used this with hdfs, but I've worked with a similar

Re: question about the TFIDF.

2015-05-07 Thread ai he
Hi Dan, In https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala, you can see spark uses Utils.nonNegativeMod(term.##, numFeatures) to locate a term. It's also mentioned in the doc that Maps a sequence of terms to their term frequencies
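
A minimal Scala sketch of what that hashing means in practice; the document and feature count below are hypothetical:

    import org.apache.spark.mllib.feature.HashingTF

    // Each term is placed at nonNegativeMod(term.##, numFeatures), i.e. its hashCode
    // modulo the vector size, so distinct terms can collide in the same bucket.
    val hashingTF = new HashingTF(numFeatures = 1 << 20)   // 2^20 buckets
    val doc = Seq("spark", "hashing", "tf", "spark")
    val tf = hashingTF.transform(doc)   // sparse vector of term frequencies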

Re: Blocking DStream.forEachRDD()

2015-05-07 Thread Corey Nolet
It does look like the function that's executed is in the driver, so doing an Await.result() on a thread AFTER I've executed an action should work. Just updating this here in case anyone has this question in the future. Is this something I can do? I am using a FileOutputFormat inside of the foreachRDD
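
A hedged sketch of the pattern described above, assuming dstream is the DStream in question, with a plain saveAsTextFile standing in for the FileOutputFormat-based save and a hypothetical output path; because the foreachRDD body runs on the driver, blocking in it simply delays the next batch's output:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    dstream.foreachRDD { rdd =>
      val outDir = s"/tmp/stream-out-${System.currentTimeMillis}"   // hypothetical path
      rdd.saveAsTextFile(outDir)
      val cleanup = Future {
        // hypothetical post-output directory cleanup would go here
      }
      Await.result(cleanup, 5.minutes)   // block until the cleanup finishes
    }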

RE: saveAsTable fails on Python with Unresolved plan found

2015-05-07 Thread Judy Nash
SPARK-4825 (https://issues.apache.org/jira/browse/SPARK-4825) looks like the right bug, but it should've been fixed in 1.2.1. Is a similar fix needed in Python? From: Judy Nash Sent: Thursday, May 7, 2015 7:26 AM To: user@spark.apache.org Subject: saveAsTable fails on Python with Unresolved plan

Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-07 Thread Richard Marscher
By default you would expect to find the log files for master and workers in the relative `logs` directory from the root of the Spark installation on each of the respective nodes in the cluster. On Thu, May 7, 2015 at 10:27 AM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: > Can

Re: Avro to Parquet ?

2015-05-07 Thread Jonathan Coveney
A helpful example of how to convert: http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/ As far as performance, that depends on your data. If you have a lot of columns and use all of them, parquet deserialization is expensive. If you have a column and only need a few
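
A minimal Scala sketch of the spark-avro data-source route mentioned elsewhere in this thread, assuming the com.databricks:spark-avro package is on the classpath and using the Spark 1.3-era DataFrame API; the paths are hypothetical:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Read the Avro files through the spark-avro data source, write them back as Parquet
    val avroDF = sqlContext.load("/data/events.avro", "com.databricks.spark.avro")
    avroDF.saveAsParquetFile("/data/events.parquet")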

Re: Map one RDD into two RDD

2015-05-07 Thread Bill Q
Thanks for the replies. We decided to use concurrency in Scala to do the two mappings using the same source RDD in parallel. So far, it seems to be working. Any comments? On Wednesday, May 6, 2015, Evo Eftimov evo.efti...@isecc.com wrote: RDD1 = RDD.filter() RDD2 = RDD.filter() *From:*

Re: branch-1.4 scala 2.11

2015-05-07 Thread Iulian Dragoș
There's an open PR to fix it: https://github.com/apache/spark/pull/5966 On Thu, May 7, 2015 at 6:07 PM, Koert Kuipers ko...@tresata.com wrote: i am having no luck using the 1.4 branch with scala 2.11 $ build/mvn -DskipTests -Pyarn -Dscala-2.11 -Pscala-2.11 clean package [error]

RE: saveAsTable fails on Python with Unresolved plan found

2015-05-07 Thread Judy Nash
Figured it out. It was because I was using HiveContext instead of SQLContext. FYI in case others saw the same issue. From: Judy Nash Sent: Thursday, May 7, 2015 7:38 AM To: 'user@spark.apache.org' Subject: RE: saveAsTable fails on Python with Unresolved plan found

branch-1.4 scala 2.11

2015-05-07 Thread Koert Kuipers
i am having no luck using the 1.4 branch with scala 2.11 $ build/mvn -DskipTests -Pyarn -Dscala-2.11 -Pscala-2.11 clean package [error] /home/koert/src/opensource/spark/core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala:78: in object RDDOperationScope, multiple overloaded

Re: Spark Job triggers second attempt

2015-05-07 Thread Richard Marscher
Hi, I think you may want to use this setting: spark.task.maxFailures (default: 4): Number of individual task failures before giving up on the job. Should be greater than or equal to 1. Number of allowed retries = this value - 1. On Thu, May 7, 2015 at 2:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: How
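
A minimal sketch of applying that setting (the app name and value are hypothetical); note that spark.task.maxFailures governs task retries within a single application attempt, which is a different knob from the spark.yarn.maxAppAttempts application-level retries discussed elsewhere in this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fail-fast-debug")           // hypothetical app name
      .set("spark.task.maxFailures", "1")      // fail the job on the first task failure
    val sc = new SparkContext(conf)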

Re: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-07 Thread Richard Marscher
I should also add I've recently seen this issue as well when using collect. I believe in my case it was related to heap space on the driver program not being able to handle the returned collection. On Thu, May 7, 2015 at 11:05 AM, Richard Marscher rmarsc...@localytics.com wrote: By default you

RandomSplit with Spark-ML and Dataframe

2015-05-07 Thread Olivier Girardot
Hi, is there any best practice to do like in MLLib a randomSplit of training/cross-validation set with dataframes and the pipeline API ? Regards Olivier.

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Yana Kadiyska
Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3928 Looks like for now you'd have to list the full paths...I don't see a comment from an official spark committer so still not sure if this is a bug or design, but it seems to be the current state of affairs. On Thu, May 7, 2015 at
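
A minimal Scala sketch of the listing-full-paths workaround, assuming the 1.3.x varargs signature of parquetFile; the namenode address and partition values are hypothetical:

    // Enumerate the partition directories explicitly instead of using a wildcard
    val cities = Seq("London", "NewYork")
    val paths  = cities.map(c => s"hdfs://namenode:8020/dataset/city=$c/data.parquet")
    val df     = sqlContext.parquetFile(paths: _*)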

RE: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0

2015-05-07 Thread Wang, Ningjun (LNG-NPV)
> Can you check your local and remote logs? Where are the log files? I see the following in my Driver program logs as well as the Spark UI failed task page: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0 of broadcast_2 Here is the detailed stack trace.

Avro to Parquet ?

2015-05-07 Thread ๏̯͡๏
1) What is the best way to convert data from Avro to Parquet so that it can be later read and processed ? 2) Will the performance of processing (join, reduceByKey) be better if both datasets are in Parquet format when compared to Avro + Sequence ? -- Deepak

Re: Map one RDD into two RDD

2015-05-07 Thread Gerard Maas
Hi Bill, Could you show a snippet of code to illustrate your choice? -Gerard. On Thu, May 7, 2015 at 5:55 PM, Bill Q bill.q@gmail.com wrote: Thanks for the replies. We decided to use concurrency in Scala to do the two mappings using the same source RDD in parallel. So far, it seems to be

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-05-07 Thread Ravi Mody
After thinking about it more, I do think weighting lambda by sum_i cij is the equivalent of the ALS-WR paper's approach for the implicit case. This provides scale-invariance for varying products/users and for varying ratings, and should behave well for all alphas. What do you guys think? On Wed,

RE: Map one RDD into two RDD

2015-05-07 Thread Evo Eftimov
Scala is a language, Spark is an OO/Functional, Distributed Framework facilitating Parallel Programming in a distributed environment Any “Scala parallelism” occurs within the Parallel Model imposed by the Spark OO Framework – ie it is limited in terms of what it can achieve in terms of

RE: Sort (order by) of the big dataset

2015-05-07 Thread Ulanov, Alexander
The answer for Spark SQL “order by” is setting spark.sql.shuffle.partitions to a bigger number. For RDD.sortBy it works out of the box if the RDD has enough partitions. From: Night Wolf [mailto:nightwolf...@gmail.com] Sent: Thursday, May 07, 2015 5:26 AM To: Ulanov, Alexander Cc:
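
A minimal sketch of both suggestions (the partition count, table and column names are hypothetical; sortBy assumes the element type has an Ordering):

    // Spark SQL: raise the shuffle partition count before running the ORDER BY
    sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
    val sorted = sqlContext.sql("SELECT * FROM events ORDER BY ts")

    // Plain RDD: give sortBy an explicit, large enough number of partitions
    val sortedRdd = rdd.sortBy(x => x, ascending = true, numPartitions = 2000)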

RE: Sort (order by) of the big dataset

2015-05-07 Thread Ulanov, Alexander
avulanov.blogspot.com, though it does not have more on this particular issue than I already posted. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Thursday, May 07, 2015 6:25 AM To: Ulanov, Alexander Cc: user@spark.apache.org Subject: Re: Sort (order by) of the big dataset Where can I find

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Olivier Girardot
hdfs://some ip:8029/dataset/*/*.parquet doesn't work for you ? Le jeu. 7 mai 2015 à 03:32, vasuki vax...@gmail.com a écrit : Spark 1.3.1 - i have a parquet file on hdfs partitioned by some string looking like this /dataset/city=London/data.parquet /dataset/city=NewYork/data.parquet

Not maximum CPU usage

2015-05-07 Thread Krever
I have a small application configured to use 6 cpu cores and I run it on a standalone cluster. Such a configuration means that only 6 tasks can be active at one moment, and if all of them are waiting (on IO for example) then the CPU is not fully used. My questions: 1. Is it true that the number of active tasks per