Re: FPGrowth does not handle large result sets

2016-01-12 Thread Ritu Raj Tiwari
I have been giving it 8-12G -Raj Sent from my iPhone > On Jan 12, 2016, at 6:50 PM, Sabarish Sasidharan > wrote: > > How much RAM are you giving to the driver? 17000 items being collected > shouldn't fail unless your driver memory is too low. > > Regards >

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
Hi Richard, > Would it be possible to access the session API from within ROSE, > to get for example the images that are generated by R / openCPU Technically it would be possible although there would be some potentially significant runtime costs per task in doing so, primarily those related to

Re: PCA OutOfMemoryError

2016-01-12 Thread Bharath Ravi Kumar
Any suggestion/opinion? On 12-Jan-2016 2:06 pm, "Bharath Ravi Kumar" wrote: > We're running PCA (selecting 100 principal components) on a dataset that > has ~29K columns and is 70G in size stored in ~600 parts on HDFS. The > matrix in question is mostly sparse with tens of

Re: failure to parallelize an RDD

2016-01-12 Thread Ted Yu
Which release of Spark are you using? Can you turn on DEBUG logging to see if there is more of a clue? Thanks On Tue, Jan 12, 2016 at 6:37 PM, AlexG wrote: > I transpose a matrix (colChunkOfA) stored as a 200-by-54843210 as an array > of > rows in Array[Array[Float]] format

RE: spark job failure - akka error Association with remote system has failed

2016-01-12 Thread vivek.meghanathan
I have used master_ip as an IP address and the spark conf also has the IP address. But the following logs show the hostname. (The Spark UI shows master details by IP.) 16/01/13 12:31:38 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@masternode1:36537] has failed,

RE: Dedup

2016-01-12 Thread gpmacalalad
sowen wrote > Arrays are not immutable and do not have the equals semantics you want to > use them as a key. Use a Scala immutable List. > On Oct 9, 2014 12:32 PM, "Ge, Yao (Y.)" > yge@ > wrote: > >> Yes. I was using String array as arguments in the reduceByKey. I think >> String array is

failure to parallelize an RDD

2016-01-12 Thread AlexG
I transpose a matrix (colChunkOfA), stored as a 200-by-54843210 array of rows in Array[Array[Float]] format, into another matrix (rowChunk) also stored row-wise as a 54843210-by-200 Array[Array[Float]], using the following code: val rowChunk = new Array[Tuple2[Int,Array[Float]]](numCols)

hiveContext.sql() - Join query fails silently

2016-01-12 Thread Jins George
Hi All, I have used a hiveContext.sql() to join a temporary table created from Dataframe and parquet tables created in Hive. The join query runs fine for few hours and then suddenly fails to do the Join. Once the issue happens the dataframe returned from hiveContext.sql() is empty. If I

ml.classification.NaiveBayesModel how to reshape theta

2016-01-12 Thread Andy Davidson
I am trying to debug my trained model by exploring theta. Theta is a Matrix. The java doc for Matrix says that it is column major format. I have trained a NaiveBayesModel. Is the number of classes == to the number of rows? int numRows = nbModel.numClasses(); int numColumns =
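A minimal sketch (in Scala, assuming the 1.6 ml.classification.NaiveBayesModel, where theta has one row per class and one column per feature) of walking the matrix without worrying about its column-major storage:

  val theta = nbModel.theta            // org.apache.spark.mllib.linalg.Matrix
  val numClasses = theta.numRows       // rows correspond to classes
  val numFeatures = theta.numCols      // columns correspond to features
  // Matrix.apply(i, j) resolves the column-major layout internally
  for (c <- 0 until numClasses; f <- 0 until numFeatures) {
    val logProb = theta(c, f)          // log of the per-class feature parameter
    // inspect logProb here
  }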

Re:

2016-01-12 Thread Sabarish Sasidharan
You could generate as many duplicates with a tag/sequence. And then use a custom partitioner that uses that tag/sequence in addition to the key to do the partitioning. Regards Sab On 12-Jan-2016 12:21 am, "Daniel Imberman" wrote: > Hi all, > > I'm looking for a way to
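A hedged sketch of the tag-plus-custom-partitioner idea; TaggedPartitioner and the replication factor of 3 are illustrative, not from the thread.

  import org.apache.spark.Partitioner

  // key is (originalKey, tag); partition on both so the copies spread out
  class TaggedPartitioner(override val numPartitions: Int) extends Partitioner {
    def getPartition(key: Any): Int = key match {
      case (k, tag: Int) =>
        val h = (k.hashCode * 31 + tag) % numPartitions
        if (h < 0) h + numPartitions else h
    }
  }

  val replicated = rdd.flatMap { case (k, v) =>
    (0 until 3).map(tag => ((k, tag), v))               // 3 tagged copies per record
  }
  val spread = replicated.partitionBy(new TaggedPartitioner(100))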

spark job failure - akka error Association with remote system has failed

2016-01-12 Thread vivek.meghanathan
Hi All, I am running spark 1.3.0 standalone cluster mode, we have rebooted the cluster servers (system reboot). After that the spark jobs are failing by showing following error (it fails within 7-8 seconds). 2 of the jobs are running fine. All the jobs used to be stable before the system

Serializing DataSets

2016-01-12 Thread Simon Hafner
What's the proper way to write DataSets to disk? Convert them to a DataFrame and use the writers there? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
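That is indeed the usual route in 1.6, since the Dataset API does not yet carry its own writer; a minimal sketch (MyCaseClass and the path are placeholders):

  case class MyCaseClass(id: Long, name: String)       // placeholder schema
  import sqlContext.implicits._                        // encoders for the case class

  val ds = Seq(MyCaseClass(1L, "a"), MyCaseClass(2L, "b")).toDS()
  ds.toDF().write.parquet("/tmp/my-dataset")           // write through the DataFrame writer
  val restored = sqlContext.read.parquet("/tmp/my-dataset").as[MyCaseClass]   // read back typed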

Re: failure to parallelize an RDD

2016-01-12 Thread Alex Gittens
I'm using Spark 1.5.1. When I turned on DEBUG, I didn't see anything that looked useful. Other than the INFO outputs, there are a ton of RPC-message-related logs, and this bit: 16/01/13 05:53:43 DEBUG ClosureCleaner: +++ Cleaning closure (org.apache.spark.rdd.RDD$$anonfun$count$1) +++ 16/01/13

Re: JMXSink for YARN deployment

2016-01-12 Thread Kyle Lin
Hello there I run both driver and master on the same node, so I got a "Port already in use" exception. Is there any solution to set a different port for each component? Kyle 2015-12-05 5:54 GMT+08:00 spearson23 : > Run "spark-submit --help" to see all available options. >

Re: sparkR ORC support.

2016-01-12 Thread Sandeep Khurana
The code is very simple, pasted below. hive-site.xml is in the spark conf already. I still see this error after running the script below: Error in writeJobj(con, object) : invalid jobj 3 script === Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")

Re: sparkR ORC support.

2016-01-12 Thread Sandeep Khurana
The complete stacktrace is below. Can it be something with java versions? stop("invalid jobj ", value$id) 8 writeJobj(con, object) 7 writeObject(con, a) 6 writeArgs(rc, args) 5 invokeJava(isStatic = TRUE, className, methodName, ...) 4 callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,

model deployment in spark

2016-01-12 Thread Chandan Verma
Does spark support model deployment through a web/REST API? Is there any other method for deployment of a predictive model in spark apart from PMML? Regards, Chandan

Re: JMXSink for YARN deployment

2016-01-12 Thread Kyle Lin
Hello guys I got a solution. I set -Dcom.sun.management.jmxremote.port=0 to let the system assign an unused port. Kyle 2016-01-12 16:54 GMT+08:00 Kyle Lin : > Hello there > > > I run both driver and master on the same node, so I got "Port already in > use" exception. > > Is

Using lz4 in Kafka seems to be broken by jpountz dependency upgrade in Spark 1.5.x+

2016-01-12 Thread Stefan Schadwinkel
Hi all, we'd like to upgrade one of our Spark jobs from 1.4.1 to 1.5.2 (we run Spark on Amazon EMR). The job consumes and pushes lz4 compressed data from/to Kafka. When upgrading to 1.5.2 everything works fine, except we get the following exception: java.lang.NoSuchMethodError:

Re: Fwd: how to submit multiple jar files when using spark-submit script in shell?

2016-01-12 Thread Jim Lohse
Thanks for your answer, you are correct, it's just a different approach than the one I am asking for :) Building an uber- or assembly- jar goes against the idea of placing the jars on all workers. Uber-jars increase network traffic, using local:/ in the classpath reduces network traffic.

Re: [KafkaRDD]: rdd.cache() does not seem to work

2016-01-12 Thread Понькин Алексей
Hi Charles, I have created a very simplified job - https://github.com/ponkin/KafkaSnapshot - to illustrate the problem. https://github.com/ponkin/KafkaSnapshot/blob/master/src/main/scala/ru/ponkin/KafkaSnapshot.scala In short - maybe the persist method is working, but not as I expected. I thought

Re: How to find cause(waiting threads etc) of hanging job for 7 hours?

2016-01-12 Thread Umesh Kacha
Hi Prabhu, thanks for the response. I did the same; the problem is that when I get the process id using jps or ps -ef, I don't get the user in the very first column. I see a number in place of the user name, so I can't run jstack on it because of a permission issue. It gives something like the following: 728852 3553 9833

PCA OutOfMemoryError

2016-01-12 Thread Bharath Ravi Kumar
We're running PCA (selecting 100 principal components) on a dataset that has ~29K columns and is 70G in size, stored in ~600 parts on HDFS. The matrix in question is mostly sparse, with tens of columns populated in most rows, but a few rows with thousands of columns populated. We're running spark on
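For context, a sketch of the MLlib call in question; computePrincipalComponents builds the full column-by-column covariance matrix locally on the driver, so with ~29K columns that is roughly a 29K x 29K dense matrix (several GB) no matter how sparse the rows are:

  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  // rows: RDD[Vector] built from the parsed input; sparse vectors keep executor
  // memory down, but the covariance matrix is still dense on the driver
  val mat = new RowMatrix(rows)
  val pcs = mat.computePrincipalComponents(100)   // local n x 100 Matrix on the driver
  val projected = mat.multiply(pcs)               // data projected onto the 100 components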

Does spark restart the executors if its nodemanager crashes?

2016-01-12 Thread Bing Jiang
hi, guys. We have set up dynamic resource allocation on spark-yarn. Now we use spark 1.5. One executor tries to fetch data from another nodemanager's shuffle service, and that nodemanager crashes, which makes the executor stay stuck in that state until the crashed nodemanager has been launched again.

RE: sparkR ORC support.

2016-01-12 Thread Felix Cheung
It looks like you have overwritten sc. Could you try this: Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client") .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) library(SparkR) sc <- sparkR.init() hivecontext <- sparkRHive.init(sc) df <- loadDF(hivecontext,

Re: sparkR ORC support.

2016-01-12 Thread Sandeep Khurana
Running this gave 16/01/12 04:06:54 INFO BlockManagerMaster: Registered BlockManager Error in writeJobj(con, object) : invalid jobj 3 How does it know which hive schema to connect to? On Tue, Jan 12, 2016 at 2:34 PM, Felix Cheung wrote: > It looks like you have

Lost tasks due to OutOfMemoryError (GC overhead limit exceeded)

2016-01-12 Thread Barak Yaish
Hello, I've a 5 node cluster which hosts both hdfs datanodes and spark workers. Each node has 8 cpus and 16G memory. Spark version is 1.5.2, spark-env.sh is as follows: export SPARK_MASTER_IP=10.52.39.92 export SPARK_WORKER_INSTANCES=4 export SPARK_WORKER_CORES=8 export SPARK_WORKER_MEMORY=4g

Re: Job History Logs for spark jobs submitted on YARN

2016-01-12 Thread Arkadiusz Bicz
Hi, You can checkout http://spark.apache.org/docs/latest/monitoring.html, you can monitor hdfs, memory usage per job and executor and driver. I have connected it to Graphite for storage and Grafana for visualization. I have also connected to collectd which provides me all server nodes metrics

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Jörn Franke
Ignite can also cache RDDs. > On 12 Jan 2016, at 13:06, Dmitry Goldenberg wrote: > > Jorn, you said Ignite or ... ? What was the second choice you were thinking > of? It seems that got omitted. > >> On Jan 12, 2016, at 2:44 AM, Jörn Franke

Use TCP client for id lookup

2016-01-12 Thread Kristoffer Sjögren
Hi I'm trying to understand how to lookup certain id fields of RDDs to an external mapping table. The table is accessed through a two-way binary tcp client where an id is provided and entry returned. Entries cannot be listed/scanned. What's the simplest way of managing the tcp client and its
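A minimal sketch of the pattern usually suggested for this: create one client per partition inside mapPartitions, reuse it for every id, and close it after the partition's ids are processed. LookupClient, host and port are hypothetical stand-ins for the binary TCP client described above.

  val resolved = idRdd.mapPartitions { ids =>
    val client = new LookupClient(host, port)               // hypothetical client, one per partition
    // materialize before closing so the connection isn't used after close()
    val looked = ids.map(id => (id, client.lookup(id))).toList
    client.close()
    looked.iterator
  }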

Re: Mllib Word2Vec vector representations are very high in value

2016-01-12 Thread Nick Pentreath
The similarities returned are not in fact true cosine similarities as they are not properly normalized - this will be fixed in this PR: https://github.com/apache/spark/pull/10152 On Tue, Dec 15, 2015 at 2:54 AM, jxieeducation wrote: > Hi, > > For Word2Vec in Mllib, when

Re: sparkR ORC support.

2016-01-12 Thread Sandeep Khurana
It worked for some time. Then I did sparkR.stop() and re-ran again, getting the same error. Any idea why it ran fine before? (While it was running fine it kept giving a warning about reusing the existing spark-context and that I should restart.) There is one more R script which instantiated spark; I ran that again too.

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Muthu Jayakumar
Thanks Michael. Let me test it with a recent master code branch. Also, for every mapping step do I have to create a new case class? I cannot use Tuple as I have ~130 columns to process. Earlier I had used a Seq[Any] (actually Array[Any], to optimize serialization) but processed it using RDD

Re: How to view the RDD data based on Partition

2016-01-12 Thread Prem Sure
try mapPartitionsWithIndex .. below is an example I used earlier. myfunc logic can be further modified as per your need. val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3) def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = { iter.toList.map(x => index + "," + x).iterator }
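For completeness, a hedged sketch of how that example is typically finished off, and what it returns with three evenly filled partitions:

  val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
  def myfunc(index: Int, iter: Iterator[Int]): Iterator[String] = {
    iter.toList.map(x => index + "," + x).iterator      // prefix each value with its partition index
  }
  x.mapPartitionsWithIndex(myfunc).collect()
  // Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9)  -- each element is a String like "0,1"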

Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Alex Nastetsky
Ran into this need myself. Does Spark have an equivalent of "mapreduce.input.fileinputformat.list-status.num-threads"? Thanks. On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park wrote: > Hi, > > I am wondering if anyone has successfully enabled >

How to view the RDD data based on Partition

2016-01-12 Thread Gokula Krishnan D
Hello All - I'm just trying to understand aggregate() and in the meantime got a question. Is there any way to view the RDD data based on the partition? For instance, the following RDD has 2 partitions: val multi2s = List(2,4,6,8,10,12,14,16,18,20) val multi2s_RDD =
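One hedged way to see exactly that is glom(), which pulls each partition back as an array (fine for a small example like this one):

  val multi2s_RDD = sc.parallelize(List(2,4,6,8,10,12,14,16,18,20), 2)
  multi2s_RDD.glom().collect().zipWithIndex.foreach { case (part, i) =>
    println(s"partition $i: ${part.mkString(",")}")
  }
  // partition 0: 2,4,6,8,10
  // partition 1: 12,14,16,18,20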

Big data job only finishes with Legacy memory management

2016-01-12 Thread Saif.A.Ellafi
Hello, I am tinkering with Spark 1.6. I have this 1.5 billion row dataset, to which I apply several window functions such as lag, first, etc. The job is quite expensive; I am running a small cluster with executors running with 70GB of ram. Using the new memory management system, the job fails around

Eigenvalue solver

2016-01-12 Thread Lydia Ickler
Hi, I wanted to know if there are any implementations yet within the Machine Learning Library or generally that can efficiently solve eigenvalue problems? Or if not do you have suggestions on how to approach a parallel execution maybe with BLAS or Breeze? Thanks in advance! Lydia Von meinem

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Michael Armbrust
> > df1.as[TestCaseClass].map(_.toMyMap).show() //fails > > This looks like a bug. What is the error? It might be fixed in branch-1.6/master if you can test there. > Please advise on what I may be missing here? > > > Also for join, may I suggest having a custom encoder / transformation to >

Re: Lost tasks due to OutOfMemoryError (GC overhead limit exceeded)

2016-01-12 Thread Muthu Jayakumar
>export SPARK_WORKER_MEMORY=4g Maybe you could increase the max heap size on the worker? In case the OutOfMemory is for the driver, you may want to set it up explicitly for the driver. Thanks, On Tue, Jan 12, 2016 at 2:04 AM, Barak Yaish wrote: > Hello, > >

Re: Regarding sliding window example from Databricks for DStream

2016-01-12 Thread Cassa L
Any thoughts on this? I want to know when the window duration is complete, not the sliding window. Is there a way I can catch the end of the window duration, or do I need to keep track of it, and how? LCassa On Mon, Jan 11, 2016 at 3:09 PM, Cassa L wrote: > Hi, > I'm trying to

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Vijay Kiran
I think it would be this: https://github.com/onetapbeyond/opencpu-spark-executor > On 12 Jan 2016, at 18:32, Corey Nolet wrote: > > David, > > Thank you very much for announcing this! It looks like it could be very > useful. Would you mind providing a link to the github? >

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Muthu Jayakumar
I tried to rerun the same code with current snapshot version of 1.6 and 2.0 from https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/ But I still see an exception around the same line. Here is the exception below. Filed an issue against the same

rdd join very slow when rdd created from data frame

2016-01-12 Thread Koert Kuipers
we are having a join of 2 rdds that's fast (< 1 min), and suddenly it wouldn't even finish overnight anymore. the change was that the rdd was now derived from a dataframe. so the new code that runs forever is something like this: dataframe.rdd.map(row => (Row(row(0)), row)).join(...) any idea
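One thing worth sketching as an experiment (not a confirmed fix): dataframe.rdd re-evaluates the underlying query whenever the RDD is recomputed, and Row-wrapped keys add comparison and serialization overhead, so extracting plain values and persisting the keyed RDD once can help isolate where the time goes. The two string columns are taken from the later message in this thread; otherRdd is a placeholder for the other side of the join.

  import org.apache.spark.storage.StorageLevel

  val keyed = dataframe.rdd
    .map(row => (row.getString(0), row.getString(1)))    // assumes two string columns
    .persist(StorageLevel.MEMORY_AND_DISK)
  keyed.count()                                          // materialize once before joining
  val joined = keyed.join(otherRdd)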

Re: Spark 1.6 udf/udaf alternatives in dataset?

2016-01-12 Thread Michael Armbrust
Awesome, thanks for opening the JIRA! We'll take a look. On Tue, Jan 12, 2016 at 1:53 PM, Muthu Jayakumar wrote: > I tried to rerun the same code with current snapshot version of 1.6 and > 2.0 from >

Maintain a state till the end of the application

2016-01-12 Thread turing.us
Hi! I have some application (skeleton): val sc = new SparkContext($SOME_CONF) val input = sc.textFile(inputFile) val result = input.map(record => { val myState = new MyState() /* state */ }) .filter($SOME_FILTER) .sortBy($SOME_SORT) .partitionBy(new HashPartitioner(100))

Job History Logs for spark jobs submitted on YARN

2016-01-12 Thread laxmanvemula
I observe that YARN jobs history logs are created in /user/history/done (*.jhist files) for all the mapreduce jobs like hive, pig etc. But for spark jobs submitted in yarn-cluster mode, the logs are not being created. I would like to see resource utilization by spark jobs. Is there any other

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
I'd guess that if the resources are broadcast Spark would put them into Tachyon... > On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg > wrote: > > Would it make sense to load them into Tachyon and read and broadcast them > from there since Tachyon is already a part of

Re: Unshaded google guava classes in spark-network-common jar

2016-01-12 Thread Sean Owen
No, this is on purpose. Have a look at the build POM. A few Guava classes were used in the public API for Java and have had to stay unshaded. In 2.x / master this is already changed such that no unshaded Guava classes should be included. On Tue, Jan 12, 2016, 07:28 Jake Yoon

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Jorn, you said Ignite or ... ? What was the second choice you were thinking of? It seems that got omitted. > On Jan 12, 2016, at 2:44 AM, Jörn Franke wrote: > > You can look at ignite as a HDFS cache or for storing rdds. > >> On 11 Jan 2016, at 21:14, Dmitry Goldenberg

Re: Read from AWS s3 with out having to hard-code sensitive keys

2016-01-12 Thread ayan guha
On EMR, you can add fs.* params in emrfs-site.xml. On Tue, Jan 12, 2016 at 7:27 AM, Jonathan Kelly wrote: > Yes, IAM roles are actually required now for EMR. If you use Spark on EMR > (vs. just EC2), you get S3 configuration for free (it goes by the name > EMRFS), and it

Re: pre-install 3-party Python package on spark cluster

2016-01-12 Thread ayan guha
2 cents: 1. You should use an environment management tool, such as ansible, puppet or chef, to handle this kind of use case (and a lot more, e.g. what if you want to add more nodes or to replace one bad node). 2. There are options such as --py-files to provide a zip file. On Tue, Jan 12, 2016 at 6:11

Top K Parallel FPGrowth Implementation

2016-01-12 Thread jcbarton
Hi, Are there any plans to implement the "Top K Parallel FPGrowth" algorithm in Spark, that aggregates the frequent items together (similar to the mahout implementation). Thanks John -- View this message in context:

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Would it make sense to load them into Tachyon and read and broadcast them from there since Tachyon is already a part of the Spark stack? If so I wonder if I could do that Tachyon read/write via a Spark API? > On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan >

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Thanks, Gene. Does Spark use Tachyon under the covers anyway for implementing its "cluster memory" support? It seems that the practice I hear the most about is the idea of loading resources as RDD's and then doing join's against them to achieve the lookup effect. The other approach would be to
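A commonly suggested alternative to the RDD-join lookup is a broadcast variable; a minimal sketch, assuming the resource fits comfortably in driver and executor memory (loadLookupTable is a hypothetical loader and records an RDD of keys):

  // load once on the driver, ship read-only copies to every executor
  val lookup: Map[String, String] = loadLookupTable()    // hypothetical loader
  val lookupBc = sc.broadcast(lookup)

  val enriched = records.map { key =>
    (key, lookupBc.value.getOrElse(key, "unknown"))      // per-record lookup, no shuffle
  }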

Re: [Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Michael Armbrust
There can be data loss when you are using the DirectOutputCommitter and speculation is turned on, so we disable it automatically. On Tue, Jan 12, 2016 at 1:11 PM, Jerry Lam wrote: > Hi spark users and developers, > > I wonder if the following observed behaviour is expected.

How to change the no of cores assigned for a Submitted Job

2016-01-12 Thread Ashish Soni
Hi, I see strange behavior when I create a standalone spark container using docker. Not sure why, but by default it is assigning 4 cores to the first job I submit, and then all the other jobs are in a wait state. Please suggest if there is a setting to change this; I tried --executor-cores 1 but
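For reference, a sketch of the standalone-mode setting that usually explains this: an application takes every free core unless spark.cores.max caps it, so later submissions sit waiting; --executor-cores alone does not bound the application's total.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("capped-job")
    .set("spark.cores.max", "1")   // total cores this application may claim from the cluster
  val sc = new SparkContext(conf)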

[Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Jerry Lam
Hi spark users and developers, I wonder if the following observed behaviour is expected. I'm writing a dataframe to parquet in s3. I'm using append mode when I'm writing to it. Since I'm using org.apache.spark.sql.parquet.DirectParquetOutputCommitter as the
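For context, a sketch of the two settings being discussed, using the Spark 1.5/1.6 property name and the committer class named in this thread (treat both as assumptions for other versions):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.speculation", "false")   // direct committers are unsafe with speculation
    .set("spark.sql.parquet.output.committer.class",
         "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")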

Re: How to view the RDD data based on Partition

2016-01-12 Thread Gokula Krishnan D
Hello Prem - Thanks for sharing; I also found a similar example at the link http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#aggregate but I am trying to understand the actual functionality or behavior. Thanks & Regards, Gokula Krishnan (Gokul) On Tue, Jan 12, 2016 at

Re: How to view the RDD data based on Partition

2016-01-12 Thread Prem Sure
I had explored these examples a couple of months back. It is a very good link for RDD operations. See if the explanation below helps; try to understand the difference between the 2 examples below. The initial value in both is "". Example 1: val z = sc.parallelize(List("12","23","","345"),2) z.aggregate("")((x,y) =>
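Completing that first example as a sketch; the per-partition result depends on how elements land in the two partitions, which is exactly the behavior being studied:

  val z = sc.parallelize(List("12","23","","345"), 2)
  z.aggregate("")(
    (x, y) => math.min(x.length, y.length).toString,   // seqOp: fold within a partition
    (x, y) => x + y                                     // combOp: concatenate partition results
  )
  // with ("12","23") and ("","345") as the two partitions this returns "11"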

Re: adding jars - hive on spark cdh 5.4.3

2016-01-12 Thread Ophir Etzion
btw, this issue happens only with classes needed for the inputFormat. if the input format is org.apache.hadoop.mapred.TextInputFormat and the serde is from an additional jar it works just fine. I don't want to upgrade cdh for this. also, if it should work on cdh5.5 why is that. what patch fixes

RE: Spark on Apache Ingnite?

2016-01-12 Thread Boavida, Rodrigo
I also had a quick look and agree it's not very clear. I believe if one reads through the clustering logic and the replication settings, one would get a good idea of how it works. https://apacheignite.readme.io/docs/cluster I believe it integrates with Hadoop and other file based systems for

ROSE: Spark + R on the JVM.

2016-01-12 Thread David
Hi all, I'd like to share news of the recent release of a new Spark package, [ROSE](http://spark-packages.org/package/onetapbeyond/opencpu-spark-executor). ROSE is a Scala library offering access to the full scientific computing power of the R programming language to Apache Spark batch and

Re: sparkR ORC support.

2016-01-12 Thread Sandeep Khurana
I call stop from the console as R studio warns and advises it. And yes, after stop was called the whole script was run again together. That means the init "hivecontext <- sparkRHive.init(sc)" is always called after stop. On Tue, Jan 12, 2016 at 8:31 PM, Felix Cheung wrote: >

Re: sparkR ORC support.

2016-01-12 Thread Felix Cheung
As you can see from my reply below from Jan 6, calling sparkR.stop() invalidates both sc and hivecontext you have and results in this invalid jobj error. If you start R and run this, it should work: Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")

Re: ibsnappyjava.so: failed to map segment from shared object

2016-01-12 Thread Mikey T.
Thanks! Setting java.io.tmpdir did the trick. Sadly, I still ran into an issue with the amount of RAM pyspark was grabbing. In fact I got a message from my web provider warning that I was exceeding the memory limit for my (entry level) account. So I won't be pursuing it farther. Oh well, it

Re: Too many tasks killed the scheduler

2016-01-12 Thread Daniel Siegmann
As I understand it, your initial number of partitions will always depend on the initial data. I'm not aware of any way to change this, other than changing the configuration of the underlying data store. Have you tried reading the data in several data frames (e.g. one data frame per day),
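A sketch of the read-per-day-then-union idea (the paths and partition layout are hypothetical, and unionAll is the Spark 1.x DataFrame union):

  val days = Seq("2016-01-10", "2016-01-11", "2016-01-12")
  val perDay = days.map(d => sqlContext.read.parquet(s"/data/events/dt=$d"))   // one frame per day
  val all = perDay.reduce(_ unionAll _)   // each day's frame keeps its own partitioning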

Re: rdd join very slow when rdd created from data frame

2016-01-12 Thread Koert Kuipers
it's spark 1.5.1. the dataframe has simply 2 columns, both strings. a sql query would probably be more efficient, but doesn't fit our purpose (we are doing a lot more stuff where we need rdds). also i am just trying to understand in general what in that rdd coming from a dataframe could slow things

Re: rdd join very slow when rdd created from data frame

2016-01-12 Thread Kevin Mellott
Can you please provide the high-level schema of the entities that you are attempting to join? I think that you may be able to use a more efficient technique to join these together; perhaps by registering the Dataframes as temp tables and constructing a Spark SQL query. Also, which version of

Re: parquet repartitions and parquet.enable.summary-metadata does not work

2016-01-12 Thread Cheng Lian
I see. So there are actually 3000 tasks instead of 3000 jobs, right? Would you mind providing the full stack trace of the GC issue? At first I thought it was identical to the _metadata one in the mail thread you mentioned. Cheng On 1/11/16 5:30 PM, Gavin Yue wrote: Here is how I set the conf:

1.6.0: Standalone application: Getting ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

2016-01-12 Thread Egor Pahomov
Hi, I'm moving my infrastructure from 1.5.2 to 1.6.0 and experiencing serious issue. I successfully updated spark thrift server from 1.5.2 to 1.6.0. But I have standalone application, which worked fine with 1.5.2 but failing on 1.6.0 with: *NestedThrowables:* *java.lang.ClassNotFoundException:

Re: [Spark SQL]: Issues with writing dataframe with Append Mode to Parquet

2016-01-12 Thread Jerry Lam
Hi Michael, Thanks for the hint! So if I turn off speculation, consecutive appends like above will not produce temporary files right? Which class is responsible for disabling the use of DirectOutputCommitter? Thank you, Jerry On Tue, Jan 12, 2016 at 4:12 PM, Michael Armbrust

Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Cheolsoo Park
Alex, see this jira- https://issues.apache.org/jira/browse/SPARK-9926 On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > Ran into this need myself. Does Spark have an equivalent of "mapreduce. > input.fileinputformat.list-status.num-threads"? > > Thanks.

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Richard Siebeling
Hi, this looks great and seems to be very usable. Would it be possible to access the session API from within ROSE, to get for example the images that are generated by R / openCPU and the logging to stdout that is logged by R? thanks in advance, Richard On Tue, Jan 12, 2016 at 10:16 PM, Vijay

Read HDFS file from an executor(closure)

2016-01-12 Thread Udit Mehta
Hi, Is there a way to read a text file from inside a spark executor? I need to do this for a streaming application where we need to read a file (whose contents would change) from a closure. I cannot use the "sc.textFile" method since the spark context is not serializable. I also cannot read a file
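A hedged sketch of one way to do it: open the file with the Hadoop FileSystem API inside the task, so nothing non-serializable (such as the SparkContext) is captured by the closure. The path is a placeholder.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import scala.io.Source

  val filtered = rdd.mapPartitions { iter =>
    val fs = FileSystem.get(new Configuration())             // uses the node's Hadoop config
    val in = fs.open(new Path("/lookup/current-ids.txt"))    // hypothetical HDFS path
    val ids = Source.fromInputStream(in).getLines().toSet    // materialize before closing
    in.close()
    iter.filter(ids.contains)
  }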

FPGrowth does not handle large result sets

2016-01-12 Thread Ritu Raj Tiwari
Folks: We are running into a problem where FPGrowth seems to choke on data sets that we think are not too large. We have about 200,000 transactions. Each transaction is composed of 50 items on average. There are about 17,000 unique items (SKUs) that might show up in any transaction. When
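For reference, a sketch of the MLlib call in question: with ~17,000 distinct items the number of frequent itemsets grows combinatorially as minSupport drops, and the results are collected back to the driver, so minSupport and driver memory are the usual levers. The threshold below is illustrative, not from the thread.

  import org.apache.spark.mllib.fpm.FPGrowth

  // transactions: RDD[Array[String]] of SKUs per basket
  val model = new FPGrowth()
    .setMinSupport(0.05)
    .setNumPartitions(10)
    .run(transactions)
  val frequent = model.freqItemsets.collect()   // the collect is where the driver needs headroom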

Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Alex Nastetsky
Thanks. I was actually able to get mapreduce.input.fileinputformat.list-status.num-threads working against a regular fileset in S3, in Spark 1.5.2 ... looks like the issue is isolated to Hive. On Tue, Jan 12, 2016 at 6:48 PM, Cheolsoo Park wrote: > Alex, see this
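For anyone landing here, a sketch of how that can be wired up for a plain fileset: set the property on the Hadoop configuration Spark hands to the input format (whether a given datasource honors it depends on the version). The path is a placeholder.

  sc.hadoopConfiguration.set(
    "mapreduce.input.fileinputformat.list-status.num-threads", "20")
  val data = sc.textFile("s3://my-bucket/logs/2016-01-12/*")   // hypothetical path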

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Gene Pang
Hi Dmitry, Yes, Tachyon can help with your use case. You can read and write to Tachyon via the filesystem api ( http://tachyon-project.org/documentation/File-System-API.html). There is a native Java API as well as a Hadoop-compatible API. Spark is also able to interact with Tachyon via the

ROSE: Spark + R on the JVM, now available.

2016-01-12 Thread David Russell
Hi all, I'd like to share news of the recent release of a new Spark package, [ROSE](http://spark-packages.org/package/onetapbeyond/opencpu-spark-executor). ROSE is a Scala library offering access to the full scientific computing power of the R programming language to Apache Spark batch and

How to optimiz and make this code faster using coalesce(1) and mapPartitionIndex

2016-01-12 Thread unk1102
Hi, I have the following code which I run as part of a thread that becomes a child job of my main Spark job. It takes hours to run for large data (around 1-2GB) because of coalesce(1), while if the data is in MB/KB it finishes faster; with larger data set sizes it sometimes does not complete at all. Please

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Corey Nolet
David, Thank you very much for announcing this! It looks like it could be very useful. Would you mind providing a link to the github? On Tue, Jan 12, 2016 at 10:03 AM, David wrote: > Hi all, > > I'd like to share news of the recent release of a new Spark

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
Hi Corey, > Would you mind providing a link to the github? Sure, here is the github link you're looking for: https://github.com/onetapbeyond/opencpu-spark-executor David "All that is gold does not glitter, Not all those who wander are lost." Original Message Subject: Re:

RE: ROSE: Spark + R on the JVM.

2016-01-12 Thread Chandan Verma
Definitely great news for all the R and spark guys over here. From: Corey Nolet [mailto:cjno...@gmail.com] Sent: Tuesday, January 12, 2016 11:02 PM To: David Cc: user@spark.apache.org; d...@spark.apache.org Subject: Re: ROSE: Spark + R on the JVM. David, Thank you very much for announcing