Multiple Filter Efficiency

2014-12-16 Thread zkidkid
Hi, currently I am trying to count records in a document using multiple filters. Say my document (user.log) looks like this:

// user  field1  field2  field3
   user1    0       0       1
   user2    0       1       0
   user3    0       0       0

I want to count records in user.log for filters like:
Filter1: field1 == 0 AND field2 == 0
Filter2: field1 == 0 AND field3 == 1

GC problem while filtering large data

2014-12-16 Thread Joe L
Hi, I am trying to filter a large table with 3 columns. Spark SQL might be a good choice, but I want to do it without SQL. The goal is to filter the bigtable with multiple clauses. I filtered the bigtable 3 times; the first filtering takes about 50 seconds, but the second and third filter transformations took about

Re: Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI

2014-12-16 Thread shenghua
A workaround trick was found and put in the ticket https://issues.apache.org/jira/browse/SPARK-4854. Hope this is useful.

Re: Accessing Apache Spark from Java

2014-12-16 Thread Akhil Das
Hi Jai, Refer to this doc and make sure your network is not blocking: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html Also make sure you are using the same version of spark in both places (the one on the cluster, and

Re: MLLib: Saving and loading a model

2014-12-16 Thread Jaonary Rabarisoa
Hi, there is ongoing work on model export: https://www.github.com/apache/spark/pull/3062 For now, since LinearRegression is serializable you can save it as an object file: sc.saveAsObjectFile(Seq(model)), then val model = sc.objectFile[LinearRegressionWithSGD](path).first model.predict(...)
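A minimal sketch of the object-file save/load pattern described above. Note that saveAsObjectFile lives on RDD rather than SparkContext, so one way to follow the suggestion is to wrap the model in a one-element RDD first; the path, the trained `model`, and `features` are assumptions, not from the thread.

    import org.apache.spark.mllib.regression.LinearRegressionModel

    // Assumed: `model` is a trained LinearRegressionModel, `features` a Vector.
    val path = "hdfs:///models/linreg"                     // hypothetical path
    sc.parallelize(Seq(model), 1).saveAsObjectFile(path)   // wrap in a 1-element RDD, then save

    val loaded = sc.objectFile[LinearRegressionModel](path).first()
    loaded.predict(features)                               // use the reloaded model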

GC problem while filtering

2014-12-16 Thread Batselem
Hi, I am trying to filter a large table with 3 columns. My goal is to filter this bigtable using multiple clauses. I filtered the bigtable 3 times, but the first filtering took about 50 seconds to complete whereas the second and third filter transformations took about 5 seconds. I wonder if it is because of

Re: Fetch Failed caused job to fail.

2014-12-16 Thread Ma,Xi
Hi Das, Thanks for your advice. I'm not sure what the point of setting memoryFraction to 1 would be. I tried to rerun the test with the following parameters in spark-defaults.conf, but it failed again: spark.rdd.compress true spark.akka.frameSize 50 spark.storage.memoryFraction 0.8

Spark inserting into parquet files with different schema

2014-12-16 Thread AdamPD
Hi all, I understand that parquet allows for schema versioning automatically in the format; however, I'm not sure whether Spark supports this. I'm saving a SchemaRDD to a parquet file, registering it as a table, then doing an insertInto with a SchemaRDD with an extra column. The second

Locality level and Kryo

2014-12-16 Thread aecc
Hi guys, It happens to me quite often that when the locality level of a task goes further than LOCAL (NODE, RACK, etc.), I get some of the following exceptions: "too many files open", "encountered unregistered class id", "cannot cast X to Y". I do not get any exceptions during shuffling (which means

Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
Hi, I've built Spark successfully with Maven but when I try to run spark-shell I get the following errors: Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/deploy/SparkSubmit Caused by:

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Akhil Das
Print the CLASSPATH and make sure the Spark assembly jar is in the classpath. Thanks Best Regards On Tue, Dec 16, 2014 at 5:04 PM, Daniel Haviv danielru...@gmail.com wrote: Hi, I've built spark successfully with maven but when I try to run spark-shell I get the following errors: Spark

Re: Data Loss - Spark streaming

2014-12-16 Thread Gerard Maas
Hi Jeniba, The second part of this meetup recording has a very good answer to your question. TD explains the current behavior and the on-going work in Spark Streaming to fix HA. https://www.youtube.com/watch?v=jcJq3ZalXD8 -kr, Gerard. On Tue, Dec 16, 2014 at 11:32 AM, Jeniba Johnson

Re: KafkaUtils explicit acks

2014-12-16 Thread Mukesh Jha
I agree that this is not a trivial task, as in this approach the Kafka acks will be done by the Spark tasks; that means a pluggable means to ack your input data source, i.e. changes in core. From my limited experience with Kafka + Spark, what I've seen is: if Spark tasks take longer than the

Why so many tasks?

2014-12-16 Thread bethesda
Our job is creating what appears to be an inordinate number of very small tasks, which blow out our OS inode and file limits. Rather than continually upping those limits, we are seeking to understand whether our real problem is that too many tasks are running, perhaps because we are

Pyspark 1.1.1 error with large number of records - serializer.dump_stream(func(split_index, iterator), outfile)

2014-12-16 Thread mj
I've got a simple pyspark program that generates two CSV files and then carries out a leftOuterJoin (a fact RDD joined to a dimension RDD). The program works fine for smaller volumes of records, but when it goes beyond 3 million records for the fact dataset, I get the error below. I'm running

Re: Why so many tasks?

2014-12-16 Thread Koert Kuipers
sc.textFile uses a Hadoop input format. Hadoop input formats by default create one task per file, and they are not very suitable for many very small files. Can you turn your 1,000 files into one larger text file? Otherwise maybe try: val data = sc.textFile("/user/foo/myfiles/*").coalesce(100) On

Re: Why so many tasks?

2014-12-16 Thread Akhil Das
Try to repartition the data like: val data = sc.textFile("/user/foo/myfiles/*").repartition(100) Since the file sizes are small it shouldn't be a problem. Thanks Best Regards On Tue, Dec 16, 2014 at 6:21 PM, bethesda swearinge...@mac.com wrote: Our job is creating what appears to be an

Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space

2014-12-16 Thread Gen
Hi, how many clients and how many products do you have? Cheers, Gen. jaykatukuri wrote: Hi all, I am running into an out-of-memory error while running ALS using MLlib on a reasonably small data set consisting of around 6 million ratings. The stack trace is below: java.lang.OutOfMemoryError: Java heap

Re: Why so many tasks?

2014-12-16 Thread Gerard Maas
Creating an RDD from a wildcard like this: val data = sc.textFile("/user/foo/myfiles/*") will create one partition for each file found: 1,000 files = 1,000 partitions. A task is a job stage (defined as a sequence of transformations) applied to a partition, so 1,000 partitions = 1,000 tasks per stage.

Re: Why so many tasks?

2014-12-16 Thread Gen
Hi, as you have 1,000 files, the RDD created by textFile will have 1,000 partitions. That is normal. In fact, on the same principle as HDFS, it is better to store data in a smaller number of larger files. You can use data.coalesce(10) to solve this problem (it reduces the number of
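A quick way to confirm the partition count and apply the coalesce suggestion above (the path is the one used in the thread):

    val data = sc.textFile("/user/foo/myfiles/*")
    println(data.partitions.size)        // one partition per file: 1,000 files -> 1,000 partitions

    val fewer = data.coalesce(10)        // merge existing partitions
    println(fewer.partitions.size)       // 10

coalesce merges partitions locally by default, so it avoids the full shuffle that repartition would trigger.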

Error when Applying schema to a dictionary with a Tuple as key

2014-12-16 Thread sahanbull
Hi guys, I'm running a Spark cluster in AWS with Spark 1.1.0 on EC2. I am trying to convert an RDD of tuples (u'string', int, {(int, int): int, (int, int): int}) to a schema RDD using the schema: fields = [StructField('field1', StringType(), True),

Re: Passing Spark Configuration from Driver (Master) to all of the Slave nodes

2014-12-16 Thread Gerard Maas
Hi Demi, Thanks for sharing. What we usually do is let the driver read the configuration for the job and pass the config object to the actual job as a serializable object. That avoids the need for a centralized config-sharing point that has to be accessed from the workers. As you have
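A minimal sketch of the pattern described above, with hypothetical names (JobConfig, its fields, and `rdd` stand for whatever the job actually uses): the driver builds a serializable config once and the task closure ships it to the workers.

    // Hypothetical config type; any serializable case class works.
    case class JobConfig(dbHost: String, dbPort: Int, batchSize: Int)

    val config = JobConfig("db.internal.example", 5432, 500)   // read/parsed on the driver

    rdd.foreachPartition { records =>
      // `config` is captured by the closure, serialized with the task,
      // and deserialized on the worker -- no central config service needed.
      records.grouped(config.batchSize).foreach { batch =>
        // write `batch` to config.dbHost:config.dbPort ...
      }
    }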

Re: KafkaUtils explicit acks

2014-12-16 Thread Cody Koeninger
Do you actually need spark streaming per se for your use case? If you're just trying to read data out of kafka into hbase, would something like this non-streaming rdd work for you: https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka Note

Re: Why so many tasks?

2014-12-16 Thread bethesda
Thank you! I had known about the small-files problem in HDFS but didn't realize that it affected sc.textFile(). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-so-many-tasks-tp20712p20717.html Sent from the Apache Spark User List mailing list archive

Appending an incremental value to each RDD record

2014-12-16 Thread bethesda
I think this is sort of a newbie question, but I've checked the API closely and don't see an obvious answer: given an RDD, how would I create a new RDD of tuples where the first tuple value is an incrementing Int, e.g. 1, 2, 3 ..., and the second value of the tuple is the original RDD record? I'm

NullPointerException on cluster mode when using foreachPartition

2014-12-16 Thread richiesgr
Hi, this time I need an expert. On 1.1.1, and only in cluster mode (standalone or EC2), when I use this code: countersPublishers.foreachRDD(rdd => { rdd.foreachPartition(partitionRecords => { partitionRecords.foreach(record => { // dbActorUpdater ! updateDBMessage(record)

Re: Appending an incremental value to each RDD record

2014-12-16 Thread Gerard Maas
You would do: rdd.zipWithIndex This gives you an RDD[(Original, Long)] where the second element is the index. To have an (index, original) tuple, you will need to map that RDD to the desired shape: rdd.zipWithIndex.map(_.swap) -kr, Gerard. On Tue, Dec 16, 2014 at 4:12 PM,
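A runnable version of the suggestion, on toy data. zipWithIndex produces Long indices starting at 0, so add 1 if you want 1, 2, 3 ... as in the original question:

    val rdd = sc.parallelize(Seq("a", "b", "c"))

    val indexed  = rdd.zipWithIndex()                        // RDD[(String, Long)]: (record, index)
    val swapped  = indexed.map(_.swap)                       // RDD[(Long, String)]: (index, record)
    val oneBased = swapped.map { case (i, x) => (i + 1, x) }

    oneBased.collect().foreach(println)                      // (1,a) (2,b) (3,c)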

Re: Appending an incremental value to each RDD record

2014-12-16 Thread mj
You could try using zipWithIndex (links below to API docs). For example, in Python: items = ['a', 'b', 'c'] items2 = sc.parallelize(items) print(items2.first()) items3 = items2.map(lambda x: (x, x + '!')) print(items3.first()) items4 = items3.zipWithIndex() print(items4.first())

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
That's the first thing I tried... still the same error: hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib hdfs@ams-rsrv01:~$ cd /tmp/spark/spark-branch-1.1 hdfs@ams-rsrv01:/tmp/spark/spark-branch-1.1$ ./bin/spark-shell Spark assembly has been built with Hive, including

Re: Why so many tasks?

2014-12-16 Thread Ashish Rangole
Take a look at CombineFileInputFormat. Repartition or coalesce could introduce shuffle I/O overhead. On Dec 16, 2014 7:09 AM, bethesda swearinge...@mac.com wrote: Thank you! I had known about the small-files problem in HDFS but didn't realize that it affected sc.textFile(). -- View
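A hedged sketch of the CombineFileInputFormat route, assuming Hadoop 2's CombineTextInputFormat is available on the classpath; the split size and path are illustrative, not from the thread.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    val conf = new Configuration(sc.hadoopConfiguration)
    // Pack many small files into combined splits of at most ~128 MB each.
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

    val data = sc.newAPIHadoopFile(
        "/user/foo/myfiles/*",
        classOf[CombineTextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        conf
      ).map(_._2.toString)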

Re: Spark 1.2 + Avro does not work in HDP2.2

2014-12-16 Thread manasdebashiskar
Hi all, I saw some help online about forcing avro-mapred to hadoop2 using classifiers. Now my configuration is: val avro = "org.apache.avro" % "avro-mapred" % V.avro classifier "hadoop2" However, I still get java.lang.IncompatibleClassChangeError. I think I am not building spark

Re: Streaming | Partition count mismatch exception while saving data in RDD

2014-12-16 Thread Aniket Bhatnagar
It turns out that this happens when the checkpoint is set to a local directory path. I have opened a JIRA, SPARK-4862, for Spark Streaming to output a better error message. Thanks, Aniket On Tue Dec 16 2014 at 20:08:13 Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am using spark 1.1.0 running a
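For reference, the fix implied above: point the streaming checkpoint at a filesystem every executor can reach (the path here is hypothetical).

    // Use an HDFS (or S3) path visible to all executors,
    // not a node-local directory such as /tmp/checkpoints.
    ssc.checkpoint("hdfs:///spark/checkpoints/my-streaming-job")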

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Akhil Das
This is how it looks on my machine. [image: Inline image 1] Thanks Best Regards On Tue, Dec 16, 2014 at 9:33 PM, Daniel Haviv danielru...@gmail.com wrote: That's the first thing I tried... still the same error: hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib

Kafka Receiver not recreated after executor died

2014-12-16 Thread Luis Ángel Vicente Sánchez
Dear spark community, We were testing a spark failure scenario where the executor that is running a Kafka Receiver dies. We are running our streaming jobs on top of mesos and we killed the mesos slave that was running the executor ; a new executor was created on another mesos-slave but

Re: Kafka Receiver not recreated after executor died

2014-12-16 Thread Luis Ángel Vicente Sánchez
It seems to be slightly related to this: https://issues.apache.org/jira/browse/SPARK-1340 But in this case, it's not the Task that is failing but the entire executor where the Kafka Receiver resides. 2014-12-16 16:53 GMT+00:00 Luis Ángel Vicente Sánchez langel.gro...@gmail.com: Dear spark

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
Same here... # jar tf lib/spark-assembly-1.1.2-SNAPSHOT-hadoop2.3.0.jar | grep SparkSubmit.class *org/apache/spark/deploy/SparkSubmit.class* On Tue, Dec 16, 2014 at 6:50 PM, Akhil Das ak...@sigmoidanalytics.com wrote: This is how it looks on my machine. [image: Inline image 1] Thanks

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Akhil Das
And this is what my classpath looks like: Spark assembly has been built with Hive, including Datanucleus jars on classpath

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
I've added every jar in the lib dir to my classpath and still no luck:

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Akhil Das
Can you open the file bin/spark-class, put an echo $CLASSPATH below the place where it is exported, and see what the contents are? On 16 Dec 2014 22:46, Daniel Haviv danielru...@gmail.com wrote: I've added every jar in the lib dir to my classpath and still no luck:

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
Completely different from the one I set: Classpath is

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
I'm using CDH5 that was installed via Cloudera Manager. Does it matter? Thanks, Daniel On 16 Dec 2014, at 19:18, Akhil Das ak...@sigmoidanalytics.com wrote: Can you open the file bin/spark-class, put an echo $CLASSPATH below the place where it is exported, and see what the

Re: Appending an incremental value to each RDD record

2014-12-16 Thread bethesda
Thanks! zipWithIndex() works well. I had overlooked it because the name 'zip' is rather odd -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Appending-an-incrental-value-to-each-RDD-record-tp20718p20722.html Sent from the Apache Spark User List mailing

Re: integrating long-running Spark jobs with Thriftserver

2014-12-16 Thread Tim Schweichler
To ask a related question, if I use Zookeeper for table locking, will this affect all attempts to access the Hive tables (including those from my Spark applications) or only those made through the Thriftserver? In other words, does Zookeeper provide concurrency for the Hive metastore in general

No disk single pass RDD aggregation

2014-12-16 Thread Jim Carroll
Okay, I have an RDD that I want to run an aggregate over, but it insists on spilling to disk even though I structured the processing to only require a single pass. In other words, I can do all of my processing one entry in the RDD at a time without persisting anything. I set

Re: Spark 1.2 + Avro file does not work in HDP2.2

2014-12-16 Thread Zhan Zhang
Hi Manas, There is a small patch needed for HDP 2.2. You can refer to this PR: https://github.com/apache/spark/pull/3409 There are some other issues compiling against Hadoop 2.6, but we will fully support it very soon. You can ping me if you want. Thanks. Zhan Zhang On Dec 12, 2014, at 11:38

Re: No disk single pass RDD aggregation

2014-12-16 Thread Jim Carroll
In case a little more information is helpful: the RDD is constructed using sc.textFile(fileUri) where the fileUri points to a .gz file (that's too big to fit on my disk). I do an rdd.persist(StorageLevel.NONE) and it seems to have no effect. This RDD is what I'm calling aggregate on, and I expect to

Re: No disk single pass RDD aggregation

2014-12-16 Thread Jim Carroll
Nvm. I'm going to post another question since this has to do with the way spark handles sc.textFile with a file://.gz -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20725.html Sent from the Apache Spark User

Understanding disk usage with Accumulators

2014-12-16 Thread Ganelin, Ilya
Hi all, I'm running a long-running batch-processing job with Spark through YARN. I am doing the following batch process: val resultsArr = sc.accumulableCollection(mutable.ArrayBuffer[ListenableFuture[Result]]()) InMemoryArray.forEach { 1) Using a thread pool, generate callable jobs that

Re: Understanding disk usage with Accumulators

2014-12-16 Thread Ganelin, Ilya
Also, this may be related to this issue: https://issues.apache.org/jira/browse/SPARK-3885. Further, to clarify, data is being written to Hadoop on the data nodes. Would really appreciate any help. Thanks! From: Ganelin, Ilya

Re: Spark 1.1.0 does not spawn more than 6 executors in yarn-client mode and ignores --num-executors

2014-12-16 Thread Aniket Bhatnagar
Hi guys, I am hoping someone might have a clue about why this is happening. Otherwise I will have to delve into the YARN module's source code to better understand the issue. On Wed, Dec 10, 2014, 11:54 PM Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am running spark 1.1.0 on AWS EMR and I am

Spark handling of a file://xxxx.gz Uri

2014-12-16 Thread Jim Carroll
Is there a way to get Spark to NOT repartition/shuffle/expand a sc.textFile(fileUri) when the URI is a gzipped file? Expanding a gzipped file should be thought of as a transformation and not an action (if the analogy is apt). There is no need to fully create and fill out an intermediate RDD with

Running Spark Job on Yarn from Java Code

2014-12-16 Thread Rahul Swaminathan
Hi all, I am trying to run a simple Spark_Pi application through Yarn from Java code. I have the Spark_Pi class and everything works fine if I run on Spark. However, when I set master to yarn-client and set yarn mode to true, I keep getting exceptions. I suspect this has something to do with

Re: Multiple Filter Efficiency

2014-12-16 Thread Imran Rashid
I think accumulators do exactly what you want. (Scala syntax below; I'm just not familiar with the Java equivalent ...) val f1counts = sc.accumulator(0) val f2counts = sc.accumulator(0) val f3counts = sc.accumulator(0) textfile.foreach { s => if (f1matches) f1counts += 1 ... } Note that
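A filled-in sketch of the accumulator approach for the original question, assuming whitespace-separated columns as in the example document; the column positions and input path are assumptions.

    val f1counts = sc.accumulator(0)   // field1 == 0 && field2 == 0
    val f2counts = sc.accumulator(0)   // field1 == 0 && field3 == 1

    val textfile = sc.textFile("hdfs:///logs/user.log")
    textfile.foreach { s =>
      val cols = s.split("\\s+")       // user field1 field2 field3
      if (cols(1) == "0" && cols(2) == "0") f1counts += 1
      if (cols(1) == "0" && cols(3) == "1") f2counts += 1
    }

    // Accumulator values are only readable on the driver, after the action runs.
    println(s"Filter1=${f1counts.value} Filter2=${f2counts.value}")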

RE: Spark SQL API Doc IsCached as SQL command

2014-12-16 Thread Judy Nash
Thanks Cheng. Tried it out and saw the InMemoryColumnarTableScan word in the physical plan. From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Friday, December 12, 2014 11:37 PM To: Judy Nash; user@spark.apache.org Subject: Re: Spark SQL API Doc IsCached as SQL command There isn’t a SQL

Re: streaming linear regression is not building the model

2014-12-16 Thread tsu-wt
I am having the same issue, and it still does not update for me. I am trying to execute the example by using bin/run-example -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/streaming-linear-regression-is-not-building-the-model-tp18522p20727.html Sent from

Cannot parse ListBuffer - StreamingLinearRegression

2014-12-16 Thread tsu-wt
Hi, I am trying to run the StreamingLinearRegression example on a single node without Hadoop installed. I keep getting the following error and cannot find any documentation about the issue: 14/12/16 13:26:50 ERROR JobScheduler: Error running job streaming job 141875801 ms.1

Re: NumberFormatException

2014-12-16 Thread Imran Rashid
Wow, really weird. My intuition is the same as everyone else's: some unprintable character. Here are a couple more debugging tricks I've used in the past: // set up accumulators to catch the bad rows as a side effect val nBadRows = sc.accumulator(0) val nGoodRows = sc.accumulator(0) val badRows
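A self-contained sketch of the side-effect debugging trick being described, with a hypothetical CSV path and column index; bad rows are collected into an accumulable buffer and inspected on the driver.

    import scala.collection.mutable.ArrayBuffer

    val nBadRows  = sc.accumulator(0)
    val nGoodRows = sc.accumulator(0)
    val badRows   = sc.accumulableCollection(ArrayBuffer[String]())

    sc.textFile("hdfs:///data/input.csv").foreach { line =>
      try {
        line.split(",")(2).trim.toInt    // hypothetical: the column that fails to parse
        nGoodRows += 1
      } catch {
        case _: NumberFormatException =>
          nBadRows += 1
          badRows += line                // keep the offending line as a side effect
      }
    }

    println(s"good=${nGoodRows.value} bad=${nBadRows.value}")
    badRows.value.take(10).foreach(println)   // eyeball a few bad rows for stray characters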

Control default partition when loading an RDD from HDFS

2014-12-16 Thread Shuai Zheng
Hi all, my application loads 1,000 files, each from 200 MB to a few GB, and combines them with other data to do calculations. Some pre-calculation must be done at the level of each file; after that, the results need to be combined for further calculation. In Hadoop, this is simple because I can

Re: Data Loss - Spark streaming

2014-12-16 Thread Ryan Williams
TD's portion seems to start at 27:24: http://youtu.be/jcJq3ZalXD8?t=27m24s On Tue Dec 16 2014 at 7:13:43 AM Gerard Maas gerard.m...@gmail.com wrote: Hi Jeniba, The second part of this meetup recording has a very good answer to your question. TD explains the current behavior and the on-going

Re: Error when Applying schema to a dictionary with a Tuple as key

2014-12-16 Thread Davies Liu
It's a bug, could you file a JIRA for this? Thanks! On Tue, Dec 16, 2014 at 5:49 AM, sahanbull sa...@skimlinks.com wrote: Hi guys, I'm running a Spark cluster in AWS with Spark 1.1.0 on EC2. I am trying to convert an RDD of tuples (u'string', int, {(int, int): int, (int, int): int})

Spark eating exceptions in multi-threaded local mode

2014-12-16 Thread Corey Nolet
I've been running a job in local mode using --master local[*] and I've noticed that, for some reason, exceptions appear to get eaten, as in, I don't see them. If I debug in my IDE, I'll see that an exception was thrown when I step through the code, but if I just run the application, it appears

Re: Error when Applying schema to a dictionary with a Tuple as key

2014-12-16 Thread Davies Liu
I had created https://issues.apache.org/jira/browse/SPARK-4866, it will be fixed by https://github.com/apache/spark/pull/3714. Thank you for reporting this. Davies On Tue, Dec 16, 2014 at 12:44 PM, Davies Liu dav...@databricks.com wrote: It's a bug, could you file a JIRA for this? thanks! On

Re: Spark handling of a file://xxxx.gz Uri

2014-12-16 Thread Harry Brundage
Are you certain that's happening, Jim? Why? What happens if you just do sc.textFile(fileUri).count()? If I'm not mistaken, the Hadoop InputFormat for gzip and the RDD wrapper around it already have the streaming behaviour you wish for, but I could be wrong. Also, are you in PySpark or Scala Spark?

Re: pyspark exception catch

2014-12-16 Thread cfregly
Hey Igor! A few ways to work around this, depending on the level of exception-handling granularity you're willing to accept: 1) use mapPartitions() to wrap the entire partition-handling code in a try/catch -- this is fairly coarse-grained, however, and will fail the entire partition. 2) modify

Re: Pyspark 1.1.1 error with large number of records - serializer.dump_stream(func(split_index, iterator), outfile)

2014-12-16 Thread Sebastián Ramírez
Your Spark is trying to load a Hadoop binary, winutils.exe, which you don't have on your Windows machine: 14/12/16 12:48:28 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. at

Re: Spark handling of a file://xxxx.gz Uri

2014-12-16 Thread Jim
Hi Harry, Thanks for your response. I'm working in Scala. When I do a count call, it expands the RDD in the count (since it's an action). You can see the call stack that results in the failure of the job here: ERROR DiskBlockObjectWriter - Uncaught exception while reverting partial writes

Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2014-12-16 Thread Sebastián Ramírez
Are you reading the file from your driver (main / master) program? Is your file in a distributed system like HDFS, available to all your nodes? It might be due to the laziness of transformations: http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations Transformations are lazy,

S3 globbing

2014-12-16 Thread durga
Hi all, I need help with a regex/glob in my sc.textFile(). I have lots of files with an epoch-millisecond timestamp, e.g. abc_1418759383723.json. Now I need to consume the last hour's files using the epoch timestamp mentioned above. I tried a couple of options; nothing seems to work for me. If any

How do I stop the automatic partitioning of my RDD?

2014-12-16 Thread Jim Carroll
I've been trying to figure out how to use Spark to do a simple aggregation without repartitioning and essentially creating fully instantiated intermediate RDDs, and it seems virtually impossible. I've now gone as far as writing my own single-partition RDD that wraps an Iterator[String] and calling

Re: How do I stop the automatic partitioning of my RDD?

2014-12-16 Thread Jim Carroll
Wow. I just realized what was happening and it's all my fault. I have a library method that I wrote that presents the RDD, and I was actually repartitioning it myself. I feel pretty dumb. Sorry about that.

toArray, first get different results from a one-element RDD

2014-12-16 Thread buring
Hi, recently I have had some problems with RDD behavior. It's about the RDD.first and RDD.toArray methods when the RDD has only one element. I get different results from the different methods on a one-element RDD, where I should get the same result. I will give more detail after the code. My

Re: Executor memory

2014-12-16 Thread Pala M Muthaia
Thanks for the clarifications. I misunderstood what the number on UI meant. On Mon, Dec 15, 2014 at 7:00 PM, Sean Owen so...@cloudera.com wrote: I believe this corresponds to the 0.6 of the whole heap that is allocated for caching partitions. See spark.storage.memoryFraction on

when will the spark 1.3.0 be released?

2014-12-16 Thread 张建轶
Hi! When will Spark 1.3.0 be released? I want to use the new LDA feature. Thank you!

Re: when will the spark 1.3.0 be released?

2014-12-16 Thread Marco Shaw
When it is ready. On Dec 16, 2014, at 11:43 PM, 张建轶 zhangjia...@youku.com wrote: Hi! When will Spark 1.3.0 be released? I want to use the new LDA feature. Thank you!

Re: RDDs being cleaned too fast

2014-12-16 Thread Harihar Nahak
RDD.persist() can be useful here. On 11 December 2014 at 14:34, ankits [via Apache Spark User List] ml-node+s1001560n20613...@n3.nabble.com wrote: I'm using spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast. How can I inspect the size of an RDD in memory and get more information

Re: Re: Fetch Failed caused job to fail.

2014-12-16 Thread Ma,Xi
Actually there was still a fetch failure. However, after I upgraded Spark to 1.1.1, this error did not occur again. Thanks, Mars From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: December 16, 2014 17:52 To: Ma,Xi Cc: u...@spark.incubator.apache.org Subject: Re: Re: Fetch Failed caused job to fail.

Spark SQL DSL for joins?

2014-12-16 Thread Jerry Raj
Hi, I'm using the Scala DSL for Spark SQL, but I'm not able to do joins. I have two tables (backed by Parquet files) and I need to do a join across them using a common field (user_id). This works fine using standard SQL, but not using the language-integrated DSL: neither t1.join(t2, on =

Re: when will the spark 1.3.0 be released?

2014-12-16 Thread Andrew Ash
Releases are roughly every 3 months, so you should expect around March if the pace stays steady. 2014-12-16 22:56 GMT-05:00 Marco Shaw marco.s...@gmail.com: When it is ready. On Dec 16, 2014, at 11:43 PM, 张建轶 zhangjia...@youku.com wrote: Hi! When will Spark 1.3.0 be released? I

Re: Running Spark Job on Yarn from Java Code

2014-12-16 Thread Kyle Lin
Hi there, I also got an exception when running the Pi example on YARN. Spark version: spark-1.1.1-bin-hadoop2.4. My environment: Hortonworks HDP 2.2. My command: ./bin/spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10 Output logs: 14/12/17 14:06:32

RE: Control default partition when loading an RDD from HDFS

2014-12-16 Thread Sun, Rui
Hi Shuai, How did you turn off the file split in Hadoop? I guess you might have implemented a customized FileInputFormat that overrides isSplitable() to return false. If you do have such a FileInputFormat, you can simply pass it as a constructor parameter to HadoopRDD or NewHadoopRDD in Spark.
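A hedged sketch of the non-splittable-FileInputFormat idea (names and path hypothetical): subclass the new-API TextInputFormat, override isSplitable to return false, and hand it to newAPIHadoopFile so each file becomes exactly one partition.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.JobContext
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Keep every file in a single split, so one Spark partition == one file.
    class WholeFileTextInputFormat extends TextInputFormat {
      override protected def isSplitable(context: JobContext, file: Path): Boolean = false
    }

    val perFileLines = sc.newAPIHadoopFile(
        "hdfs:///data/input/*",
        classOf[WholeFileTextInputFormat],
        classOf[LongWritable],
        classOf[Text]
      ).map(_._2.toString)

If whole files fit in memory, sc.wholeTextFiles(path), which yields one (filename, content) record per file, is another option for per-file pre-calculation.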

Re: Spark SQL DSL for joins?

2014-12-16 Thread Jerry Raj
Another problem with the DSL: t1.where('term == "dmin").count() returns zero. But sqlCtx.sql("select * from t1 where term = 'dmin'").count() returns 700, which I know is correct from the data. Is there something wrong with how I'm using the DSL? Thanks On 17/12/14 11:13 am, Jerry Raj wrote:

RE: pyspark sc.textFile uses only 4 out of 32 threads per node

2014-12-16 Thread Sun, Rui
Gautham, how many gz files do you have? Maybe the reason is that a gz file is compressed and can't be split for processing by MapReduce. A single gz file can only be processed by a single mapper, so the CPU threads can't be fully utilized. -Original Message- From:

Rolling upgrade Spark cluster

2014-12-16 Thread Kenichi Maehashi
Hi, I have a Spark cluster using standalone mode. The Spark Master is configured in High Availability mode. Now I am going to upgrade Spark from 1.0 to 1.1, but I don't want to interrupt the currently running jobs. (1) Is there any way to perform a rolling upgrade (while running a job)? (2) If not,

Re: Spark SQL DSL for joins?

2014-12-16 Thread Tobias Pfeiffer
Jerry, On Wed, Dec 17, 2014 at 3:35 PM, Jerry Raj jerry...@gmail.com wrote: Another problem with the DSL: t1.where('term == "dmin").count() returns zero. Looks like you need ===: https://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD Tobias
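For reference, a minimal before/after of the fix Tobias points to, assuming the usual sqlContext implicits are in scope (column and value taken from the thread). == is plain Scala equality and yields a Boolean, while === builds the intended Catalyst predicate:

    // returns 0: 'term == "dmin" is just a Symbol-vs-String comparison (always false)
    t1.where('term == "dmin").count()

    // builds the intended equality predicate on the term column
    t1.where('term === "dmin").count()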

Re: toArray, first get different results from a one-element RDD

2014-12-16 Thread buring
I got the key point. The problem is in sc.sequenceFile: from the API description, the RDD will create many references to the same object, so I revised the code from sessions.getBytes to sessions.getBytes.clone. It seems to work. Thanks.

Re: S3 globbing

2014-12-16 Thread Akhil Das
Did you try something like: // Get the last hour val d = System.currentTimeMillis() - 3600 * 1000 val ex = "abc_" + d.toString().substring(0, 7) + "*.json" [image: Inline image 1] Thanks Best Regards On Wed, Dec 17, 2014 at 5:05 AM, durga durgak...@gmail.com wrote: Hi All, I need help with
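Putting the reply's idea together as a hedged sketch (the bucket and prefix path are assumptions). Note that fixing the first 7 digits of a 13-digit epoch-millis timestamp pins a window of 10^6 ms, roughly 17 minutes, so covering a full hour generally needs several such prefixes; sc.textFile accepts a comma-separated list of paths/globs, so they can be combined in one call.

    // Build a glob from the epoch-millis prefix of "one hour ago".
    val oneHourAgo = System.currentTimeMillis() - 3600 * 1000
    val prefix = oneHourAgo.toString.substring(0, 7)

    val recent = sc.textFile("s3n://my-bucket/logs/abc_" + prefix + "*.json")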

Re: Unable to start Spark 1.3 after building: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-16 Thread Kyle Lin
I also got the same problem. 2014-12-09 22:58 GMT+08:00 Daniel Haviv danielru...@gmail.com: Hi, I've built spark 1.3 with hadoop 2.6 but when I start up the spark-shell I get the following exception: 14/12/09 06:54:24 INFO server.AbstractConnector: Started

Re: NullPointerException on cluster mode when using foreachPartition

2014-12-16 Thread Shixiong Zhu
Could you post the stack trace? Best Regards, Shixiong Zhu 2014-12-16 23:21 GMT+08:00 richiesgr richie...@gmail.com: Hi, this time I need an expert. On 1.1.1, and only in cluster mode (standalone or EC2), when I use this code: countersPublishers.foreachRDD(rdd => {

Locality Level Kryo

2014-12-16 Thread aecc
Hi guys, I get Kryo exceptions of the type "unregistered class id" and "cannot cast to class" when the locality level of the tasks goes beyond LOCAL. However, I get no Kryo exceptions during shuffling operations. If the locality level never goes beyond LOCAL, everything works fine. Is there a special