Re: All executors run on just a few nodes

2014-10-20 Thread Tao Xiao
Raymond, thank you. But I read in another thread (http://apache-spark-user-list.1001560.n3.nabble.com/When-does-Spark-switch-from-PROCESS-LOCAL-to-NODE-LOCAL-or-RACK-LOCAL-td7091.html) that PROCESS_LOCAL means the data is in the same JVM as the code that is running. When data is in the same JVM

Re: All executors run on just a few nodes

2014-10-20 Thread raymond
When the data’s source host is not one of the registered executors, it will also be marked as PROCESS_LOCAL, though it should really have a different name for this. I don’t know whether someone changed this name very recently, but for 0.9 that is the case. When I say satisfy, yes, if the executors

Re: checkpoint and not running out of disk space

2014-10-20 Thread sivarani
I am new to Spark; I am using Spark Streaming with Kafka. My streaming batch interval is 1s. Assume I get 100 records in 1s, 120 records in 2s and 80 records in 3s -- {sec 1: 1,2,...,100} -- {sec 2: 1,2,...,120} -- {sec 3: 1,2,...,80}. I apply my logic in sec 1 and have a result = result1; I want to use

Re: Upgrade to Spark 1.1.0?

2014-10-20 Thread Dmitriy Lyubimov
The Mahout context does not include _all_ possible transitive dependencies; it would not be lightning fast to pull in all the legacy etc. dependencies. There's an ignored unit test that asserts context path correctness. You can un-ignore it and run it to verify it still works as expected. The reason it is set to

Re: MLlib linking error Mac OS X

2014-10-20 Thread poiuytrez
This is my error: 14/10/17 10:24:56 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 14/10/17 10:24:56 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS However, it seems to work. What does it mean?

Re: how to set log level of spark executor on YARN(using yarn-cluster mode)

2014-10-20 Thread eric wong
Thanks for your reply! Sorry, I forgot to mention that the Spark version I'm using is *1.0.2* instead of 1.1.0. Also, the 1.0.2 documentation does not seem to be the same as 1.1.0's: http://spark.apache.org/docs/1.0.2/running-on-yarn.html And I tried your suggestion (upload) but it did not work: *1. set my

New research using Spark: Unified Secure On-/Off-line Analytics

2014-10-20 Thread Peter Coetzee
New open-access research published in the journal Parallel Computing demonstrates a novel approach to engineering analytics for deployment in streaming and batch contexts. Increasing numbers of users are extracting real value from their data using tools like IBM InfoSphere Streams for

Re: How to write a RDD into One Local Existing File?

2014-10-20 Thread Akhil Das
If you don't need part-xxx files in the output but one file, then you should repartition (or coalesce) the RDD into 1 partition (this will be a bottleneck since you are disabling the parallelism - it's like giving everything to one machine to process). You are better off merging those part-xxx files afterwards
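
A minimal spark-shell sketch of the coalesce approach described above (paths are illustrative):

    val rdd = sc.textFile("hdfs:///input/path")
    // coalesce(1) funnels everything through a single task, so only use it when the output is small
    rdd.coalesce(1).saveAsTextFile("hdfs:///output/single")
    // the result is a single part file under hdfs:///output/single/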

What does KryoException: java.lang.NegativeArraySizeException mean?

2014-10-20 Thread Fengyun RAO
The exception drives me crazy because it occurs randomly. I don't know which line of my code causes it, and I don't even understand what KryoException: java.lang.NegativeArraySizeException means or implies. 14/10/20 15:59:01 WARN scheduler.TaskSetManager: Lost task 32.2 in stage

Re: Error while running Streaming examples - no snappyjava in java.library.path

2014-10-20 Thread Akhil Das
It's a known bug with JDK7 and OSX's naming convention; here's how to resolve it: 1. Get the Snappy jar file from http://central.maven.org/maven2/org/xerial/snappy/snappy-java/ 2. Copy the appropriate one to your project's class path. Thanks Best Regards On Sun, Oct 19, 2014 at 10:18 PM, bdev

Re: why does driver connects to master fail ?

2014-10-20 Thread Akhil Das
What is the application that you are running, and what is your cluster setup? Given the logs, it looks like the master is dead for some reason. Thanks Best Regards On Sun, Oct 19, 2014 at 2:48 PM, randylu randyl...@gmail.com wrote: In addition, the driver receives several

Re: Spark SQL on XML files

2014-10-20 Thread Akhil Das
One approach would be to convert those XML files into JSON files and use jsonRDD; another approach would be to convert the XML files into Parquet files and use parquetFile. Thanks Best Regards On Sun, Oct 19, 2014 at 9:38 AM, gtinside gtins...@gmail.com wrote: Hi, I have a bunch of XML files

Re: why fetch failed

2014-10-20 Thread Akhil Das
I used to hit this issue when my data size was too large and the number of partitions was too large (1200). I got rid of it by - Reducing the number of partitions - Setting the following while creating the sparkContext: .set("spark.rdd.compress", "true")
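
A minimal sketch of setting that option when building the SparkContext (the application name is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fetch-failed-example")
      .set("spark.rdd.compress", "true")  // compress serialized RDD partitions to shrink what is shuffled/stored
    val sc = new SparkContext(conf)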

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-20 Thread Gerard Maas
Pinging TD -- I'm sure you know :-) -kr, Gerard. On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, We have been implementing several Spark Streaming jobs that are basically processing data and inserting it into Cassandra, sorting it among different keyspaces.

Re: why does driver connects to master fail ?

2014-10-20 Thread randylu
Dear Akhil Das-2, My application runs in standalone mode with 50 machines. It's okay if the input file is small, but if I increase the input to 8GB, the application runs just a few iterations and then prints the following error logs: 14/10/20 17:15:28 WARN AppClient$ClientActor: Connection to

Re: How to aggregate data in Apach Spark

2014-10-20 Thread Gen
Hi, I will write the code in Python: {code:title=test.py} data = sc.textFile(...).map(...) ## Please make sure that the RDD is like [[id, c1, c2, c3], [id, c1, c2, c3], ...] keypair = data.map(lambda l: ((l[0],l[1],l[2]), float(l[3]))) keypair = keypair.reduceByKey(add) out = keypair.map(lambda l:
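
For reference, a rough Scala equivalent of the aggregation sketched above in Python (the four-field layout with a numeric last field is an assumption taken from the snippet):

    // assumes lines of the form "id,c1,c2,c3" where c3 is numeric
    val data = sc.textFile("input.csv").map(_.split(","))
    val keypair = data.map(a => ((a(0), a(1), a(2)), a(3).toDouble))
    val out = keypair.reduceByKey(_ + _)  // sums the values per (id, c1, c2) key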

Re: What does KryoException: java.lang.NegativeArraySizeException mean?

2014-10-20 Thread Fengyun RAO
Thank you, Guillaume, my dataset is not that large, it's totally ~2GB 2014-10-20 16:58 GMT+08:00 Guillaume Pitel guillaume.pi...@exensa.com: Hi, It happened to me with blocks which take more than 1 or 2 GB once serialized I think the problem was that during serialization, a Byte Array is

Re: Spark Streaming scheduling control

2014-10-20 Thread davidkl
Thanks Akhil Das-2: actually I tried setting spark.default.parallelism but no effect :-/ I am running standalone and performing a mix of map/filter/foreachRDD. I had to force parallelism with repartition to get both workers to process tasks, but I do not think this should be required (and I am

Re: MLlib linking error Mac OS X

2014-10-20 Thread npomfret
I'm getting the same warning on my mac. Accompanied by what appears to be pretty low CPU usage (http://apache-spark-user-list.1001560.n3.nabble.com/mlib-model-build-and-low-CPU-usage-td16777.html), I wonder if they are connected? I've used jblas on a mac several times, it always just works

RDD to Multiple Tables SparkSQL

2014-10-20 Thread critikaled
Hi, I have an RDD which I want to register as multiple tables based on a key: val context = new SparkContext(conf) val sqlContext = new org.apache.spark.sql.hive.HiveContext(context) import sqlContext.createSchemaRDD case class KV(key:String, id:String, value:String) val logsRDD =

Re: Spark Concepts

2014-10-20 Thread Kamal Banga
1) Yes, a single node can have multiple workers. SPARK_WORKER_INSTANCES (in conf/spark-env.sh) is used to set the number of worker instances to run on each machine (default is 1). If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker
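
A sketch of how that looks in conf/spark-env.sh (the numbers are illustrative, not recommendations):

    # conf/spark-env.sh
    export SPARK_WORKER_INSTANCES=2   # run two worker JVMs on this machine
    export SPARK_WORKER_CORES=4       # cap each worker at 4 cores so the workers don't all claim every core
    export SPARK_WORKER_MEMORY=8g     # memory made available to each worker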

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Arian Pasquali
Hi, I’m using Spark 1.1.0 and I’m having some issues setting up memory options. I get “Requested array size exceeds VM limit” and I’m probably missing something regarding memory configuration (https://spark.apache.org/docs/1.1.0/configuration.html). My server has 30G of memory and these are my

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Akhil Das
Hi Arian, You get this exception because you are trying to create an array that is larger than the maximum contiguous block of memory in your Java VM's heap. Since you are setting the worker memory to *5Gb* but are exporting the *_OPTS as *8Gb*, your application actually thinks it has

Re: What executes on worker and what executes on driver side

2014-10-20 Thread Kamal Banga
1. All RDD operations are executed in workers. So reading a text file or executing val x = 1 will happen on the worker. (link http://stackoverflow.com/questions/24637312/spark-driver-in-apache-spark) 2. a. Without broadcast: Let's say you have 'n' nodes. You can set Hadoop's replication factor to n

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Arian Pasquali
Hi Akhil, thanks for your help, but I was originally running without the -Xmx option. With it I was just trying to push the limit of my heap size, but I was obviously doing it wrong. Arian Pasquali http://about.me/arianpasquali 2014-10-20 12:24 GMT+01:00 Akhil Das ak...@sigmoidanalytics.com: Hi

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Akhil Das
Try setting SPARK_EXECUTOR_MEMORY=5g (not sure how many workers you have). You can also set the executor memory while creating the SparkContext (like *sparkContext.set("spark.executor.memory", "5g")*). Thanks Best Regards On Mon, Oct 20, 2014 at 5:01 PM, Arian Pasquali ar...@arianpasquali.com
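
A sketch of the programmatic variant, assuming the property is set on a SparkConf before the SparkContext is created (the 5g value is illustrative):

    // either export SPARK_EXECUTOR_MEMORY=5g before launching, or set it on the conf:
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("memory-example")
      .set("spark.executor.memory", "5g")  // must fit within what each worker actually offers
    val sc = new SparkContext(conf)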

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-10-20 Thread Guillaume Pitel
Hi, The array size you (or the serializer) tries to allocate is just too big for the JVM. No configuration can help: https://plumbr.eu/outofmemoryerror/requested-array-size-exceeds-vm-limit The only option is to split your problem further by increasing parallelism. Guillaume Hi, I’m using

Re: What does KryoException: java.lang.NegativeArraySizeException mean?

2014-10-20 Thread Guillaume Pitel
Well, reading your logs, here is what happens: you do a combineByKey (so you probably have a join somewhere), which spills to disk because it's too big. To spill to disk it serializes, and the blocks are 2GB. From a 2GB dataset, it's easy to expand to several TB. Increase parallelism, make

How to show RDD size

2014-10-20 Thread marylucy
In spark-shell, I do as follows: val input = sc.textFile("hdfs://192.168.1.10/people/testinput/") input.cache() In the web UI, I cannot see any RDD in the Storage tab. Can anyone tell me how to show the RDD size? Thank you

Re: How to show RDD size

2014-10-20 Thread Nicholas Chammas
I believe it won't show up there until you trigger an action that causes the RDD to actually be cached. Remember that certain operations in Spark are *lazy*, and caching is one of them. Nick On Mon, Oct 20, 2014 at 9:19 AM, marylucy qaz163wsx_...@hotmail.com wrote: in spark-shell,I do in
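
A quick spark-shell illustration of that laziness, using the path from the question:

    val input = sc.textFile("hdfs://192.168.1.10/people/testinput/")
    input.cache()  // only marks the RDD as cacheable; nothing is computed yet
    input.count()  // an action forces evaluation, so the blocks actually get cached
    // after the count, the RDD and its size should appear under the Storage tab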

Re: What executes on worker and what executes on driver side

2014-10-20 Thread Saurabh Wadhawan
What about: http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCAF_KkPwk7iiQVD2JzOwVVhQ_U2p3bPVM=-bka18v4s-5-lp...@mail.gmail.com%3E Regards -

Re: Designed behavior when master is unreachable.

2014-10-20 Thread preeze
Hi Andrew, The behavior that I see now is that under the hood it tries to reconnect endlessly. While this lasts, the thread that tries to fire a new task is blocked at JobWaiter.awaitResult() and never gets released. The full stacktrace for spark-1.0.2 is: jmsContainer-7 prio=10

Re: How to show RDD size

2014-10-20 Thread marylucy
Thank you for your reply! Is the unpersist operation lazy? If yes, how do I decrease memory usage as quickly as possible? On Oct 20, 2014, at 21:26, Nicholas Chammas nicholas.cham...@gmail.com wrote: I believe it won't show up there until you trigger an action that causes the RDD to actually be cached.

Re: How to show RDD size

2014-10-20 Thread Nicholas Chammas
No, I believe unpersist acts immediately. On Mon, Oct 20, 2014 at 10:13 AM, marylucy qaz163wsx_...@hotmail.com wrote: Thank you for your reply! Is the unpersist operation lazy? If yes, how do I decrease memory usage as quickly as possible? On Oct 20, 2014, at 21:26, Nicholas Chammas

Re: MLlib linking error Mac OS X

2014-10-20 Thread Evan Sparks
MLlib relies on breeze for much of its linear algebra, which in turn relies on netlib-java. netlib-java will attempt to load a native BLAS at runtime and then attempt to load its own precompiled version. Failing that, it will default back to a Java version that it has built in. The Java

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-20 Thread Matt Narrell
http://spark.apache.org/docs/latest/streaming-programming-guide.html foreachRDD is executed on the driver…. mn On Oct 20, 2014, at 3:07 AM, Gerard Maas gerard.m...@gmail.com wrote: Pinging TD -- I'm sure you know :-)

Re: Spark Streaming scheduling control

2014-10-20 Thread davidkl
One detail, even forcing partitions (/repartition/), spark is still holding some tasks; if I increase the load of the system (increasing /spark.streaming.receiver.maxRate/), even if all workers are used, the one with the receiver gets twice as many tasks compared with the other workers. Total

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-20 Thread Terry Siu
Hi Yin, Sorry for the delay, but I’ll try the code change when I get a chance, but Michael’s initial response did solve my problem. In the meantime, I’m hitting another issue with SparkSQL which I will probably post another message if I can’t figure a workaround. Thanks, -Terry From: Yin

Re: why fetch failed

2014-10-20 Thread DB Tsai
I ran into the same issue when the dataset is very big. Marcelo from Cloudera found that it may be caused by SPARK-2711, so their Spark 1.1 release reverted SPARK-2711, and the issue is gone. See https://issues.apache.org/jira/browse/SPARK-3633 for detail. You can checkout Cloudera's version

Spark-jobserver for java apps

2014-10-20 Thread Tomer Benyamini
Hi, I'm working on the problem of remotely submitting apps to the spark master. I'm trying to use the spark-jobserver project (https://github.com/ooyala/spark-jobserver) for that purpose. For scala apps looks like things are working smoothly, but for java apps, I have an issue with implementing

Spark Streaming occasionally hangs after processing first batch

2014-10-20 Thread t1ny
Hi all, Spark Streaming occasionally (not always) hangs indefinitely on my program right after the first batch has been processed. As you can see in the following screenshots of the Spark Streaming monitoring UI, it hangs on the map stages that correspond (I assume) to the second batch that is

How to emit multiple keys for the same value?

2014-10-20 Thread HARIPRIYA AYYALASOMAYAJULA
Hello, I am facing a problem with implementing this: my mapper should emit multiple keys for the same value - for every input (k, v) it should emit (k, v), (k+1, v), (k+2, v), ..., (k+n, v). In MapReduce, it was pretty straightforward - I used a for loop and performed a Context.write within it.

Re: How to emit multiple keys for the same value?

2014-10-20 Thread Boromir Widas
flatMap should help, it returns a Seq for every input. On Mon, Oct 20, 2014 at 12:31 PM, HARIPRIYA AYYALASOMAYAJULA aharipriy...@gmail.com wrote: Hello, I am facing a problem with implementing this - My mapper should emit multiple keys for the same value - for every input (k, v) it should

Re: How to emit multiple keys for the same value?

2014-10-20 Thread DB Tsai
You can do this using flatMap which return a Seq of (key, value) pairs. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Oct 20, 2014 at 9:31 AM, HARIPRIYA AYYALASOMAYAJULA
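
A minimal Scala sketch of that flatMap approach, assuming an existing pair RDD named `pairs` with integer keys and some fixed n:

    val n = 3
    val expanded = pairs.flatMap { case (k, v) =>
      (0 to n).map(i => (k + i, v))  // emits (k, v), (k+1, v), ..., (k+n, v)
    }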

Re: How to aggregate data in Apach Spark

2014-10-20 Thread Davies Liu
You also could use Spark SQL: from pyspark.sql import Row, SQLContext row = Row('id', 'C1', 'C2', 'C3') # convert each data = sc.textFile("test.csv").map(lambda line: line.split(',')) sqlContext = SQLContext(sc) rows = data.map(lambda r: row(*r))

SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Terry Siu
Hi all, I’m getting a TreeNodeException for unresolved attributes when I do a simple select from a schemaRDD generated by a join in Spark 1.1.0. A little background first. I am using a HiveContext (against Hive 0.12) to grab two tables, join them, and then perform multiple INSERT-SELECT with

java.lang.OutOfMemoryError: Java heap space during reduce operation

2014-10-20 Thread ayandas84
Hi, *In a reduce operation I am trying to accumulate a list of SparseVectors. The code is given below;* val WNode = trainingData.reduce{ (node1: Node, node2: Node) => val wNode = new Node(num1, num2) wNode.WhatList ++= (node1.WList)

Re: Oryx + Spark mllib

2014-10-20 Thread Debasish Das
Thanks for the pointers. I will look into the oryx2 design and see whether we need a spray/akka HTTP-based backend... I feel we will, especially when we have a model database for a number of scenarios (say 100 scenarios, each building a different ALS model). I am not sure if we really need a full blown

CustomReceiver : ActorOf vs ActorSelection

2014-10-20 Thread vvarma
I have been trying to implement a CustomReceiverStream using Akka actors. I have a feeder actor which produces messages and a receiver actor (I give this to the Spark Streaming context to get an actor stream) which subscribes to the feeder. When I create the feeder actor within the scope of the

Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Michael Armbrust
Have you tried this on master? There were several problems with resolution of complex queries that were registered as tables in the 1.1.0 release. On Mon, Oct 20, 2014 at 10:33 AM, Terry Siu terry@smartfocus.com wrote: Hi all, I’m getting a TreeNodeException for unresolved attributes

Re: SparkSQL - TreeNodeException for unresolved attributes

2014-10-20 Thread Terry Siu
Hi Michael, Thanks again for the reply. Was hoping it was something I was doing wrong in 1.1.0, but I’ll try master. Thanks, -Terry From: Michael Armbrust mich...@databricks.commailto:mich...@databricks.com Date: Monday, October 20, 2014 at 12:11 PM To: Terry Siu

Re: RDD Cleanup

2014-10-20 Thread maihung
Hi Prem, I am experiencing the same problem on Spark 1.0.2 and Job Server 0.4.0. Did you find a solution for this problem? Thank you, Hung

Saving very large data sets as Parquet on S3

2014-10-20 Thread Daniel Mahler
I am trying to convert some json logs to Parquet and save them on S3. In principle this is just import org.apache.spark._ val sqlContext = new sql.SQLContext(sc) val data = sqlContext.jsonFile("s3n://source/path/*/*", 10e-8) data.registerAsTable("data") data.saveAsParquetFile("s3n://target/path") This

Re: ALS implicit error pyspark

2014-10-20 Thread Gen
Hi, everyone, According to Xiangrui Meng(I think that he is the author of ALS), this problem is caused by Kryo serialization: /In PySpark 1.1, we switched to Kryo serialization by default. However, ALS code requires special registration with Kryo in order to work. The error happens when there

Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I am launching EC2 clusters using the spark-ec2 scripts. My understanding is that this configures Spark to use the available resources. I can see that Spark will use the available memory on larger instance types. However, I have never seen Spark running at more than 400% (using 100% on 4 cores) on

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniil Osipov
How are you launching the cluster, and how are you submitting the job to it? Can you list any Spark configuration parameters you provide? On Mon, Oct 20, 2014 at 12:53 PM, Daniel Mahler dmah...@gmail.com wrote: I am launching EC2 clusters using the spark-ec2 scripts. My understanding is that

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
Perhaps your RDD is not partitioned enough to utilize all the cores in your system. Could you post a simple code snippet and explain what kind of parallelism you are seeing for it? And can you report on how many partitions your RDDs have? On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler

example.jar caused exception when running pi.py, spark 1.1

2014-10-20 Thread freedafeng
I created an EC2 cluster using the spark-ec2 command. If I run the pi.py example in the cluster without using the example.jar, it works. But if I add the example.jar as the driver class (something like the following), it fails with an exception. Could anyone help with this? -- what is the cause of the problem?

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I usually run interactively from the spark-shell. My data definitely has more than enough partitions to keep all the workers busy. When I first launch the cluster I do: cat <<EOF >> ~/spark/conf/spark-defaults.conf spark.serializer

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I launch the cluster using vanilla spark-ec2 scripts. I just specify the number of slaves and instance type On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler dmah...@gmail.com wrote: I usually run interactively from the spark-shell. My data definitely has more than enough partitions to keep all

saveasSequenceFile with codec and compression type

2014-10-20 Thread gpatcham
Hi All, I'm trying to save an RDD as a sequence file and am not able to set the compression type (BLOCK or RECORD). Can anyone let me know how we can set the compression type? Here is the code I'm using: RDD.saveAsSequenceFile(target, Some(classOf[org.apache.hadoop.io.compress.GzipCodec])) Thanks
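
One possible (untested) approach is to set the Hadoop output compression type on the SparkContext's Hadoop configuration before saving, since saveAsSequenceFile goes through the old-style Hadoop output format:

    // untested sketch: hint BLOCK compression to the SequenceFile output format
    sc.hadoopConfiguration.set("mapred.output.compression.type", "BLOCK")
    rdd.saveAsSequenceFile("target", Some(classOf[org.apache.hadoop.io.compress.GzipCodec]))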

How do you write a JavaRDD into a single file

2014-10-20 Thread Steve Lewis
At the end of a set of computations I have a JavaRDD<String>. I want a single file where each string is printed in order. The data is small enough that it is acceptable to handle the printout on a single processor. It may be large enough that using collect to generate a list might be unacceptable.

worker_instances vs worker_cores

2014-10-20 Thread anny9699
Hi, I have a question about the worker_instances setting and worker_cores setting in aws ec2 cluster. I understand it is a cluster and the default setting in the cluster is *SPARK_WORKER_CORES = 8 SPARK_WORKER_INSTANCES = 1* However after I changed it to *SPARK_WORKER_CORES = 8

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Sean Owen
This was covered a few days ago: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html The multiple output files is actually essential for parallelism, and certainly not a bad idea. You don't want 100 distributed workers writing to 1
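
If a single local file is still needed afterwards, one common post-processing step outside Spark is HDFS getmerge (paths are illustrative):

    # concatenates all part-xxxxx files in the output directory into one local file
    hadoop fs -getmerge hdfs:///output/dir merged-output.txt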

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
Are you dealing with gzipped files by any chance? Does explicitly repartitioning your RDD to match the number of cores in your cluster help at all? How about if you don't specify the configs you listed and just go with defaults all around? On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler

Re: Error while running Streaming examples - no snappyjava in java.library.path

2014-10-20 Thread Buntu Dev
Thanks Akhil. On Mon, Oct 20, 2014 at 1:57 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Its a known bug in JDK7 and OSX's naming convention, here's how to resolve it: 1. Get the Snappy jar file from http://central.maven.org/maven2/org/xerial/snappy/snappy-java/ 2. Copy the appropriate

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Steve Lewis
Sorry I missed the discussion - although it did not answer the question - In my case (and I suspect the asker's) the 100 slaves are doing a lot of useful work but the generated output is small enough to be handled by a single process. Many of the large data problems I have worked on process a lot of

Re: example.jar caused exception when running pi.py, spark 1.1

2014-10-20 Thread freedafeng
Fixed by recompiling. Thanks.

spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Hello Experts, After repeated attempts I am unable to run a query that maps a JSON date string. I tried two approaches: *** Approach 1 *** I created a Bean class with a timestamp field. When I try to run it I get scala.MatchError: class java.sql.Timestamp (of class java.lang.Class). Here is the code: import

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Nicholas Chammas
The biggest danger with gzipped files is this: raw = sc.textFile("/path/to/file.gz", 8) raw.getNumPartitions()  # => 1 You think you’re telling Spark to parallelize the reads on the input, but Spark cannot parallelize reads against gzipped files. So 1 gzipped file gets assigned to 1 partition. It might
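
A small sketch of the usual workaround: repartition right after the read so downstream stages run in parallel even though the gzipped read itself cannot:

    val raw = sc.textFile("/path/to/file.gz")  // a single gzip file always lands in a single partition
    val spread = raw.repartition(8)            // shuffles the data across 8 partitions for later stages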

Re: Getting spark to use more than 4 cores on Amazon EC2

2014-10-20 Thread Daniel Mahler
I am using globs though raw = sc.textFile("/path/to/dir/*/*") and I have tons of files so 1 file per partition should not be a problem. On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: The biggest danger with gzipped files is this: raw =

Re: spark sql: timestamp in json - fails

2014-10-20 Thread Michael Armbrust
I think you are running into a bug that will be fixed by this PR: https://github.com/apache/spark/pull/2850 On Mon, Oct 20, 2014 at 4:34 PM, tridib tridib.sama...@live.com wrote: Hello Experts, After repeated attempt I am unable to run query on map json date string. I tried two approaches:

add external jars to spark-shell

2014-10-20 Thread Chuang Liu
Hi: I am using Spark 1.1 and want to add external jars to spark-shell. I dug around and found that others are doing it in two ways. *Method 1* bin/spark-shell --jars path-to-jars --master ... *Method 2* ADD_JARS=path-to-jars SPARK_CLASSPATH=path-to-jars bin/spark-shell --master ... What is

Re: worker_instances vs worker_cores

2014-10-20 Thread Andrew Ash
Hi Anny, SPARK_WORKER_INSTANCES is the number of copies of the Spark worker running on a single box. If you change the number, you change how your hardware is split up (useful for breaking large servers into 32GB heaps, which perform better), but it doesn't change the amount of hardware you

Re: worker_instances vs worker_cores

2014-10-20 Thread Anny Chen
Thanks a lot Andrew! Yeah I actually realized that later. I made a silly mistake here. On Mon, Oct 20, 2014 at 6:03 PM, Andrew Ash and...@andrewash.com wrote: Hi Anny, SPARK_WORKER_INSTANCES is the number of copies of spark workers running on a single box. If you change the number you change

Re: Shuffle files

2014-10-20 Thread Chen Song
My observation is the opposite. When my job runs under the default spark.shuffle.manager, I don't see this exception. However, when it runs with the SORT-based manager, I start seeing this error. How could that be possible? I am running my job on YARN, and I noticed that the YARN process limits (cat

RE: Shuffle files

2014-10-20 Thread Shao, Saisai
Hi Song, From what I know of sort-based shuffle, the number of files opened in parallel is normally much smaller than for hash-based shuffle. In hash-based shuffle, the number of files opened in parallel is C * R (where C is the number of cores used and R is the number of reducers); as you can see, the file

Re: add external jars to spark-shell

2014-10-20 Thread Denny Lee
--jars (ADD_JARS) is a special class loading path for Spark, while --driver-class-path (SPARK_CLASSPATH) is captured by the startup scripts and appended to the classpath settings used to start the JVM running the driver. You can reference https://www.concur.com/blog/en-us/connect-tableau-to-sparksql

Re: spark sql: timestamp in json - fails

2014-10-20 Thread Yin Huai
Hi Tridib, For the second approach, can you attach the complete stack trace? Thanks, Yin On Mon, Oct 20, 2014 at 8:24 PM, Michael Armbrust mich...@databricks.com wrote: I think you are running into a bug that will be fixed by this PR: https://github.com/apache/spark/pull/2850 On Mon, Oct

Re: why does driver connects to master fail ?

2014-10-20 Thread randylu
The cluster also runs other applications every hour as normal, so the master is always running. No matter how many cores I use or the quantity of input data (as long as it is big enough), the application just fails about 1.1 hours later.

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
I think this has something to do with my recent work at https://issues.apache.org/jira/browse/SPARK-4003 You can check PR https://github.com/apache/spark/pull/2850 . Thanks, Daoyuan From: Yin Huai [mailto:huaiyin@gmail.com] Sent: Tuesday, October 21, 2014 10:00 AM To: Michael Armbrust Cc:

Help with an error

2014-10-20 Thread Sunandan Chakraborty
Hi, I am trying to use Spark to perform some basic text processing on news articles. Recently I have been facing issues with code which ran perfectly well on the same data before. I am pasting the last few lines, including the exception message. I am using Python. Can anybody suggest a remedy? Thanks,

Re: spark sql: timestamp in json - fails

2014-10-20 Thread Yin Huai
Seems the second approach does not go through applySchema. So, I was wondering if there is an issue related to our JSON apis in Java. On Mon, Oct 20, 2014 at 10:04 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote: I think this has something to do with my recent work at

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
I got that, it is in JsonRDD.java of `typeOfPrimitiveValues`. I’ll fix that together. Thanks, Daoyuan From: Yin Huai [mailto:huaiyin@gmail.com] Sent: Tuesday, October 21, 2014 10:13 AM To: Wang, Daoyuan Cc: Michael Armbrust; tridib; u...@spark.incubator.apache.org Subject: Re: spark sql:

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
Seems I made a mistake… From: Wang, Daoyuan Sent: Tuesday, October 21, 2014 10:35 AM To: 'Yin Huai' Cc: Michael Armbrust; tridib; u...@spark.incubator.apache.org Subject: RE: spark sql: timestamp in json - fails I got that, it is in JsonRDD.java of `typeOfPrimitiveValues`. I’ll fix that

Spark SQL : sqlContext.jsonFile date type detection and perforormance

2014-10-20 Thread tridib
Hi Spark SQL team, I am trying to explore automatic schema detection for JSON documents. I have a few questions: 1. What should the date format be for fields to be detected as date type? 2. Is automatic schema inference slower than applying a specific schema? 3. At this moment I am parsing the JSON myself using map
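
Regarding question 2, a rough Scala sketch of supplying an explicit schema to jsonRDD so no inference pass is needed (the field names and types are assumptions; whether TimestampType behaves correctly here in 1.1.0 is exactly what the related thread above questions):

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)
    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("created", TimestampType, nullable = true)))
    // jsonRDD with an explicit schema skips the sampling/inference step entirely
    val events = sqlContext.jsonRDD(jsonLines, schema)  // jsonLines: RDD[String]
    events.registerTempTable("events")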

Convert Iterable to RDD

2014-10-20 Thread Dai, Kevin
Hi, All, Is there any way to convert an Iterable to an RDD? Thanks, Kevin.

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread jay vyas
sounds more like a use case for using collect... and writing out the file in your program? On Mon, Oct 20, 2014 at 6:53 PM, Steve Lewis lordjoe2...@gmail.com wrote: Sorry I missed the discussion - although it did not answer the question - In my case (and I suspect the askers) the 100 slaves

Re: spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Stack trace for my second case: 2014-10-20 23:00:36,903 ERROR [Executor task launch worker-0] executor.Executor (Logging.scala:logError(96)) - Exception in task 0.0 in stage 0.0 (TID 0) scala.MatchError: TimestampType (of class org.apache.spark.sql.catalyst.types.TimestampType$) at

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Ilya Ganelin
Hey Steve - the way to do this is to use the coalesce() function to coalesce your RDD into a single partition. Then you can do a saveAsTextFile and you'll wind up with outputDir/part-0 containing all the data. -Ilya Ganelin On Mon, Oct 20, 2014 at 11:01 PM, jay vyas

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
That's weird, I think we have that Pattern match in enforceCorrectType. What version of spark are you using? Thanks, Daoyuan -Original Message- From: tridib [mailto:tridib.sama...@live.com] Sent: Tuesday, October 21, 2014 11:03 AM To: u...@spark.incubator.apache.org Subject: Re: spark

RE: Convert Iterable to RDD

2014-10-20 Thread Dai, Kevin
In addition, how do I convert Iterable[Iterable[T]] to RDD[T]? Thanks, Kevin. From: Dai, Kevin [mailto:yun...@ebay.com] Sent: Tuesday, October 21, 2014 10:58 To: user@spark.apache.org Subject: Convert Iterable to RDD Hi, All, Is there any way to convert an Iterable to an RDD? Thanks, Kevin.

RE: spark sql: timestamp in json - fails

2014-10-20 Thread tridib
Spark 1.1.0

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
The exception of second approach, has been resolved by SPARK-3853. Thanks, Daoyuan -Original Message- From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com] Sent: Tuesday, October 21, 2014 11:06 AM To: tridib; u...@spark.incubator.apache.org Subject: RE: spark sql: timestamp in json -

RE: spark sql: timestamp in json - fails

2014-10-20 Thread Wang, Daoyuan
Yes, SPARK-3853 just got merged 11 days ago. It should be OK in 1.2.0. And for the first approach, It would be ok after SPARK-4003 is merged. -Original Message- From: tridib [mailto:tridib.sama...@live.com] Sent: Tuesday, October 21, 2014 11:09 AM To: u...@spark.incubator.apache.org

Re: default parallelism bug?

2014-10-20 Thread Yi Tian
Could you show your Spark version, and the value of `spark.default.parallelism` you are setting? Best Regards, Yi Tian tianyi.asiai...@gmail.com On Oct 20, 2014, at 12:38, Kevin Jung itsjb.j...@samsung.com wrote: Hi, I usually use a file on HDFS to make a PairRDD and analyze it by using

Re: How to not write empty RDD partitions in RDD.saveAsTextFile()

2014-10-20 Thread Yi Tian
I think you could use `repartition` to make sure there are no empty partitions. You could also try `coalesce` to combine partitions, but it can't guarantee there are no empty partitions. Best Regards, Yi Tian tianyi.asiai...@gmail.com On Oct 18, 2014, at 20:30,
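
A minimal sketch of the repartition suggestion (the partition count is illustrative):

    // repartition shuffles records evenly, so no output task ends up writing an empty part file
    rdd.repartition(16).saveAsTextFile("hdfs:///output/path")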

spark sql: join sql fails after sqlCtx.cacheTable()

2014-10-20 Thread tridib
Hello Experts, I have two tables built using jsonFile(). I can successfully run join queries on these tables, but once I cacheTable(), all join queries fail. Here is the stack trace: java.lang.NullPointerException at

Re: default parallelism bug?

2014-10-20 Thread Kevin Jung
I use Spark 1.1.0 and set these options in spark-defaults.conf: spark.scheduler.mode FAIR spark.cores.max 48 spark.default.parallelism 72 Thanks, Kevin

Re: why does driver connects to master fail ?

2014-10-20 Thread Akhil Das
You could try setting the spark.akka.frameSize while creating the sparkContext, but it's strange that the message says your master is dead; usually it's the other way around and an executor dies. Can you also explain the behavior of your application (what exactly are you doing over the 8GB of data)?

  1   2   >