Re: Understanding Spark's caching

2015-04-28 Thread ayan guha
Hi, I replied to you on SO. If option A had an action call then it should suffice too. On 28 Apr 2015 05:30, Eran Medan eran.me...@gmail.com wrote: Hi Everyone! I'm trying to understand how Spark's caching works. Here is my naive understanding, please let me know if I'm missing something: val

Re: How to add jars to standalone pyspark program

2015-04-28 Thread jamborta
ah, just noticed that you are using an external package, you can add that like this: conf = (SparkConf().set("spark.jars", jar_path)) or if it is a python package: sc.addPyFile() -- View this message in context:

Re: Spark 1.3.1 Hadoop 2.4 Prebuilt package broken ?

2015-04-28 Thread ๏̯͡๏
Worked now. On Mon, Apr 27, 2015 at 10:20 PM, Sean Owen so...@cloudera.com wrote: Works fine for me. Make sure you're not downloading the HTML redirector page and thinking it's the archive. On Mon, Apr 27, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I downloaded 1.3.1

Re: How to add jars to standalone pyspark program

2015-04-28 Thread jamborta
Hi Mark, That does not look like a Python path issue; the spark-assembly jar should have those packaged, and should make them available to the workers. Have you built the jar yourself? -- View this message in context:

Re: Serialization error

2015-04-28 Thread ๏̯͡๏
arguments are values of it. The name of the argument is important and all you need to do is specify those when you're creating the SparkConf object. Glad it worked. On Tue, Apr 28, 2015 at 5:20 PM, madhvi madhvi.gu...@orkash.com wrote: Thank you Deepak. It worked. Madhvi On Tuesday 28 April 2015

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Todd Nist
Can you simply apply the https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.util.StatCounter to this? You should be able to do something like this: val stats = RDD.map(x => x._2).stats() -Todd On Tue, Apr 28, 2015 at 10:00 AM, subscripti...@prismalytics.io
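For concreteness, a minimal sketch of the stats() approach above (sample data and variable names are made up); note that it computes statistics over all values and ignores the keys, which is why the per-key aggregateByKey approach comes up later in this thread:

    val rdd1 = sc.parallelize(Seq(("2013-10-09", 7.6), ("2013-10-10", 9.3), ("2013-10-10", 28.3)))
    val stats = rdd1.map(_._2).stats()   // returns a StatCounter (count, mean, stdev, max, min)
    println(stats.mean)                  // average over all values, keys ignored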

Single stream with series of transformations

2015-04-28 Thread jc.francisco
Hi, I'm following the pattern of filtering data by a certain criteria, and then saving the results to a different table. The code below illustrates the idea. The simple integration test I wrote suggests it works, simply asserting filtered data should be in their respective tables after being

Re: Spark partitioning question

2015-04-28 Thread Marius Danciu
Thank you Silvio, I am aware of groupBy limitations and this is subject for replacement. I did try repartitionAndSortWithinPartitions but then I end up with maybe too much shuffling, one shuffle from groupByKey and the other from repartition. My expectation was that since N records are partitioned to

Re: Scalability of group by

2015-04-28 Thread Richard Marscher
Hi, I can offer a few ideas to investigate in regards to your issue here. I've run into resource issues doing shuffle operations with a much smaller dataset than 2B. The data is going to be saved to disk by the BlockManager as part of the shuffle and then redistributed across the cluster as

Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread subscripti...@prismalytics.io
Hello Friends: I generated a Pair RDD with K/V pairs, like so: rdd1.take(10) # Show a small sample. [(u'2013-10-09', 7.60117302052786), (u'2013-10-10', 9.322709163346612), (u'2013-10-10', 28.264462809917358), (u'2013-10-07', 9.664429530201343), (u'2013-10-07', 12.461538461538463),

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Dean Wampler
Are the tasks on the slaves also running as root? If not, that might explain the problem. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler

Question about Memory Used and VCores Used

2015-04-28 Thread bit1...@163.com
Hi,guys, I have the following computation with 3 workers: spark-sql --master yarn --executor-memory 3g --executor-cores 2 --driver-memory 1g -e 'select count(*) from table' The resources used are shown as below on the UI: I don't understand why the memory used is 15GB and vcores used is 5. I

Re: HBase HTable constructor hangs

2015-04-28 Thread tridib
I am exactly having same issue. I am running hbase and spark in docker container. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HBase-HTable-constructor-hangs-tp4926p22696.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: MLLib SVMWithSGD is failing for large dataset

2015-04-28 Thread ai he
Hi Sarath, It might be questionable to set num-executors as 64 if you only have 8 nodes. Do you use any action like collect which will overwhelm the driver since you have a large dataset? Thanks On Tue, Apr 28, 2015 at 10:50 AM, sarath sarathkrishn...@gmail.com wrote: I am trying to train a

RE: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-28 Thread Mohammed Guller
One reason Spark on disk is faster than MapReduce is Spark’s advanced Directed Acyclic Graph (DAG) engine. MapReduce will require a complex job to be split into multiple Map-Reduce jobs, with disk I/O at the end of each job and beginning of a new job. With Spark, you may be able to express the
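To illustrate the DAG point with a sketch (not from the original mail; the input path and threshold are assumptions), the whole pipeline below runs as a single Spark job over one DAG, whereas an equivalent MapReduce flow would typically be several chained jobs, each writing intermediate output to disk:

    val result = sc.textFile("hdfs:///input")           // assumed input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                               // the only shuffle in this chain
      .filter { case (_, count) => count > 10 }         // assumed threshold
      .count()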

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-28 Thread Koert Kuipers
Our experience is that, unless you can benefit from Spark features such as co-partitioning that allow for more efficient execution, Spark is slightly slower for disk to disk. On Apr 27, 2015 10:34 PM, bit1...@163.com bit1...@163.com wrote: Hi, I am frequently asked why spark is also much

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
1. The full command line is written in a shell script: LIB=/home/spark/.m2/repository /opt/spark/bin/spark-submit \ --class spark.pcap.run.TestPcapSpark \ --jars

Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread Saisai Shao
I think currently there's no API in Spark Streaming you can use to get the file names for file input streams. Actually it is not trivial to support this; maybe you could file a JIRA with the wishes you want the community to support, so anyone who is interested can take a crack at it. Thanks Jerry

Re: HBase HTable constructor hangs

2015-04-28 Thread Ted Yu
Can you give us more information ? Such as hbase release, Spark release. If you can pastebin jstack of the hanging HTable process, that would help. BTW I used http://search-hadoop.com/?q=spark+HBase+HTable+constructor+hangs and saw a very old thread with this subject. Cheers On Tue, Apr 28,

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
BTW, from the Spark web UI, the ACL is marked with root. Best regards, Lin Hao XU IBM Research China Email: xulin...@cn.ibm.com My Flickr: http://www.flickr.com/photos/xulinhao/sets From: Dean Wampler deanwamp...@gmail.com To: Lin Hao Xu/China/IBM@IBMCN Cc: Hai Shan Wu/China/IBM@IBMCN,

Re: Serialization error

2015-04-28 Thread madhvi
Thank you Deepak. It worked. Madhvi On Tuesday 28 April 2015 01:39 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: val conf = new SparkConf() .setAppName(detail) .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)

Re: 1.3.1: Persisting RDD in parquet - Conflicting partition column names

2015-04-28 Thread ayan guha
Can you show your code please? On 28 Apr 2015 13:20, sranga sra...@gmail.com wrote: Hi I am getting the following error when persisting an RDD in parquet format to an S3 location. This is code that was working in version 1.2; it fails in 1.3.1. Any help is

Re: How to add jars to standalone pyspark program

2015-04-28 Thread Fabian Böhnlein
Can you specify what 'running via PyCharm' means — how are you executing the script, with spark-submit? In PySpark I guess you used --jars databricks-csv.jar. With spark-submit you might need the additional --driver-class-path databricks-csv.jar. Both parameters cannot be set via the SparkConf object.

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Dean Wampler
It's probably not your code. What's the full command line you use to submit the job? Are you sure the job on the cluster has access to the network interface? Can you test the receiver by itself without Spark? For example, does this line work as expected: List<PcapNetworkInterface> nifs =

Re: submitting to multiple masters

2015-04-28 Thread michal.klo...@gmail.com
According to the docs it should go like this: spark://host1:port1,host2:port2 https://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper Thanks M On Apr 28, 2015, at 8:13 AM, James King jakwebin...@gmail.com wrote: I have multiple masters running and I'm
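For illustration, a sketch of the documented format (host names are assumptions): the list of masters goes under a single spark:// prefix, rather than one prefix per host as in the original submission:

    // Sketch only; equivalent to passing --master spark://master01:7077,master02:7077 to spark-submit
    val conf = new SparkConf()
      .setMaster("spark://master01:7077,master02:7077")
      .setAppName("ha-submit-example")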

Spark partitioning question

2015-04-28 Thread Marius Danciu
Hello all, I have the following Spark (pseudo)code: rdd = mapPartitionsWithIndex(...) .mapPartitionsToPair(...) .groupByKey() .sortByKey(comparator) .partitionBy(myPartitioner) .mapPartitionsWithIndex(...) .mapPartitionsToPair( f ) The input

Re: How to add jars to standalone pyspark program

2015-04-28 Thread ayan guha
It's a Windows thing. Please escape the front slash in the string; basically it is not able to find the file. On 28 Apr 2015 22:09, Fabian Böhnlein fabian.boehnl...@gmail.com wrote: Can you specify 'running via PyCharm'? How are you executing the script, with spark-submit? In PySpark I guess you used

Re: StandardScaler failing with OOM errors in PySpark

2015-04-28 Thread Rok Roskar
That's exactly what I'm saying -- I specify the memory options using Spark options, but this is not reflected in how the JVM is created. No matter which memory settings I specify, the JVM for the driver is always created with 512MB of memory. So I'm not sure if this is a feature or a bug? rok On

Re: spark-defaults.conf

2015-04-28 Thread James King
So no takers regarding why spark-defaults.conf is not being picked up. Here is another one: If Zookeeper is configured in Spark why do we need to start a slave like this: spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh 1 spark://somemaster:7077 i.e. why do we need to specify the master url

submitting to multiple masters

2015-04-28 Thread James King
I have multiple masters running and I'm trying to submit an application using spark-1.3.0-bin-hadoop2.4/bin/spark-submit with this config (i.e. a comma separated list of master urls) --master spark://master01:7077,spark://master02:7077 But getting this exception

Re: MLLib SVMWithSGD is failing for large dataset

2015-04-28 Thread sarathkrishn...@gmail.com
Hi, I'm just calling the standard SVMWithSGD implementation of Spark's MLLib. I'm not using any method like collect. Thanks, Sarath On Tue, Apr 28, 2015 at 4:35 PM, ai he heai0...@gmail.com wrote: Hi Sarath, It might be questionable to set num-executors as 64 if you only have 8 nodes. Do

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Lin Hao Xu
Actually, to simplify this problem, we run our program on a single machine with 4 slave workers. Since it is a single machine, I think all slave workers are run with root privilege. BTW, if we have a cluster, how do we make sure slaves on remote machines run the program as root? Best regards, Lin Hao XU

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread bit1...@163.com
It looks to me that the same thing also applies to SparkContext.textFile or SparkContext.wholeTextFile: there is no way in the RDD to figure out which file the data in the RDD came from. bit1...@163.com From: Saisai Shao Date: 2015-04-29 10:10 To: lokeshkumar CC: spark users

Best practices on testing Spark jobs

2015-04-28 Thread Michal Michalski
Hi, I have two questions regarding testing Spark jobs: 1. Is it possible to use Mockito for that purpose? I tried to use it, but it looks like there are no interactions with mocks. I didn't dive into the details of how Mockito works, but I guess it might be because of the serialization and how

Re: hive-thriftserver maven artifact

2015-04-28 Thread Ted Yu
Credit goes to Misha Chernetsov (see SPARK-4925) FYI On Tue, Apr 28, 2015 at 8:25 AM, Marco marco@gmail.com wrote: Thx Ted for the info ! 2015-04-27 23:51 GMT+02:00 Ted Yu yuzhih...@gmail.com: This is available for 1.3.1:

Re: Best practices on testing Spark jobs

2015-04-28 Thread Sourav Chandra
Hi, Can you give some tutorials/examples on how to write test cases based on the mentioned framework? Thanks, Sourav On Tue, Apr 28, 2015 at 9:22 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Sorry that’s correct, I was thinking you were maybe trying to mock certain aspects of Spark

Re: Best practices on testing Spark jobs

2015-04-28 Thread Silvio Fiorito
Sorry that’s correct, I was thinking you were maybe trying to mock certain aspects of Spark core to write your tests. This is a library to help write unit tests by managing the SparkContext and StreamingContext. So you can test your transformations as necessary. More importantly on the

Re: Spark partitioning question

2015-04-28 Thread Silvio Fiorito
So the other issue could be due to the fact that, by using mapPartitions after the partitionBy, you essentially lose the partitioning of the keys, since Spark assumes the keys were altered in the map phase. So really the partitionBy gets lost after the mapPartitions; that’s why you need to do it again.
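A minimal sketch of one way to keep the partitioner (not spelled out in the thread; pairs, myPartitioner and transform are assumed names): if the map phase leaves the keys untouched, mapPartitions can be told so via its preservesPartitioning flag:

    val repartitioned = pairs.partitionBy(myPartitioner)
    val mapped = repartitioned.mapPartitions(
      iter => iter.map { case (k, v) => (k, transform(v)) },  // keys unchanged; transform is a placeholder
      preservesPartitioning = true)                           // keeps myPartitioner on the result RDD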

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Silvio Fiorito
If you need to keep the keys, you can use aggregateByKey to calculate an avg of the values: val step1 = data.aggregateByKey((0.0, 0))((a, b) => (a._1 + b, a._2 + 1), (a, b) => (a._1 + b._1, a._2 + b._2)) val avgByKey = step1.mapValues(i => i._1/i._2) Essentially, what this is doing is passing an
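A self-contained sketch of the same aggregateByKey pattern (sample data is made up), showing the (sum, count) zero value, the within-partition merge, and the cross-partition merge:

    val data = sc.parallelize(Seq(("2013-10-09", 7.6), ("2013-10-10", 9.3), ("2013-10-10", 28.3)))
    val sumCount = data.aggregateByKey((0.0, 0))(
      (acc, value) => (acc._1 + value, acc._2 + 1),   // fold each value into the partition-local (sum, count)
      (a, b) => (a._1 + b._1, a._2 + b._2))           // merge partial (sum, count) pairs across partitions
    val avgByKey = sumCount.mapValues { case (sum, count) => sum / count }
    avgByKey.collect()   // e.g. Array((2013-10-09,7.6), (2013-10-10,18.8))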

Re: hive-thriftserver maven artifact

2015-04-28 Thread Marco
Thx Ted for the info ! 2015-04-27 23:51 GMT+02:00 Ted Yu yuzhih...@gmail.com: This is available for 1.3.1: http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10 FYI On Mon, Feb 16, 2015 at 7:24 AM, Marco marco@gmail.com wrote: Ok, so will it be only

Re: Best practices on testing Spark jobs

2015-04-28 Thread Silvio Fiorito
Hi Michal, Please try spark-testing-base by Holden. I’ve used it and it works well for unit testing batch and streaming jobs https://github.com/holdenk/spark-testing-base Thanks, Silvio From: Michal Michalski Date: Tuesday, April 28, 2015 at 11:32 AM To: user Subject: Best practices on
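A minimal sketch of what such a test can look like (not from the thread; it assumes ScalaTest plus the library's SharedSparkContext trait, which provides the sc used below):

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    class WordCountSuite extends FunSuite with SharedSparkContext {
      test("reduceByKey counts words") {
        val counts = sc.parallelize(Seq("a", "b", "a"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .collectAsMap()
        assert(counts("a") === 2)
      }
    }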

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread Vadim Bichutskiy
I was wondering about the same thing. Vadim On Tue, Apr 28, 2015 at 10:19 PM, bit1...@163.com bit1...@163.com wrote: Looks to me that the same thing also applies to the SparkContext.textFile or SparkContext.wholeTextFile, there is no way in RDD to figure out the file information where the

RE: HBase HTable constructor hangs

2015-04-28 Thread Tridib Samanta
I am using Spark 1.2.0 and HBase 0.98.1-cdh5.1.0. Here is the jstack trace. Complete stack trace attached. Executor task launch worker-1 #58 daemon prio=5 os_prio=0 tid=0x7fd3d0445000 nid=0x488 waiting on condition [0x7fd4507d9000] java.lang.Thread.State: TIMED_WAITING (sleeping)

How to stream all data out of a Kafka topic once, then terminate job?

2015-04-28 Thread dgoldenberg
Hi, I'm wondering about the use-case where you're not doing continuous, incremental streaming of data out of Kafka but rather want to publish data once with your Producer(s) and consume it once, in your Consumer, then terminate the consumer Spark job. JavaStreamingContext jssc = new

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread Saisai Shao
I think it might be useful in Spark Streaming's file input stream, but I'm not sure it is useful in SparkContext#textFile, since we specify the file ourselves, so why would we still need to know the file name? I will open up a JIRA to mention this feature. Thanks Jerry 2015-04-29 10:49 GMT+08:00

Re: Weird error/exception

2015-04-28 Thread Vadim Bichutskiy
I was having this issue when my batch interval was very big -- like 5 minutes. When my batch interval is smaller, I don't get this exception. Can someone explain to me why this might be happening? Vadim On Tue, Apr 28, 2015 at 4:26 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: I am

External Application Run Status

2015-04-28 Thread Nastooh Avessta (navesta)
Hi In a multi-node setup, I am invoking a number of external apps, through Runtime.getRuntime.exec from an rdd.map function, and would like to track their completion status. Evidently, such calls spawn a separate thread, which is not tracked by the standalone scheduler, i.e., reduce or collect are
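A rough sketch of one way to surface completion status (not from the thread; it assumes rdd is an RDD[String] of arguments and the command path is hypothetical): block on the spawned process inside the task and return its exit code, so that an ordinary action such as collect reports the results:

    val statuses = rdd.map { arg =>
      val proc = Runtime.getRuntime.exec(Array("/path/to/external-app", arg))  // hypothetical command
      val exitCode = proc.waitFor()                                            // 0 usually means success
      (arg, exitCode)
    }
    statuses.collect().foreach(println)   // the driver sees which invocations finished and how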

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread bit1...@163.com
For SparkContext#textFile, if a directory is given as the path parameter, then it will pick up the files in the directory, so the same thing will occur. bit1...@163.com From: Saisai Shao Date: 2015-04-29 10:54 To: Vadim Bichutskiy CC: bit1...@163.com; lokeshkumar; user Subject: Re: Re:

Re: HBase HTable constructor hangs

2015-04-28 Thread Ted Yu
How did you distribute hbase-site.xml to the nodes ? Looks like HConnectionManager couldn't find the hbase:meta server. Cheers On Tue, Apr 28, 2015 at 9:19 PM, Tridib Samanta tridib.sama...@live.com wrote: I am using Spark 1.2.0 and HBase 0.98.1-cdh5.1.0. Here is the jstack trace. Complete

RE: HBase HTable constructor hangs

2015-04-28 Thread Tridib Samanta
I am 100% sure how it's picking up the configuration. I copied the hbase-site.xml in hdfs/spark cluster (single machine). I also included hbase-site.xml in spark-job jar files. spark-job jar file also have yarn-site and mapred-site and core-site.xml in it. One interesting thing is, when I run

How to run self-build spark on EC2?

2015-04-28 Thread Bo Fu
Hi all, I have an issue. I added some timestamps in the Spark source code and built it using: mvn package -DskipTests I checked the new version on my own computer and it works. However, when I ran Spark on EC2, the Spark code the EC2 machines ran was the original version. Anyone know how to deploy

solr in spark

2015-04-28 Thread Jeetendra Gangele
Has anyone tried using Solr inside Spark? Below is the project describing it: https://github.com/LucidWorks/spark-solr. I have a requirement in which I want to index 20 million company names and then search as and when new data comes in. The output should be a list of companies matching the

Re: solr in spark

2015-04-28 Thread Jeetendra Gangele
Thanks for the reply. Will the Elasticsearch index be within my cluster, or do I need to host Elasticsearch separately? On 28 April 2015 at 22:03, Nick Pentreath nick.pentre...@gmail.com wrote: I haven't used Solr for a long time, and haven't used Solr in Spark. However, why do you say

Re: solr in spark

2015-04-28 Thread Nick Pentreath
I haven't used Solr for a long time, and haven't used Solr in Spark. However, why do you say Elasticsearch is not a good option ...? ES absolutely supports full-text search and not just filtering and grouping (in fact its original purpose was and still is text search, though filtering, grouping

Re: solr in spark

2015-04-28 Thread andy petrella
AFAIK Datastax is heavily looking at it. They have a good integration of Cassandra with it. The next step was clearly to have a strong combination of the three in one of the coming releases. Le mar. 28 avr. 2015 18:28, Jeetendra Gangele gangele...@gmail.com a écrit : Does anyone tried using solr

PySpark: slicing issue with dataframes

2015-04-28 Thread Ali Bajwa
Hi experts, Trying to use the slicing functionality in strings as part of a Spark program (PySpark) I get this error: Code import pandas as pd from pyspark.sql import SQLContext hc = SQLContext(sc) A = pd.DataFrame({'Firstname': ['James', 'Ali', 'Daniel'], 'Lastname': ['Jones',

Re: Question regarding join with multiple columns with pyspark

2015-04-28 Thread Ali Bajwa
Thanks again Ayan! To close the loop on this issue, I have filed the below JIRA to track the issue: https://issues.apache.org/jira/browse/SPARK-7197 On Fri, Apr 24, 2015 at 8:21 PM, ayan guha guha.a...@gmail.com wrote: I just tested, your observation in DataFrame API is correct. It behaves

Re: How to deploy self-build spark source code on EC2

2015-04-28 Thread Nicholas Chammas
[-dev] [+user] This is a question for the user list, not the dev list. Use the --spark-version and --spark-git-repo options to specify your own repo and hash to deploy. Source code link. https://github.com/apache/spark/blob/268c419f1586110b90e68f98cd000a782d18828c/ec2/spark_ec2.py#L189-L195

Spark - Timeout Issues - OutOfMemoryError

2015-04-28 Thread ๏̯͡๏
I have a Spark app that completes in 45 mins for 5 files (5*750MB size) and it takes 16 executors to do so. I wanted to run it against 10 files of each input type (10*3 files as there are three inputs that are transformed). [Input1 = 10*750 MB, Input2=10*2.5GB, Input3 = 10*1.5G]. Hence I used

Spark streaming - textFileStream/fileStream - Get file name

2015-04-28 Thread lokeshkumar
Hi Forum, Using spark streaming and listening to the files in HDFS using the textFileStream/fileStream methods, how do we get the file names which are read by these methods? I used textFileStream which has file contents in JavaDStream and I got no success with fileStream as it is throwing me a

Weird error/exception

2015-04-28 Thread Vadim Bichutskiy
I am using Spark Streaming to monitor an S3 bucket. Everything appears to be fine. But every batch interval I get the following: *15/04/28 16:12:36 WARN HttpMethodReleaseInputStream: Attempting to release HttpMethod in finalize() as its response data stream has gone out of scope. This attempt

RE: Scalability of group by

2015-04-28 Thread Ulanov, Alexander
Richard, The same problem occurs with sort. I have enough disk space and tmp folder space. The errors in the logs say out of memory; I wonder what it holds in memory? Alexander From: Richard Marscher [mailto:rmarsc...@localytics.com] Sent: Tuesday, April 28, 2015 7:34 AM To: Ulanov, Alexander Cc:

Re: Spark SQL 1.3.1 saveAsParquetFile will output tachyon file with different block size

2015-04-28 Thread Calvin Jia
Hi, You can apply this patch https://github.com/apache/spark/pull/5354 and recompile. Hope this helps, Calvin On Tue, Apr 28, 2015 at 1:19 PM, sara mustafa eng.sara.must...@gmail.com wrote: Hi Zhang, How did you compile Spark 1.3.1 with Tachyon? When I changed the Tachyon version to 0.6.3 in

Re: Initial tasks in job take time

2015-04-28 Thread Anshul Singhle
yes On 29 Apr 2015 03:31, ayan guha guha.a...@gmail.com wrote: Is your driver running on the same m/c as the master? On 29 Apr 2015 03:59, Anshul Singhle ans...@betaglide.com wrote: Hi, I'm running short spark jobs on rdds cached in memory. I'm also using a long running job context. I want to

rdd.count with 100 elements taking 1 second to run

2015-04-28 Thread Anshul Singhle
Hi, I'm running the following code in my cluster (standalone mode) via spark shell - val rdd = sc.parallelize(1 to 100) rdd.count This takes around 1.2s to run. Is this expected or am I configuring something wrong? I'm using about 30 cores with 512MB executor memory. As expected, GC time is

Re: Initial tasks in job take time

2015-04-28 Thread ayan guha
Is your driver running on the same m/c as the master? On 29 Apr 2015 03:59, Anshul Singhle ans...@betaglide.com wrote: Hi, I'm running short spark jobs on rdds cached in memory. I'm also using a long running job context. I want to be able to complete my jobs (on the cached rdd) in under 1 sec.

Metric collection

2015-04-28 Thread Giovanni Paolo Gibilisco
Hi, I would like to collect some metrics from Spark and plot them with Graphite. I managed to do that with the metrics provided by org.apache.spark.metrics.source.JvmSource, but I would like to know if there are other sources available besides this one. Best, Giovanni

Spark Sql: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-04-28 Thread LinQili
Hi all. I was launching a Spark SQL job on my own machine, not on the Spark cluster machines, and it failed. The exception info is: 15/04/28 16:28:04 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.RuntimeException: Unable to

[Spark SQL] Problems creating a table in specified schema/database

2015-04-28 Thread James Aley
Hey all, I'm trying to create tables from existing Parquet data in different schemata. The following isn't working for me: CREATE DATABASE foo; CREATE TABLE foo.bar USING com.databricks.spark.avro OPTIONS (path '...'); -- Error: org.apache.spark.sql.AnalysisException: cannot recognize input

Re: New JIRA - [SQL] Can't remove columns from DataFrame or save DataFrame from a join due to duplicate columns

2015-04-28 Thread ayan guha
The alias function is not in Python yet. I suggest writing SQL if your data suits it. On 28 Apr 2015 14:42, Don Drake dondr...@gmail.com wrote: https://issues.apache.org/jira/browse/SPARK-7182 Can anyone suggest a workaround for the above issue? Thanks. -Don -- Donald Drake Drake Consulting
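A rough sketch of the suggested SQL route (written here in Scala for concreteness; table and column names are made up), where qualifying columns with the table alias avoids the duplicate-column problem:

    df1.registerTempTable("t1")
    df2.registerTempTable("t2")
    val joined = sqlContext.sql(
      "SELECT t1.id, t1.name, t2.value FROM t1 JOIN t2 ON t1.id = t2.id")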

Re: Serialization error

2015-04-28 Thread ๏̯͡๏
val conf = new SparkConf() .setAppName(detail) .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get) .set("spark.kryoserializer.buffer.max.mb", arguments.get("maxbuffersize").get)

Re: solr in spark

2015-04-28 Thread Nick Pentreath
Depends on your use case and search volume. Typically you'd have a dedicated ES cluster if your app is doing a lot of real time indexing and search. If it's only for spark integration then you could colocate ES and spark — Sent from Mailbox On Tue, Apr 28, 2015 at 6:41 PM, Jeetendra

Re: JAVA_HOME problem

2015-04-28 Thread Marcelo Vanzin
Are you using a Spark build that matches your YARN cluster version? That seems like it could happen if you're using a Spark built against a newer version of YARN than you're running. On Thu, Apr 2, 2015 at 12:53 AM, 董帅阳 917361...@qq.com wrote: spark 1.3.0 spark@pc-zjqdyyn1:~ tail

Initial tasks in job take time

2015-04-28 Thread Anshul Singhle
Hi, I'm running short spark jobs on rdds cached in memory. I'm also using a long running job context. I want to be able to complete my jobs (on the cached rdd) in under 1 sec. I'm getting the following job times with about 15 GB of data distributed across 6 nodes. Each executor has about 20GB of

default number of reducers

2015-04-28 Thread Shushant Arora
In a normal MR job, can I configure (cluster-wide) the default number of reducers, if I don't specify any reducers in my job?

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Silvio Fiorito
The initializer is a tuple (0, 0); it seems you just have 0. From: subscripti...@prismalytics.io Organization: PRISMALYTICS, LLC. Reply-To: subscripti...@prismalytics.io Date: Tuesday, April 28, 2015 at 1:28 PM To: Silvio

Code For Loading Graph from Edge Tuple File

2015-04-28 Thread geek2
Hi Everyone, Does anyone have example code for generating a graph from a file of edge-name / edge-name tuples? I've seen the example where a Graph is generated from an RDD of triplets composed of edge longs, but I'd like to see an example where a graph is built from an edge-name / edge-name file such

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread subscripti...@prismalytics.io
Thank you Todd, Silvio... I had to stare at Silvio's answer for a while. If I'm interpreting the aggregateByKey() statement correctly ... (Within-Partition Reduction Step) a: is a TUPLE that holds (runningSum, runningCount). b: is a SCALAR that holds the next Value

How to run customized Spark on EC2?

2015-04-28 Thread Bo Fu
Hi experts, I have an issue. I added some timestamps in the Spark source code and built it using: mvn package -DskipTests I checked the new version on my own computer and it works. However, when I ran Spark on EC2, the Spark code the EC2 machines ran was the original version. Anyone knows how to

How to setup this false streaming problem

2015-04-28 Thread Toni Cebrián
Hi, I'm just new to Spark and in need of some help framing the problem I have. A problem well stated is half solved, as the saying goes :) Let's say that I have a DStream[String] basically containing JSON of some measurements from IoT devices. In order to keep it simple, say that after

MLLib SVMWithSGD is failing for large dataset

2015-04-28 Thread sarath
I am trying to train a large dataset consisting of 8 million data points and 20 million features using SVMWithSGD. But it is failing after running for some time. I tried increasing num-partitions, driver-memory, executor-memory, driver-max-resultSize. I also tried reducing the size of the dataset

Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-28 Thread lokeshkumar
Hi Forum, I am facing the below compile error when using the fileStream method of the JavaStreamingContext class. I copied the code from the JavaAPISuite.java test class of the Spark test code. The error message is

Re: Spark Streaming: JavaDStream compute method NPE

2015-04-28 Thread Himanshu Mehra
Hi Puneith, Please provide the code if you may. It will be helpful. Thank you, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-JavaDStream-compute-method-NPE-tp22676p22684.html Sent from the Apache Spark User List mailing list archive at

Re: JAVA_HOME problem

2015-04-28 Thread sourabh chaki
I was able to solve this problem by hard-coding JAVA_HOME inside the org.apache.spark.deploy.yarn.Client.scala class: val commands = prefixEnv ++ Seq(-- YarnSparkHadoopUtil.expandEnvironment(Environment.JAVA_HOME) + /bin/java, -server ++ /usr/java/jdk1.7.0_51/bin/java, -server) Somehow

Re: Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-28 Thread Akhil Das
How about: JavaPairDStream<LongWritable, Text> input = jssc.fileStream(inputDirectory, LongWritable.class, Text.class, TextInputFormat.class); See the complete example over here

Re: How to debug Spark on Yarn?

2015-04-28 Thread Steve Loughran
On 27 Apr 2015, at 07:51, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Spark 1.3 1. View stderr/stdout from executor from Web UI: when the job is running I figured out the executor that I am supposed to see, and those two links show 4 special characters in the browser. 2.

Re: java.lang.UnsupportedOperationException: empty collection

2015-04-28 Thread Robineast
I've tried running your code through spark-shell on both 1.3.0 (pre-built for Hadoop 2.4 and above) and a recently built snapshot of master. Both work fine. Running on OS X yosemite. What's your configuration? -- View this message in context:

Re: Serialization error

2015-04-28 Thread madhvi
On Tuesday 28 April 2015 01:39 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: val conf = new SparkConf() .setAppName(detail) .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)

Re: Understanding Spark's caching

2015-04-28 Thread Akhil Das
Option B would be fine, as the answer on SO itself says: Since RDD transformations merely build DAG descriptions without execution, in Option A by the time you call unpersist, you still only have job descriptions and not a running execution. Also note, in Option A, you are not specifying any
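A minimal sketch of the ordering being discussed (variable names are made up): an action has to run between cache() and unpersist() for the cached data to ever be materialized and reused:

    val cached = input.map(expensiveTransform).cache()  // cache() only marks the RDD; nothing runs yet
    val n = cached.count()                              // first action materializes and populates the cache
    val sample = cached.take(5)                         // later actions reuse the cached partitions
    cached.unpersist()                                  // only drop it once the dependent jobs have run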

How to add jars to standalone pyspark program

2015-04-28 Thread mj
Hi, I'm trying to figure out how to use a third party jar inside a python program which I'm running via PyCharm in order to debug it. I am normally able to run spark code in python such as this: spark_conf = SparkConf().setMaster('local').setAppName('test') sc =