Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
I'm using 1.0.4 Thanks, -- Pei-Lun On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian lian.cs@gmail.com wrote: Hm, which version of Hadoop are you using? Actually there should also be a _metadata file together with _common_metadata. I was using Hadoop 2.4.1 btw. I'm not sure whether Hadoop

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-27 Thread ๏̯͡๏
Ok. I modified as per your suggestions export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4 export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.3.0-hadoop2.4.0.jar export HADOOP_CONF_DIR=/apache/hadoop/conf cd $SPARK_HOME ./bin/spark-sql -v --driver-class-path

Re: Spark SQL configurations

2015-03-27 Thread Akhil Das
If you can share the stack trace, we can give you proper guidance. For running on YARN, everything is described here: https://spark.apache.org/docs/latest/running-on-yarn.html Thanks Best Regards On Fri, Mar 27, 2015 at 8:21 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Hello, Can

Add partition support in saveAsParquet

2015-03-27 Thread Jianshi Huang
Hi, Does anyone have a similar request? https://issues.apache.org/jira/browse/SPARK-6561 When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: def saveAsParquet(path: String, partitionColumns: Seq[String]) -- Jianshi Huang LinkedIn:
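A sketch of how the proposed call might look (hypothetical API from the JIRA proposal, not available in Spark 1.3; the path and column names are made up):

    // Hypothetical: partition Parquet output by year and month.
    // This method is only proposed in SPARK-6561 and does not exist yet.
    df.saveAsParquet("/data/events", partitionColumns = Seq("year", "month"))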

Re: FetchFailedException during shuffle

2015-03-27 Thread Akhil Das
What operation are you doing? I'm assuming you have enabled RDD compression and have an empty stream that it tries to decompress (as seen from the exceptions). Thanks Best Regards On Fri, Mar 27, 2015 at 7:15 AM, Chen Song chen.song...@gmail.com wrote: Using spark 1.3.0 on cdh5.1.0,

Re: Serialization Problem in Spark Program

2015-03-27 Thread Akhil Das
Awesome. Thanks Best Regards On Fri, Mar 27, 2015 at 7:26 AM, donhoff_h 165612...@qq.com wrote: Hi, Akhil Yes, that's where the problem lies. Thanks very much for pointing out my mistake. -- Original -- *From: * Akhil Das;ak...@sigmoidanalytics.com; *Send time:*

Re: RDD Exception Handling

2015-03-27 Thread Akhil Das
Like this?

    val krdd = testrdd.map { x =>
      try {
        val key = x.split(sep1)(0)
        (key, x)
      } catch {
        case e: Exception =>
          println("Exception!! = " + e + " |||KS1 " + x)
          (null, x)
      }
    }

Thanks Best Regards On Thu,

Can spark sql read existing tables created in hive

2015-03-27 Thread ๏̯͡๏
I have a few tables that are created in Hive. I want to transform data stored in these Hive tables using Spark SQL. Is this even possible? So far I have seen that I can create new tables using the Spark SQL dialect. However, when I run show tables or do desc hive_table it says table not found. I am now

Re: saveAsTable with path not working as expected (pyspark + Scala)

2015-03-27 Thread Tom Walwyn
We can set a path; refer to the unit tests. For example: df.saveAsTable("savedJsonTable", "org.apache.spark.sql.json", "append", path=tmpPath) https://github.com/apache/spark/blob/master/python/pyspark/sql/tests.py Investigating some more, I found that the table is being created at the specified

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
https://issues.apache.org/jira/browse/SPARK-6570 I also left in the call to saveAsParquetFile(), as it produced a similar exception (though there was no use of explode there). On Fri, Mar 27, 2015 at 7:20 AM, Cheng Lian lian.cs@gmail.com wrote: This should be a bug in the Explode.eval(),

Spark streaming

2015-03-27 Thread jamborta
Hi all, We have a workflow that pulls in data from csv files. The original setup of the workflow was to parse the data as it comes in (turn it into an array), then store it. This resulted in out-of-memory errors with larger files (as a result of increased GC?). It turns out if the data gets

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Cheng Lian
This should be a bug in Explode.eval(), which always assumes the underlying SQL array is represented by a Scala Seq. Would you mind opening a JIRA ticket for this? Thanks! Cheng On 3/27/15 7:00 PM, Jon Chase wrote: Spark 1.3.0 Two issues: a) I'm unable to get a lateral view explode

Re: Can spark sql read existing tables created in hive

2015-03-27 Thread Arush Kharbanda
It seems Spark SQL accesses some columns beyond those created by Hive. You can always recreate the tables; you would need to execute the table creation scripts, but it would be good to avoid recreation. On Fri, Mar 27, 2015 at 3:20 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I did copy

Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
Spark 1.3.0 Two issues: a) I'm unable to get a lateral view explode query to work on an array type b) I'm unable to save an array type to a Parquet file I keep running into this: java.lang.ClassCastException: [I cannot be cast to scala.collection.Seq Here's a stack trace from the

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Cheng Lian
Thanks for the information. Verified that the _common_metadata and _metadata files are missing in this case when using Hadoop 1.0.4. Would you mind opening a JIRA for this? Cheng On 3/27/15 2:40 PM, Pei-Lun Lee wrote: I'm using 1.0.4 Thanks, -- Pei-Lun On Fri, Mar 27, 2015 at 2:32 PM, Cheng

Decrease In Performance due to Auto Increase of Partitions in Spark

2015-03-27 Thread sayantini
In our application we load our historical data into 40 partitioned RDDs (no. of available cores x 2), and we have not implemented any custom partitioner. After applying transformations on these RDDs, intermediate RDDs are created which have more than 40 partitions, and sometimes partitions

Re: Can spark sql read existing tables created in hive

2015-03-27 Thread ๏̯͡๏
I did copy hive-site.xml from the Hive installation into spark-home/conf. It does have all the metastore connection details: host, username, password, driver and others. Snippet:

    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Ondrej Smola
It happens only when StorageLevel is used with replication (StorageLevel.MEMORY_ONLY_2, StorageLevel.MEMORY_AND_DISK_2); StorageLevel.MEMORY_ONLY and StorageLevel.MEMORY_AND_DISK work. The problem must lie somewhere in the Spark-Mesos integration. From the console I see that Spark is trying to replicate

failed to launch workers on spark

2015-03-27 Thread mas
Hi all! I am trying to install Spark on my standalone machine. I am able to run the master, but when I try to run the slaves it gives me the following error. Any help in this regard will be highly appreciated. localhost: failed to launch

saveAsTable with path not working as expected (pyspark + Scala)

2015-03-27 Thread Tom Walwyn
Hi, The behaviour is the same for me in Scala and Python, so posting here in Python. When I use DataFrame.saveAsTable with the path option, I expect an external Hive table to be created at the specified path. Specifically, when I call: df.saveAsTable(..., path="/tmp/test") I expect an external
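For comparison, a minimal Scala sketch of the same expectation against the Spark 1.3 API, assuming a DataFrame named df (the source and mode values here are my assumptions):

    import org.apache.spark.sql.SaveMode

    // Expecting an external table whose data lives at /tmp/test.
    df.saveAsTable("test", "parquet", SaveMode.ErrorIfExists,
      Map("path" -> "/tmp/test"))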

Error in Delete Table

2015-03-27 Thread Masf
Hi. In HiveContext, when I run the statement DROP TABLE IF EXISTS TestTable and TestTable doesn't exist, Spark returns an error: ERROR Hive: NoSuchObjectException(message:default.TestTable table not found) at
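Until the behaviour is clarified, a minimal workaround sketch is to swallow the error on the driver (assuming an existing HiveContext named hiveContext):

    import scala.util.Try

    // DROP ... IF EXISTS should be a no-op on a missing table, so ignore
    // a spurious NoSuchObjectException if one bubbles up.
    Try(hiveContext.sql("DROP TABLE IF EXISTS TestTable"))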

Re: saveAsTable with path not working as expected (pyspark + Scala)

2015-03-27 Thread Tom Walwyn
Another follow-up: saveAsTable works as expected when running on a Hadoop cluster with Hive installed. It's just locally that I'm getting this strange behaviour. Any ideas why this is happening? Kind Regards, Tom On 27 March 2015 at 11:29, Tom Walwyn twal...@gmail.com wrote: We can set a path,

Re: failed to launch workers on spark

2015-03-27 Thread Noorul Islam K M
mas mas.ha...@gmail.com writes: Hi all! I am trying to install Spark on my standalone machine. I am able to run the master, but when I try to run the slaves it gives me the following error. Any help in this regard will be highly appreciated.

Re: Column not found in schema when querying partitioned table

2015-03-27 Thread ๏̯͡๏
Hello Jon, Are you able to connect to an existing Hive and read tables created in Hive? Regards, deepak On Thu, Mar 26, 2015 at 4:16 PM, Jon Chase jon.ch...@gmail.com wrote: I've filed this as https://issues.apache.org/jira/browse/SPARK-6554 On Thu, Mar 26, 2015 at 6:29 AM, Jon Chase

Re: Error while querying hive table from spark shell

2015-03-27 Thread ๏̯͡๏
Did you resolve this? I am facing the same error. On Wed, Feb 11, 2015 at 1:02 PM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Seems that the HDFS path for the table doesn't contain any file/data. Does the metastore contain the right path for the HDFS data? You can find the HDFS path in

Re: Can spark sql read existing tables created in hive

2015-03-27 Thread Arush Kharbanda
Since Hive and Spark SQL internally use HDFS and the Hive metastore, the only thing you want to change is the processing engine. You can try to bring your hive-site.xml to %SPARK_HOME%/conf/hive-site.xml. (Ensure that the hive-site.xml captures the metastore connection details.) It's a hack, I haven't
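Assuming hive-site.xml is in place, a minimal sketch of querying an existing Hive table from Spark 1.3 (the table name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-read"))
    val hiveContext = new HiveContext(sc)

    hiveContext.sql("SHOW TABLES").collect().foreach(println)
    hiveContext.sql("SELECT * FROM hive_table LIMIT 10").show()  // placeholder table name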

Re: Parallel actions from driver

2015-03-27 Thread Harut Martirosyan
This is exactly my case also, it worked, thanks Sean. On 26 March 2015 at 23:35, Sean Owen so...@cloudera.com wrote: You can do this much more simply, I think, with Scala's parallel collections (try .par). There's nothing wrong with doing this, no. Here, something is getting caught in your
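A minimal sketch of the .par approach Sean suggests, assuming a live SparkContext named sc (the RDDs are toy placeholders):

    // Each action (count) is an independent Spark job; iterating a parallel
    // collection on the driver submits the jobs concurrently.
    val rdd1 = sc.parallelize(1 to 1000)
    val rdd2 = sc.parallelize(1 to 2000)
    Seq(rdd1, rdd2).par.foreach(rdd => println(rdd.count()))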

Spark SQL and DataSources API roadmap

2015-03-27 Thread Ashish Mukherjee
Hello, Is there any published community roadmap for SparkSQL and the DataSources API? Regards, Ashish

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Cheng Lian
Forgot to mention that, would you mind also providing the full stack trace of the exception thrown in the saveAsParquetFile call? Thanks! Cheng On 3/27/15 7:35 PM, Jon Chase wrote: https://issues.apache.org/jira/browse/SPARK-6570 I also left in the call to saveAsParquetFile(), as it

saving schemaRDD to cassandra

2015-03-27 Thread Hafiz Mujadid
Hi experts! I would like to know: is there any way to store a SchemaRDD to Cassandra? If yes, then how do I store it in an existing Cassandra column family, and in a new column family? Thanks -- View this message in context:
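One option (an assumption on my part, not confirmed in this thread) is the DataStax spark-cassandra-connector, which adds saveToCassandra to RDDs; a sketch writing into an existing column family (keyspace, table and column names are placeholders, and schemaRDD is assumed to exist):

    import com.datastax.spark.connector._

    // Map each Row to a tuple matching the target columns, then save.
    val rows = schemaRDD.map(r => (r.getString(0), r.getInt(1)))
    rows.saveToCassandra("my_keyspace", "my_table", SomeColumns("name", "value"))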

Re: Can spark sql read existing tables created in hive

2015-03-27 Thread ๏̯͡๏
I can recreate tables, but what about data? It looks like this is an obvious feature that Spark SQL must have. People will want to transform tons of data stored in HDFS through Hive from Spark SQL. The Spark programming guide suggests it's possible: Spark SQL also supports reading and writing

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Ondrej Smola
More info: when using *spark.mesos.coarse* everything works as expected. I think this must be a bug in the Spark-Mesos integration. 2015-03-27 9:23 GMT+01:00 Ondrej Smola ondrej.sm...@gmail.com: It happens only when StorageLevel is used with replication (StorageLevel.

Checking Data Integrity in Spark

2015-03-27 Thread Sathish Kumaran Vairavelu
Hello, I want to check if there is any way to verify the data integrity of data files. The use case is to perform data integrity checks on large files with 100+ columns and reject records (writing them to another file) that do not meet criteria (such as NOT NULL, date format, etc.). Since there are a lot of

Re: Decrease In Performance due to Auto Increase of Partitions in Spark

2015-03-27 Thread Akhil Das
Each RDD is composed of multiple blocks known as partitions; when you apply a transformation over it, it can grow in size depending on the operation (as the number of objects/references increases), and that is probably the reason why you are seeing an increased number of partitions. I don't think increased

Re: Hive Table not from from Spark SQL

2015-03-27 Thread ๏̯͡๏
I tried the following 1) ./bin/spark-submit -v --master yarn-cluster --driver-class-path

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Jon Chase
Done. I also updated the name on the ticket to include both issues. Spark SQL arrays: explode() fails and cannot save array type to Parquet https://issues.apache.org/jira/browse/SPARK-6570 On Fri, Mar 27, 2015 at 8:14 AM, Cheng Lian lian.cs@gmail.com wrote: Forgot to mention that, would

Re: Checking Data Integrity in Spark

2015-03-27 Thread Arush Kharbanda
It's not possible to configure Spark to do checks based on XMLs. You would need to write jobs to do the validations you need. On Fri, Mar 27, 2015 at 5:13 PM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: Hello, I want to check if there is any way to verify the data integrity of
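A sketch of such a validation job, assuming a SparkContext named sc; the column layout and the rules (NOT NULL on field 0, a yyyy-MM-dd date in field 1) are assumptions:

    import java.text.SimpleDateFormat
    import scala.util.Try

    val lines = sc.textFile("/data/input.csv")  // placeholder path

    // A record passes if field 0 is non-empty and field 1 parses as a date.
    def valid(fields: Array[String]): Boolean =
      fields.length >= 2 && fields(0).nonEmpty &&
        Try(new SimpleDateFormat("yyyy-MM-dd").parse(fields(1))).isSuccess

    val parsed = lines.map(l => (l, l.split(",")))
    parsed.filter { case (_, f) => valid(f) }.keys.saveAsTextFile("/data/accepted")
    parsed.filter { case (_, f) => !valid(f) }.keys.saveAsTextFile("/data/rejected")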

Re: Spark streaming

2015-03-27 Thread DW @ Gmail
Show us the code. This shouldn't happen for the simple process you described. Sent from my rotary phone. On Mar 27, 2015, at 5:47 AM, jamborta jambo...@gmail.com wrote: Hi all, We have a workflow that pulls in data from csv files. The original setup of the workflow was to parse

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Cheng Lian
Thanks for the detailed information! On 3/27/15 9:16 PM, Jon Chase wrote: Done. I also updated the name on the ticket to include both issues. Spark SQL arrays: explode() fails and cannot save array type to Parquet https://issues.apache.org/jira/browse/SPARK-6570 On Fri, Mar 27, 2015 at

Re: Spark streaming

2015-03-27 Thread Ted Yu
jamborta : Please also describe the format of your csv files. Cheers On Fri, Mar 27, 2015 at 6:42 AM, DW @ Gmail deanwamp...@gmail.com wrote: Show us the code. This shouldn't happen for the simple process you described Sent from my rotary phone. On Mar 27, 2015, at 5:47 AM, jamborta

RE: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-27 Thread Bozeman, Christopher
Ankur, The JavaKinesisWordCountASLYARN example is no longer valid; it was added to the EMR build back in 1.1.0 just to demonstrate Spark Streaming with Kinesis on YARN. Just follow the stock example as seen in JavaKinesisWordCountASL; it is better form anyway, given it is best not to hard-code the

Re: Python Example sql.py not working in version spark-1.3.0-bin-hadoop2.4

2015-03-27 Thread Davies Liu
This will be fixed in https://github.com/apache/spark/pull/5230/files On Fri, Mar 27, 2015 at 9:13 AM, Peter Mac peter.machar...@noaa.gov wrote: I downloaded spark version spark-1.3.0-bin-hadoop2.4. When the python version of sql.py is run the following error occurs: [root@nde-dev8-template

JettyUtils.createServletHandler Method not Found?

2015-03-27 Thread kmader
I have a very strange error in Spark 1.3 where at runtime, in the org.apache.spark.ui.JettyUtils object, the method createServletHandler is not found: Exception in thread "main" java.lang.NoSuchMethodError:

How to avoid the repartitioning in graph construction

2015-03-27 Thread Yifan LI
Hi, Now I have 10 edge data files in my HDFS directory, e.g. edges_part00, edges_part01, …, edges_part09, format: srcId tarId. (They make a good partitioning of the whole graph, so I never expect any change (re-partitioning operations) on them during graph building.) I am thinking of how to

[Dataframe] Problem with insertIntoJDBC and existing database

2015-03-27 Thread Pierre Bailly-Ferry
Hello, I'm trying to develop with the new DataFrame API, but I'm running into an error. I have an existing MySQL database and I want to insert rows. I create a DataFrame from an RDD, then use the insertIntoJDBC function. It appears that DataFrames reorder the data inside them. As a result, I
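One workaround sketch while this is investigated: pin the column order with an explicit select before inserting, assuming a DataFrame named df (URL, table and column names are placeholders):

    // Force the DataFrame's columns into the MySQL table's order.
    val ordered = df.select("id", "name", "created_at")
    ordered.insertIntoJDBC("jdbc:mysql://host:3306/db?user=u&password=p",
      "my_table", false)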

Re: WordCount example

2015-03-27 Thread Mohit Anchlia
I checked the ports using netstat and don't see any connections established on that port. Logs show only this: 15/03/27 13:50:48 INFO Master: Registering app NetworkWordCount 15/03/27 13:50:48 INFO Master: Registered app NetworkWordCount with ID app-20150327135048-0002 Spark ui shows: Running

Re: JettyUtils.createServletHandler Method not Found?

2015-03-27 Thread Ted Yu
JettyUtils is marked with: private[spark] object JettyUtils extends Logging { FYI On Fri, Mar 27, 2015 at 9:50 AM, kmader kevin.ma...@gmail.com wrote: I have a very strange error in Spark 1.3 where at runtime in the org.apache.spark.ui.JettyUtils object the method createServletHandler is not

Could not compute split, block not found in Spark Streaming Simple Application

2015-03-27 Thread Saiph Kappa
Hi, I am just running this simple example with machineA: 1 master + 1 worker, machineB: 1 worker:

    val ssc = new StreamingContext(sparkConf, Duration(1000))
    val rawStreams = (1 to numStreams).map(_ =>
      ssc.rawSocketStream[String](host, port, StorageLevel.MEMORY_ONLY_SER)).toArray
    val

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-03-27 Thread Tathagata Das
If it is deterministically reproducible, could you generate full DEBUG-level logs from the driver and the workers and give them to me? Basically I want to trace through what is happening to the block that is not being found. And can you tell me which cluster manager you are using? Spark Standalone,

RDD collect hangs on large input data

2015-03-27 Thread Zsolt Tóth
Hi, I have a simple Spark application: it creates an input RDD with sc.textFile, and it calls flatMapToPair, reduceByKey and map on it. The output RDD is small, a few MBs. Then I call collect() on the output. If the text file is ~50GB, it finishes in a few minutes. However, if it's larger

Re: Combining Many RDDs

2015-03-27 Thread Yang Chen
Hi Kelvin, Thank you. That works for me. I wrote my own joins that produced Scala collections, instead of using rdd.join. Regards, Yang On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu 2dot7kel...@gmail.com wrote: Hi, I used union() before and yes it may be slow sometimes. I _guess_ your variable

Re: Spark streaming

2015-03-27 Thread Tamas Jambor
It is just a comma-separated file, about 10 columns wide, which we append with a unique id and a few additional values. On Fri, Mar 27, 2015 at 2:43 PM, Ted Yu yuzhih...@gmail.com wrote: jamborta : Please also describe the format of your csv files. Cheers On Fri, Mar 27, 2015 at 6:42 AM, DW

Re: Hive Table not from from Spark SQL

2015-03-27 Thread Denny Lee
Upon reviewing your other thread, could you confirm that the Hive metastore you can connect to via Hive is a MySQL database? And to also confirm: when you're running spark-shell and doing a show tables statement, you're getting the same error? On Fri, Mar 27, 2015 at 6:08 AM ÐΞ€ρ@Ҝ (๏̯͡๏)

Python Example sql.py not working in version spark-1.3.0-bin-hadoop2.4

2015-03-27 Thread Peter Mac
I downloaded spark version spark-1.3.0-bin-hadoop2.4. When the python version of sql.py is run the following error occurs: [root@nde-dev8-template python]# /root/spark-1.3.0-bin-hadoop2.4/bin/spark-submit sql.py Spark assembly has been built with Hive, including Datanucleus jars on classpath

spark streaming driver hang

2015-03-27 Thread Chen Song
I ran a spark streaming job. 100 executors 30G heap per executor 4 cores per executor The version I used is 1.3.0-cdh5.1.0. The job reads from a directory on HDFS (with files incoming continuously) and does some joins on the data. I set the batch interval to 15 minutes and the job worked

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Jörn Franke
Hello, Well, all problems you want to solve with technology need a good justification for a given technology. So the first thing to ask is which technology fits my current and future problems. This is also what the article says. Unfortunately, it only provides a vague answer

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Ondrej Smola
Yes, only when using fine grained mode and replication (StorageLevel.MEMORY_ONLY_2 etc). 2015-03-27 19:06 GMT+01:00 Tathagata Das t...@databricks.com: Does it fail with just Spark jobs (using storage levels) on non-coarse mode? TD On Fri, Mar 27, 2015 at 4:39 AM, Ondrej Smola

Re: spark streaming driver hang

2015-03-27 Thread Tathagata Das
Do you have the logs of the driver? Does that give any exceptions? TD On Fri, Mar 27, 2015 at 12:24 PM, Chen Song chen.song...@gmail.com wrote: I ran a spark streaming job. 100 executors 30G heap per executor 4 cores per executor The version I used is 1.3.0-cdh5.1.0. The job is reading

Re: Can spark sql read existing tables created in hive

2015-03-27 Thread Michael Armbrust
Are you running on yarn? - If you are running in yarn-client mode, set HADOOP_CONF_DIR to /etc/hive/conf/ (or the directory where your hive-site.xml is located). - If you are running in yarn-cluster mode, the easiest thing to do is to add --files=/etc/hive/conf/hive-site.xml (or the path for

RDD resiliency -- does it keep state?

2015-03-27 Thread Michal Klos
Hi Spark group, We haven't been able to find clear descriptions of how Spark handles the resiliency of RDDs in relation to executing actions with side effects. If you do an `rdd.foreach(someSideEffect)`, then you are performing a side effect for each element in the RDD. If a partition goes down --

Re: RDD resiliency -- does it keep state?

2015-03-27 Thread Patrick Wendell
If you invoke this, you will get at-least-once semantics on failure. For instance, if a machine dies in the middle of executing the foreach for a single partition, that will be re-executed on another machine. It could even fully complete on one machine, but the machine dies immediately before
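Given those at-least-once semantics, side effects should be made idempotent; a sketch that keys each write so a replayed partition overwrites rather than duplicates (sc is assumed, and the sink is a stand-in for a real external store):

    // zipWithUniqueId assigns each element a stable id for a deterministic RDD,
    // so a re-executed partition writes the same keys again instead of appending.
    val rdd = sc.parallelize(1 to 100)
    rdd.zipWithUniqueId().foreachPartition { iter =>
      // Stand-in sink; in practice a DB/queue client created per partition.
      val sink = new java.util.concurrent.ConcurrentHashMap[Long, Int]()
      iter.foreach { case (value, id) => sink.put(id, value) }
    }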

Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Joseph Bradley
Can you try specifying the number of partitions when you load the data to equal the number of executors? If your ETL changes the number of partitions, you can also repartition before calling KMeans. On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I have a large
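A sketch of that suggestion, assuming the data is loaded from text and the executor count is known (all paths and values are placeholders):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val numExecutors = 16  // assumed cluster size
    val data = sc.textFile("/data/points.txt", numExecutors)  // partitions set at load time
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

    // If intermediate ETL changed the partitioning, restore it before training.
    val training =
      if (data.partitions.length < numExecutors) data.repartition(numExecutors) else data
    val model = KMeans.train(training.cache(), 10, 20)  // k = 10, 20 iterations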

Re: Can't access file in spark, but can in hadoop

2015-03-27 Thread Johnson, Dale
Yes, I could recompile the HDFS client with more logging, but I don't have the day or two to spare right this week. One more thing about this: the cluster is Hortonworks 2.1.3 [.0]. They seem to claim support for Spark on Hortonworks 2.2. Dale. From: Ted Yu

Re: Using ORC input for mllib algorithms

2015-03-27 Thread Xiangrui Meng
This is a PR in review to support ORC via the SQL data source API: https://github.com/apache/spark/pull/3753. You can try pulling that PR and help test it. -Xiangrui On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi, I use sc.hadoopFile(directory,

Re: Spark ML Pipeline inaccessible types

2015-03-27 Thread Xiangrui Meng
Hi Martin, Could you attach the code snippet and the stack trace? The default implementation of some methods uses reflection, which may be the cause. Best, Xiangrui On Wed, Mar 25, 2015 at 3:18 PM, zapletal-mar...@email.cz wrote: Thanks Peter, I ended up doing something similar. I however

Re: Implicit matrix factorization returning different results between spark 1.2.0 and 1.3.0

2015-03-27 Thread Xiangrui Meng
This sounds like a bug ... Did you try a different lambda? It would be great if you can share your dataset or reproduce this issue on a public dataset. Thanks! -Xiangrui On Thu, Mar 26, 2015 at 7:56 AM, Ravi Mody rmody...@gmail.com wrote: After upgrading to 1.3.0, ALS.trainImplicit() has been

Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Eran Medan
Remember that article that went viral on HN? (Where a guy showed how GraphX / Giraph / GraphLab / Spark have worse performance on a 128-node cluster than on a single-threaded machine? If not, here is the article - http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html) Well, as you may

Spark 1.3 Source - Github and source tar does not seem to match

2015-03-27 Thread Manoj Samel
While looking into an issue, I noticed that the source displayed on the GitHub site does not match the downloaded tar for 1.3. Thoughts?

Re: Spark 1.3 Source - Github and source tar does not seem to match

2015-03-27 Thread Patrick Wendell
The source code should match the Spark commit 4aaf48d46d13129f0f9bdafd771dd80fe568a7dc. Do you see any differences? On Fri, Mar 27, 2015 at 11:28 AM, Manoj Samel manojsamelt...@gmail.com wrote: While looking into a issue, I noticed that the source displayed on Github site does not matches the

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Tathagata Das
Does it fail with just Spark jobs (using storage levels) on non-coarse mode? TD On Fri, Mar 27, 2015 at 4:39 AM, Ondrej Smola ondrej.sm...@gmail.com wrote: More info when using *spark.mesos.coarse* everything works as expected. I think this must be a bug in spark-mesos integration.

Re: Spark ML Pipeline inaccessible types

2015-03-27 Thread Joseph Bradley
Hi Martin, In the short term: Would you be able to work with a different type other than Vector? If so, then you can override the *Predictor* class's *protected def featuresDataType: DataType* with a DataFrame type which fits your purpose. If you need Vector, then you might have to do a hack

[spark-sql] What is the right way to represent an “Any” type in Spark SQL?

2015-03-27 Thread Eran Medan
Hi everyone, I had a lot of questions today, sorry if I'm spamming the list, but I thought it's better than posting all questions in one thread. Let me know if I should throttle my posts ;) Here is my question: When I try to have a case class that has Any in it (e.g. I have a property map and
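One common workaround (my assumption, not from this thread) is to keep the schema concrete by encoding mixed-typed values as strings, assuming an existing SQLContext named sqlContext:

    // Spark SQL needs a concrete type per field, so Map[String, Any] can be
    // flattened to Map[String, String] (or split into separate typed columns).
    case class Record(id: Int, props: Map[String, String])
    val df = sqlContext.createDataFrame(Seq(Record(1, Map("a" -> "1", "b" -> "x"))))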

Understanding Spark Memory distribution

2015-03-27 Thread Ankur Srivastava
Hi All, I am running a spark cluster on EC2 instances of type: m3.2xlarge. I have given 26gb of memory with all 8 cores to my executors. I can see that in the logs too: *15/03/27 21:31:06 INFO AppClient$ClientActor: Executor added: app-20150327213106-/0 on

Re: Strange JavaDeserialization error - java.lang.ClassNotFoundException: org/apache/spark/storage/StorageLevel

2015-03-27 Thread Tathagata Das
Seems like a bug, could you file a JIRA? @Tim: Patrick said you take a look at Mesos-related issues. Could you take a look at this? Thanks! TD On Fri, Mar 27, 2015 at 1:25 PM, Ondrej Smola ondrej.sm...@gmail.com wrote: Yes, only when using fine grained mode and replication

2 input paths generate 3 partitions

2015-03-27 Thread Rares Vernica
Hello, I am using the Spark shell in Scala on the localhost. I am using sc.textFile to read a directory. The directory looks like this (generated by another Spark script): part-0 part-1 _SUCCESS The part-0 has four short lines of text while part-1 has two short lines of text.

Re: HQL function Rollup and Cube

2015-03-27 Thread Chang Lim
Yes, it works for me. Make sure the Spark machine can access the Hive machine. On Thu, Mar 26, 2015 at 6:55 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Did you manage to connect to the Hive metastore from Spark SQL? I copied the hive conf file into the Spark conf folder but when I run show tables, or do

Re: How to specify the port for AM Actor ...

2015-03-27 Thread Manoj Samel
I looked at the 1.3.0 code and figured out where this can be added. In org.apache.spark.deploy.yarn, ApplicationMaster.scala:282 is: actorSystem = AkkaUtils.createActorSystem("sparkYarnAM", Utils.localHostName, 0, conf = sparkConf, securityManager = securityMgr)._1 If I change it to the below,

Re: 2 input paths generate 3 partitions

2015-03-27 Thread Zhan Zhang
Hi Rares, The number of partitions is controlled by the HDFS input format, and one file may have multiple partitions if it consists of multiple blocks. In your case, I think there is one file with 2 splits. Thanks. Zhan Zhang On Mar 27, 2015, at 3:12 PM, Rares Vernica
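For what it's worth, the split count can also be steered at load time; a small sketch (the path is a placeholder, sc is assumed):

    // Ask for a single minimum partition so small files are not split further;
    // textFile's default minPartitions is min(defaultParallelism, 2).
    val rdd = sc.textFile("/path/to/dir", 1)
    println(rdd.partitions.length)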

Re: Can't access file in spark, but can in hadoop

2015-03-27 Thread Zhan Zhang
Probably a guava version conflict issue. What Spark version did you use, and which Hadoop version was it compiled against? Thanks. Zhan Zhang On Mar 27, 2015, at 12:13 PM, Johnson, Dale daljohn...@ebay.com wrote: Yes, I could recompile the HDFS client with more logging,

RE: 2 input paths generate 3 partitions

2015-03-27 Thread java8964
The files sound too small to be 2 blocks in HDFS. Did you set the defaultParallelism to be 3 in your spark? Yong Subject: Re: 2 input paths generate 3 partitions From: zzh...@hortonworks.com To: rvern...@gmail.com CC: user@spark.apache.org Date: Fri, 27 Mar 2015 23:15:38 + Hi Rares,

Streaming anomaly detection using ARIMA

2015-03-27 Thread Corey Nolet
I want to use ARIMA for a predictive model so that I can take time series data (metrics) and perform a light anomaly detection. The time series data is going to be bucketed to different time units (several minutes within several hours, several hours within several days, several days within several

Setting a custom loss function for GradientDescent

2015-03-27 Thread shmoanne
I am working with the mllib.optimization.GradientDescent class and I'm confused about how to set a custom loss function with setGradient. For instance, if I wanted my loss function to be x^2, how would I go about setting it using setGradient? -- View this message in context:
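In 1.x MLlib the loss enters through a Gradient implementation; a sketch of a squared-error-style loss plugged in via the public runMiniBatchSGD entry point (the class name and toy data are my own, and the exact signatures should be checked against your Spark version):

    import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
    import org.apache.spark.mllib.optimization.{Gradient, GradientDescent, SimpleUpdater}

    // Custom loss L(w) = (w.x - y)^2 with gradient 2(w.x - y)x.
    class SquaredLossGradient extends Gradient {
      override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = {
        val margin = data.toArray.zip(weights.toArray).map { case (x, w) => x * w }.sum
        val diff = margin - label
        (Vectors.dense(data.toArray.map(_ * 2.0 * diff)), diff * diff)
      }
      // In-place overload used by the mini-batch loop: accumulate into cumGradient.
      override def compute(data: Vector, label: Double, weights: Vector,
          cumGradient: Vector): Double = {
        val (grad, loss) = compute(data, label, weights)
        val acc = cumGradient.asInstanceOf[DenseVector].values
        grad.toArray.zipWithIndex.foreach { case (g, i) => acc(i) += g }
        loss
      }
    }

    // Toy data as (label, features) pairs; sc is assumed.
    val points = sc.parallelize(Seq(
      (1.0, Vectors.dense(1.0, 2.0)),
      (0.0, Vectors.dense(3.0, 4.0))))
    val (weights, losses) = GradientDescent.runMiniBatchSGD(
      points, new SquaredLossGradient, new SimpleUpdater,
      0.1, 100, 0.0, 1.0, Vectors.zeros(2))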

Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Xi Shen
Yes, I have done repartition. I tried to repartition to the number of cores in my cluster. Not helping... I tried to repartition to the number of centroids (k value). Not helping... On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley jos...@databricks.com wrote: Can you try specifying the number

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
JIRA ticket created at: https://issues.apache.org/jira/browse/SPARK-6581 Thanks, -- Pei-Lun On Fri, Mar 27, 2015 at 7:03 PM, Cheng Lian lian.cs@gmail.com wrote: Thanks for the information. Verified that the _common_metadata and _metadata file are missing in this case when using Hadoop

Re: 2 input paths generate 3 partitions

2015-03-27 Thread Rares Vernica
Hi, I am not using HDFS, I am using the local file system. Moreover, I did not modify the defaultParallelism. The Spark instance is the default one started by Spark Shell. Thanks! Rares On Fri, Mar 27, 2015 at 4:48 PM, java8964 java8...@hotmail.com wrote: The files sound too small to be 2

unable to read avro file

2015-03-27 Thread Joanne Contact
Hi, I am following the instructions on this website: http://www.infoobjects.com/spark-with-avro/ I installed the spark-avro library from https://github.com/databricks/spark-avro on a machine which only has the Hive gateway client role on a Hadoop cluster. Somehow I got an error on reading the avro file.

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Sean Owen
(I bet the Spark implementation could be improved. I bet GraphX could be optimized.) Not sure about this one, but in-core benchmarks often start by assuming that the data is local. In the real world, data is unlikely to be. The benchmark has to include the cost of bringing all the data to the

Re: Understanding Spark Memory distribution

2015-03-27 Thread Ankur Srivastava
I have increased spark.storage.memoryFraction to 0.4, but I still get OOM errors on Spark executor nodes: 15/03/27 23:19:51 INFO BlockManagerMaster: Updated info of block broadcast_5_piece10 15/03/27 23:19:51 INFO TorrentBroadcast: Reading broadcast variable 5 took 2704 ms 15/03/27 23:19:52
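For reference, a sketch of how those 1.x settings are applied in code (the values are the ones mentioned in this thread):

    import org.apache.spark.SparkConf

    // Executor heap, and the fraction of it reserved for cached/storage blocks.
    val conf = new SparkConf()
      .setAppName("memory-tuning")
      .set("spark.executor.memory", "26g")
      .set("spark.storage.memoryFraction", "0.4")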

Re: unable to read avro file

2015-03-27 Thread Joanne Contact
Never mind, I found my Spark is still 1.2 but the avro library requires 1.3. Will try again. On Fri, Mar 27, 2015 at 9:38 PM, Joanne Contact joannenetw...@gmail.com wrote: Hi, I am following the instructions on this website. http://www.infoobjects.com/spark-with-avro/ I installed the spark-avro

rdd.toDF().saveAsParquetFile(tachyon://host:19998/test)

2015-03-27 Thread sud_self
Spark version is 1.3.0 with Tachyon 0.6.1. QUESTION DESCRIPTION: rdd.saveAsObjectFile("tachyon://host:19998/test") and rdd.saveAsTextFile("tachyon://host:19998/test") succeed, but rdd.toDF().saveAsParquetFile("tachyon://host:19998/test") fails. ERROR MESSAGE: