Re: A question about Spark Cluster vs Local Mode

2016-07-27 Thread Mich Talebzadeh
Hi, these are my notes on this topic. - *YARN Cluster Mode:* the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. This is invoked with --master yarn and --deploy-mode

Re: performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Pedro Rodriguez
There are a few blog posts that detail one possible/likely issue, for example: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 TL;DR: The Hadoop libraries Spark uses assume that their input comes from a file system (works with HDFS), however S3 is a key value store, not a
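
A minimal sketch (not from the thread) of one common mitigation for this kind of small-file workload: compact the streaming output into fewer, larger files before the batch job reads it. Paths and numbers are made up, and this addresses the per-file task overhead rather than the S3 listing behaviour the blog post describes.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("compact-small-json").getOrCreate()

  // Each tiny input file becomes at least one task, so thousands of small JSON
  // files mean thousands of tiny tasks; coalesce writes them back out as a
  // handful of larger files for the batch job to read instead.
  val df = spark.read.json("s3a://my-bucket/streaming-output/2016-07-*/")
  df.coalesce(64)
    .write
    .mode("overwrite")
    .parquet("s3a://my-bucket/compacted/2016-07/")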

Re: A question about Spark Cluster vs Local Mode

2016-07-27 Thread Yu Wei
If the cluster runs out of memory, it seems that the executor will be restarted by the cluster manager. Jared, (韦煜) Software developer Interested in open source software, big data, Linux From: Ascot Moss Sent: Thursday, July 28, 2016 9:48:13 AM

performance problem when reading lots of small files created by spark streaming.

2016-07-27 Thread Andy Davidson
I have a relatively small data set, however it is split into many small JSON files. Each file is between maybe 4K and 400K. This is probably a very common issue for anyone using Spark Streaming. My streaming app works fine, however my batch application takes several hours to run. All I am doing

Re: A question about Spark Cluster vs Local Mode

2016-07-27 Thread Andy Davidson
Hi Ascot, when you run in cluster mode it means your cluster manager will cause your driver to execute on one of the workers in your cluster. The advantage of this is you can log on to a machine in your cluster, submit your application and then log out. The application will continue to run.

Re: read only specific jsons

2016-07-27 Thread Cody Koeninger
No, I literally meant filter on _corrupt_record, which has a magic meaning in the DataFrame API to identify lines that didn't match the schema. On Wed, Jul 27, 2016 at 12:19 PM, vr spark wrote: > HI , > I tried and getting exception still..any other suggestion? > > clickDF =
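
A small sketch of the _corrupt_record filter Cody describes, assuming Spark 2.0's JSON reader in a spark-shell (`spark` in scope); the input path is hypothetical.

  import org.apache.spark.sql.functions.col

  val raw = spark.read.json("s3a://my-bucket/clicks/*.json")   // hypothetical path

  // Lines that fail to parse end up with the raw text in _corrupt_record and
  // nulls elsewhere; the column only appears in the inferred schema when at
  // least one such line exists, hence the guard before filtering on it.
  val clean =
    if (raw.columns.contains("_corrupt_record")) raw.filter(col("_corrupt_record").isNull)
    else raw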

A question about Spark Cluster vs Local Mode

2016-07-27 Thread Ascot Moss
Hi, if I submit the same job to Spark in cluster mode, does it mean that in cluster mode it will be run in the cluster's memory pool and will fail if it runs out of the cluster's memory? --driver-memory 64g \ --executor-memory 16g \ Regards

DecisionTree currently only supports maxDepth <= 30

2016-07-27 Thread Ascot Moss
Hi, is there any reason behind the limit of maxDepth <= 30? Can it be deeper? Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: DecisionTree currently only supports maxDepth <= 30, but was given maxDepth = 50. at
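
For reference, a minimal sketch of where maxDepth enters the MLlib API (spark-shell style `sc`, toy data made up for the snippet); values above 30 fail the requirement check quoted above.

  import org.apache.spark.mllib.tree.DecisionTree
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.linalg.Vectors

  // Toy training data, just to keep the snippet self-contained.
  val trainingData = sc.parallelize(Seq(
    LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
    LabeledPoint(0.0, Vectors.dense(0.5, 1.5)),
    LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
    LabeledPoint(1.0, Vectors.dense(1.5, 0.5))
  ))

  // maxDepth = 30 is the largest accepted value; 50 throws the IllegalArgumentException above.
  val model = DecisionTree.trainClassifier(
    trainingData, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
    impurity = "gini", maxDepth = 30, maxBins = 32)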

saveAsTextFile at treeEnsembleModels.scala:447, took 2.513396 s Killed

2016-07-27 Thread Ascot Moss
Hi, please help! When saving the model, I got the following error and cannot save the model to HDFS (my source code, my Spark is v1.6.2): my_model.save(sc, "/my_model") - 16/07/28 08:36:19 INFO TaskSchedulerImpl: Removed TaskSet 69.0, whose tasks have all

Re: How do I download 2.0? The main download page isn't showing it?

2016-07-27 Thread Andrew Ash
You sometimes have to hard refresh to get the page to update. On Wed, Jul 27, 2016 at 5:12 PM, Jim O'Flaherty wrote: > Nevermind, it literally just appeared right after I posted this.

Re: How do I download 2.0? The main download page isn't showing it?

2016-07-27 Thread Jim O'Flaherty
Nevermind, it literally just appeared right after I posted this.

How do I download 2.0? The main download page isn't showing it?

2016-07-27 Thread Jim O'Flaherty
How do I download 2.0? The main download page isn't showing it? And all the other download links point to the same single download page. This is the one I end up at: http://spark.apache.org/downloads.html

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Mich Talebzadeh
And frankly this is becoming some sort of religious argument now. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

how to copy local files to hdfs quickly?

2016-07-27 Thread Andy Davidson
I have a Spark Streaming app that saves JSON files to s3:// . It works fine. Now I need to calculate some basic summary stats and am running into horrible performance problems. I want to run a test to see if reading from HDFS instead of S3 makes a difference. I am able to quickly copy the data

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Sudhir Babu Pothineni
It depends on what you are doing. Here is a recent comparison of ORC and Parquet: https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet Although it is from the ORC authors, I thought it a fair comparison. We use ORC as the System of Record on our Cloudera HDFS cluster; our experience

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread janardhan shetty
Seems like the Parquet format is better compared to ORC when the dataset is log data without nested structures? Is this a fair understanding? On Jul 27, 2016 1:30 PM, "Jörn Franke" wrote: > Kudu has been from my impression be designed to offer somethings between > hbase and

Re: spark-2.x what is the default version of java ?

2016-07-27 Thread Jacek Laskowski
Hi, The default version of Java is 7. It's being discussed when to settle on 8 as the default version. Nobody knows when that will happen. Jacek On 27 Jul 2016 11:00 p.m., "Andy Davidson" wrote: > I currently have to configure spark-1.x to use Java 8 and python 3.x. I

spark-2.x what is the default version of java ?

2016-07-27 Thread Andy Davidson
I currently have to configure spark-1.x to use Java 8 and Python 3.x. I noticed that http://spark.apache.org/releases/spark-release-2-0-0.html#removals mentions Java 7 is deprecated. Is the default now Java 8? Thanks Andy Deprecations The following features have been deprecated in Spark

Run times for Spark 1.6.2 compared to 2.1.0?

2016-07-27 Thread Colin Beckingham
I have a project which runs fine in both Spark 1.6.2 and 2.1.0. It calculates a logistic model using MLlib. I compiled the 2.1.0 from source today and took 1.6.2 as a precompiled version with Hadoop. The odd thing is that on 1.6.2 the project produces an answer in 350 sec and the 2.1.0

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Jörn Franke
Kudu has, from my impression, been designed to offer something between HBase and Parquet for write-intensive loads - it is not faster for warehouse-type querying compared to Parquet (merely slower, because that is not its use case). I assume this is still the strategy of it. For some

Re: Spark Web UI port 4040 not working

2016-07-27 Thread Marius Soutier
That's to be expected - the application UI is not started by the master, but by the driver. So the UI will run on the machine that submits the job. > On 26.07.2016, at 15:49, Jestin Ma wrote: > > I did netstat -apn | grep 4040 on machine 6, and I see > > tcp

Re: spark-2.0 support for spark-ec2 ?

2016-07-27 Thread Nicholas Chammas
Yes, spark-ec2 has been removed from the main project, as called out in the Release Notes: http://spark.apache.org/releases/spark-release-2-0-0.html#removals You can still discuss spark-ec2 here or on Stack Overflow, as before. Bug reports and the like should now go on that AMPLab GitHub project

Re: spark 1.6.0 read s3 files error.

2016-07-27 Thread Andy Davidson
Hi Freedafeng, the following works for me; df will be a DataFrame. fullPath is a list of the various part files stored in S3. fullPath = ['s3n:///json/StreamingKafkaCollector/s1/2016-07-10/146817304/part-r-0-a2121800-fa5b-44b1-a994-67795' ] from pyspark.sql import SQLContext

Re: Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-07-27 Thread Steve Rowe
You can see the source for my configurable transformer bridge to Lucene analysis components here, in my company Lucidworks’ spark-solr project: . Here’s a

Spark 2.0 - JavaAFTSurvivalRegressionExample doesn't work

2016-07-27 Thread Robert Goodman
I tried to run the JavaAFTSurvivalRegressionExample on Spark 2.0 and the example doesn't work. It looks like the problem is that the example is using the MLlib Vector/VectorUDT to create the Dataset, which needs to be converted using MLUtils before being used in the model. I haven't actually tried this
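
A hedged sketch of the conversion Robert describes, using MLUtils.convertVectorColumnsToML from Spark 2.0 to rewrite an old mllib Vector column as the new ml Vector type; the toy data and column names are assumptions (spark-shell style).

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.util.MLUtils

  // A toy DataFrame with an old-style mllib vector column.
  val df = spark.createDataFrame(Seq(
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158))
  )).toDF("label", "censor", "features")

  // Converts the mllib Vector column into an ml Vector column so spark.ml
  // estimators such as AFTSurvivalRegression will accept it.
  val converted = MLUtils.convertVectorColumnsToML(df, "features")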

spark-2.0 support for spark-ec2 ?

2016-07-27 Thread Andy Davidson
Congratulations on releasing 2.0! spark-2.0.0-bin-hadoop2.7 no longer includes the spark-ec2 script. However, http://spark.apache.org/docs/latest/index.html has a link to the spark-ec2 github repo https://github.com/amplab/spark-ec2 Is this the right group to discuss spark-ec2? Any idea how

spark 1.6.0 read s3 files error.

2016-07-27 Thread freedafeng
cdh 5.7.1, pyspark. Code:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('s3 ---')
sc = SparkContext(conf=conf)
myRdd = sc.textFile("s3n:///y=2016/m=5/d=26/h=20/2016.5.26.21.9.52.6d53180a-28b9-4e65-b749-b4a2694b9199.json.gz")
count =

Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-07-27 Thread janardhan shetty
1. Any links or blogs to develop *custom* transformers? e.g. Tokenizer 2. Any links or blogs to develop *custom* estimators? e.g. any ML algorithm
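
For what it's worth, a bare-bones custom transformer can follow the same pattern Tokenizer uses, i.e. extend UnaryTransformer; the sketch below is only a hypothetical whitespace splitter, not production code. A custom estimator additionally implements fit(), returning a Model.

  import org.apache.spark.ml.UnaryTransformer
  import org.apache.spark.ml.param.ParamMap
  import org.apache.spark.ml.util.Identifiable
  import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

  class SimpleSplitter(override val uid: String)
      extends UnaryTransformer[String, Seq[String], SimpleSplitter] {

    def this() = this(Identifiable.randomUID("simpleSplitter"))

    // The per-row function: lower-case and split on whitespace, like a bare-bones Tokenizer.
    override protected def createTransformFunc: String => Seq[String] =
      _.toLowerCase.split("\\s+").toSeq

    override protected def validateInputType(inputType: DataType): Unit =
      require(inputType == StringType, s"Input type must be StringType but got $inputType")

    override protected def outputDataType: DataType = ArrayType(StringType, containsNull = false)

    override def copy(extra: ParamMap): SimpleSplitter = defaultCopy(extra)
  }

  // Usage: new SimpleSplitter().setInputCol("text").setOutputCol("words").transform(df)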

Re: Spark Standalone Cluster: Having a master and worker on the same node

2016-07-27 Thread Mich Talebzadeh
Hi Jestin. As I understand, you are using Spark in standalone mode, meaning that you start your master and slave/worker processes yourself. You can specify the number of workers for each node in the $SPARK_HOME/conf/spark-env.sh file as below # Options for the daemons used in the standalone deploy mode export

Spark Standalone Cluster: Having a master and worker on the same node

2016-07-27 Thread Jestin Ma
Hi, I'm doing performance testing and currently have 1 master node and 4 worker nodes and am submitting in client mode from a 6th cluster node. I know we can have a master and worker on the same node. Speaking in terms of performance and practicality, is it possible/suggested to have another

Re: read only specific jsons

2016-07-27 Thread vr spark
Hi, I tried and am still getting an exception... any other suggestions? clickDF = cDF.filter(cDF['request.clientIP'].isNotNull()) It fails for some cases and errors out with the below message: AnalysisException: u'No such struct field clientIP in cookies, nscClientIP1, nscClientIP2, uAgent;' On Tue, Jul

Re: Possible to push sub-queries down into the DataSource impl?

2016-07-27 Thread Timothy Potter
I'm not looking for a one-off solution for a specific query that can be solved on the client side as you suggest, but rather a generic solution that can be implemented within the DataSource impl itself when it knows a sub-query can be pushed down into the engine. In other words, I'd like to

Building Spark 2 from source that does not include the Hive jars

2016-07-27 Thread Mich Talebzadeh
Hi, this has worked before, including 1.6.1 etc.: build Spark without Hive jars, the idea being to use Spark as the Hive execution engine. There are some notes on Hive on Spark: Getting Started. The usual process is to

Re: Possible to push sub-queries down into the DataSource impl?

2016-07-27 Thread Marco Colombo
Why don't you create a filtered dataframe, register it as a temporary table and then use it in your query? You can also cache it, if multiple queries on the same inner query are requested. On Wednesday, July 27, 2016, Timothy Potter wrote: > Take this simple join: > >
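
A rough sketch of that suggestion against the query from the original message below, assuming the ratings and movies tables are already registered (Spark 2.0, spark-shell style):

  // Materialise the inner aggregation once, cache it, and expose it as a temp view.
  val topRated = spark.sql(
    """SELECT movie_id, COUNT(*) AS aggCount
      |FROM ratings
      |WHERE rating >= 4
      |GROUP BY movie_id
      |ORDER BY aggCount DESC
      |LIMIT 10""".stripMargin)

  topRated.cache()
  topRated.createOrReplaceTempView("top_rated")

  // The outer join then runs against the cached sub-result.
  val joined = spark.sql(
    """SELECT m.title AS title, t.aggCount AS aggCount
      |FROM movies m JOIN top_rated t ON t.movie_id = m.movie_id
      |ORDER BY aggCount DESC""".stripMargin)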

Possible to push sub-queries down into the DataSource impl?

2016-07-27 Thread Timothy Potter
Take this simple join: SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON solr.movie_id = m.movie_id ORDER BY aggCount DESC I would like

Re: Spark 2.0 SparkSession, SparkConf, SparkContext

2016-07-27 Thread Sun Rui
If you want to keep using RDD API, then you still need to create SparkContext first. If you want to use just Dataset/DataFrame/SQL API, then you can directly create a SparkSession. Generally the SparkContext is hidden although it is internally created and held within the SparkSession. Anytime
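
A short sketch of that pattern in Spark 2.0 (names are arbitrary): build one SparkSession and pull the underlying SparkContext out of it when the RDD API is still needed. A SparkConf can still be passed in via SparkSession.builder().config(conf) if required.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("session-demo")
    .getOrCreate()

  val sc  = spark.sparkContext         // the old entry point, still there underneath
  val rdd = sc.parallelize(1 to 10)    // RDD API
  val ds  = spark.range(10)            // Dataset/DataFrame/SQL API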

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread ayan guha
Since everyone here is discussing this ever-changing-for-better-reason topic of storage formats and serdes, any opinions/thoughts/experience with Apache Arrow? It sounds like a nice idea, but how ready is it? On Wed, Jul 27, 2016 at 11:31 PM, Jörn Franke wrote: > Kudu has

Spark 2.0 SparkSession, SparkConf, SparkContext

2016-07-27 Thread Jestin Ma
I know that SparkSession is replacing SQLContext and HiveContext, but what about SparkConf and SparkContext? Are those still relevant in our programs? Thank you! Jestin

Re: Is RowMatrix missing in org.apache.spark.ml package?

2016-07-27 Thread Robin East
Can you use the version from mllib? --- Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action
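
A hedged sketch of that suggestion: RowMatrix still lives in spark.mllib, so build an RDD of mllib Vectors and wrap it (spark-shell style `sc`, toy data made up for the snippet).

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix

  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 3.0),
    Vectors.dense(4.0, 5.0, 6.0)
  ))

  val mat = new RowMatrix(rows)
  val colStats = mat.computeColumnSummaryStatistics()   // e.g. colStats.mean, colStats.variance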

tpcds for spark2.0

2016-07-27 Thread kevin
Hi all, I want to run a test of the TPC-DS 99 SQL queries on Spark 2.0. I use https://github.com/databricks/spark-sql-perf at the master version. When I run: val tpcds = new TPCDS (sqlContext = sqlContext) I got an error: scala> val tpcds = new TPCDS (sqlContext = sqlContext) error: missing or invalid

Re: libraryDependencies

2016-07-27 Thread Jacek Laskowski
Hi, how did you reference "sparksample"? If it ended up in /Users/studio/.sbt/0.13/staging/42f93875138543b4e1d3/sparksample I believe it was referenced as a git-based project in sbt. Is that correct? Also, when you mark the Spark libs as "provided" you won't be able to run Spark apps in sbt. See
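
For context, a minimal build.sbt sketch of the "provided" setup being discussed (names and versions are assumptions); with the provided qualifier, `sbt run` won't see the Spark classes, so it only makes sense for builds that are submitted via spark-submit.

  // build.sbt (sketch)
  name := "sparksample"

  scalaVersion := "2.11.8"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"

  libraryDependencies += "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided"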

Re: spark

2016-07-27 Thread Jacek Laskowski
Hi, are you on Java 7 or 8? Can you include the error just before this "Failed to execute"? There was a build issue with spark-test-tags-2.10 once. Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at

Re:Re:Re: [ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread prosp4300
The page mentioned before is the release notes page that is missing the links: http://spark.apache.org/releases/spark-release-2-0-0.html#mllib At 2016-07-27 15:56:00, "prosp4300" wrote: Additionally, in the paragraph about MLlib, three links are missing; it would be better to provide the

spark

2016-07-27 Thread ناهید بهجتی نجف آبادی
Hi! I have a problem with Spark. Please note that I'm very new to all of this. I downloaded the spark-1.6.2 source code and I want to build it. When I try to build Spark with "./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package", this error shows up: Failed to execute goal

Re: The Future Of DStream

2016-07-27 Thread Chang Chen
Things like Kafka and user-defined sources are not supported yet, just because Structured Streaming is in the alpha stage. Things like sort are not supported because of implementation difficulty, and I don't think DStream can support them either. What I want to know is the difference between API (or

Re: The Future Of DStream

2016-07-27 Thread Ofir Manor
For the 2.0 release, look for "Unsupported Operations" here: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Also, there are bigger gaps - like no Kafka support, no way to plug user-defined sources or sinks etc Ofir Manor Co-Founder & CTO | Equalum Mobile:

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread u...@moosheimer.com
Hi Gourav, Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an in-memory DB with data storage, while Parquet is "only" a columnar storage format. As I understand, Kudu is a BI DB to compete with Exasol or Hana (ok ... that's more a wish :-). Regards, Uwe Mit freundlichen

Re: Using flatMap on Dataframes with Spark 2.0

2016-07-27 Thread Julien Nauroy
Just a follow-up on my last question: the RowEncoder has to be defined AFTER declaring the columns, or else the new columns won't be serialized and will disappear after the flatMap. So the code should look like: var df1 = spark.read.parquet(fileName) df1 = df1.withColumn("newCol",
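
A hedged reconstruction of that ordering, with a made-up file name and column; the key point is that the RowEncoder is built from the schema after the new column has been added, so the encoder knows about it.

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.catalyst.encoders.RowEncoder
  import org.apache.spark.sql.functions.lit

  var df1 = spark.read.parquet("data.parquet")      // placeholder file name
  df1 = df1.withColumn("newCol", lit(0))

  // Encoder defined AFTER the columns are declared, as noted above.
  implicit val enc = RowEncoder(df1.schema)
  val expanded = df1.flatMap(row => Seq(row))       // whatever per-row expansion is needed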

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-27 Thread Nick Pentreath
This is exactly the core problem in the linked issue - normally you would use the TrainValidationSplit or CrossValidator to do hyper-parameter selection using cross-validation. You could tune the factor size, regularization parameter and alpha (for implicit preference data), for example. Because
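
A rough sketch of that tuning setup with the Spark 2.0 ml API; the toy ratings data, column names and grid values are placeholders, not recommendations (spark-shell style).

  import org.apache.spark.ml.evaluation.RegressionEvaluator
  import org.apache.spark.ml.recommendation.ALS
  import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
  import spark.implicits._

  val ratings = Seq((0, 1, 4.0), (0, 2, 1.0), (1, 1, 5.0), (1, 2, 2.0))
    .toDF("userId", "movieId", "rating")            // stand-in for real data

  val als = new ALS()
    .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")

  // Grid over factor size, regularization and alpha (alpha matters for implicit data).
  val grid = new ParamGridBuilder()
    .addGrid(als.rank, Array(10, 50))
    .addGrid(als.regParam, Array(0.01, 0.1))
    .addGrid(als.alpha, Array(1.0, 10.0))
    .build()

  val evaluator = new RegressionEvaluator()
    .setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")

  val tvs = new TrainValidationSplit()
    .setEstimator(als).setEstimatorParamMaps(grid)
    .setEvaluator(evaluator).setTrainRatio(0.8)

  val model = tvs.fit(ratings)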

Re: The Future Of DStream

2016-07-27 Thread Chang Chen
I don't understand what kind of low-level control DStream offers that Structured Streaming does not. Thanks Chang On Wednesday, July 27, 2016, Matei Zaharia wrote: > Yup, they will definitely coexist. Structured Streaming is currently alpha > and will probably be

Re:Re: [ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread prosp4300
Additionally, in the paragraph about MLlib, three links are missing; it would be better to provide the links to give us more information, thanks a lot: "See this blog post for details", "See this talk to learn more", "This talk lists many of these new features." At 2016-07-27 15:18:41, "Ofir Manor"

Re: Setting spark.sql.shuffle.partitions Dynamically

2016-07-27 Thread Takeshi Yamamuro
Hi, how about trying adaptive execution in Spark? https://issues.apache.org/jira/browse/SPARK-9850 This feature is turned off by default because it seems experimental. // maropu On Wed, Jul 27, 2016 at 3:26 PM, Brandon White wrote: > Hello, > > My platform runs

Re:Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread prosp4300
Thanks for this immediate correction :) At 2016-07-27 15:17:54, "Gourav Sengupta" wrote: Sorry, in my email above I was referring to KUDU, and there it goes: how can KUDU be right if it is first mentioned in forums with a wrong spelling? It's got a difficult beginning

Re: The Future Of DStream

2016-07-27 Thread Matei Zaharia
Yup, they will definitely coexist. Structured Streaming is currently alpha and will probably be complete in the next few releases, but Spark Streaming will continue to exist, because it gives the user more low-level control. It's similar to DataFrames vs RDDs (RDDs are the lower-level API for

Re: [ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread Ofir Manor
Hold the release! There is a minor documentation issue :) But seriously, congrats to all on this massive achievement! Anyway, I think it would be very helpful to add a link to the Structured Streaming Developer Guide (Alpha) to both the documentation home page and the beginning of the "old"

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Gourav Sengupta
Sorry, in my email above I was referring to KUDU, and there it goes: how can KUDU be right if it is first mentioned in forums with a wrong spelling? It's got a difficult beginning where people were trying to figure out its name. Regards, Gourav Sengupta On Wed, Jul 27, 2016 at 8:15 AM, Gourav

Re: ORC v/s Parquet for Spark 2.0

2016-07-27 Thread Gourav Sengupta
Gosh, whether ORC came from this or that, it runs queries in HIVE with TEZ at a speed that is better than SPARK. Has anyone heard of KUDA? It's better than Parquet. But I think that someone might just start saying that KUDA has a difficult lineage as well. After all, dynastic rules dictate.

Re:[ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread prosp4300
Congratulations! At 2016-07-27 14:00:22, "Reynold Xin" wrote: Hi all, Apache Spark 2.0.0 is the first release of the Spark 2.x line. It includes 2500+ patches from 300+ contributors. To download Spark 2.0, head over to the download page:

Re: The Future Of DStream

2016-07-27 Thread Ofir Manor
Structured Streaming in 2.0 is declared as alpha - plenty of bits still missing: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html I assume that it will be declared stable / GA in a future 2.x release, and then it will co-exist with DStream for quite a while before

Re: How to give name to Spark jobs shown in Spark UI

2016-07-27 Thread unk1102
Thanks Rahul, but I think you didn't read the question properly. I have one main Spark job which I name using the approach you described. As part of the main Spark job I create multiple threads which essentially become child Spark jobs, and those jobs have no direct way of being named. On Jul 27, 2016 11:17,
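
One possible way to label those child jobs (a sketch, not from the thread): SparkContext.setJobGroup sets a thread-local group id and description, so each worker thread can tag the jobs it triggers and they show up under that description in the UI. Assumes a spark-shell style `sc`.

  import java.util.concurrent.Executors

  val pool = Executors.newFixedThreadPool(4)

  (1 to 4).foreach { i =>
    pool.submit(new Runnable {
      override def run(): Unit = {
        // Thread-local: only jobs submitted from this thread get this label.
        sc.setJobGroup(s"child-$i", s"child job $i: per-partition aggregation")
        sc.parallelize(1 to 1000).map(_ * i).count()
      }
    })
  }

  pool.shutdown()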

Re: How to export a project to a JAR in Scala IDE for eclipse Correctly?

2016-07-27 Thread Sachin Mittal
Why don't you install sbt and try sbt assembly to create a Scala jar? You can use this jar with your spark-submit jobs. In case there are additional dependencies, these can be passed via the --jars (comma-separated jar paths) option to spark-submit. On Wed, Jul 27, 2016 at 11:53 AM,
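
A minimal sketch of that sbt-assembly route (plugin version, project name and Spark/Scala versions are assumptions): add the plugin, mark Spark itself as provided so it isn't bundled into the fat jar, then run `sbt assembly` and hand the resulting jar to spark-submit.

  // project/plugins.sbt (sketch)
  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

  // build.sbt (sketch)
  name := "my-spark-app"

  scalaVersion := "2.10.6"

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"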

Setting spark.sql.shuffle.partitions Dynamically

2016-07-27 Thread Brandon White
Hello, my platform runs hundreds of Spark jobs every day, each with its own data size from 20 MB to 20 TB. This means that we need to set resources dynamically. One major pain point for doing this is spark.sql.shuffle.partitions, the number of partitions to use when shuffling data for joins or
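
One possible pattern (an illustration, not from the thread) is to size spark.sql.shuffle.partitions per job from the input size before running the joins; the 128 MB-per-partition target below is an arbitrary assumption, the path is hypothetical, and a spark-shell style `spark` session is assumed.

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Rough input size taken from the filesystem.
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val inputBytes = fs.getContentSummary(new Path("/data/events/2016-07-27")).getLength

  // Aim for roughly 128 MB per shuffle partition, within sane bounds.
  val targetBytesPerPartition = 128L * 1024 * 1024
  val partitions = math.max(1, math.min(20000, (inputBytes / targetBytesPerPartition).toInt))

  // Runtime-settable in Spark 2.0; takes effect for subsequent shuffles in this session.
  spark.conf.set("spark.sql.shuffle.partitions", partitions.toString)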

How to export a project to a JAR in Scala IDE for eclipse Correctly?

2016-07-27 Thread luohui20001
Hi there: I export a project into a jar like this: "right click my project -> choose export -> java -> jar file -> next -> choose "src/main/resources" and "src/main/scala" -> click browse and choose a jar file export location -> choose overwrite it", and this jar is unable to run with "java -jar

[ANNOUNCE] Announcing Apache Spark 2.0.0

2016-07-27 Thread Reynold Xin
Hi all, Apache Spark 2.0.0 is the first release of the Spark 2.x line. It includes 2500+ patches from 300+ contributors. To download Spark 2.0, head over to the download page: http://spark.apache.org/downloads.html To view the release notes: http://spark.apache.org/releases/spark-release-2-0-0.html