Re: retrieving all the rows with collect()

2016-02-10 Thread Ted Yu
Mich: When you execute the statements in Spark shell, you would see the types of the intermediate results. scala> val errlog = sc.textFile("/home/john/s.out") errlog: org.apache.spark.rdd.RDD[String] = /home/john/s.out MapPartitionsRDD[1] at textFile at <console>:24 scala> val sed = errlog.filter(line =>
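
A runnable spark-shell sketch of the same pattern (the file path and filter predicate are placeholders); note that collect() pulls every matching row back to the driver, so it is only safe for small results:

```scala
val errlog = sc.textFile("/home/john/s.out")             // path is a placeholder
val sed = errlog.filter(line => line.contains("ERROR"))  // hypothetical predicate
val rows: Array[String] = sed.collect()                  // all rows land on the driver
```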

Re: Rest API for spark

2016-02-10 Thread Ted Yu
Please see this thread: http://search-hadoop.com/m/q3RTtvxWU21wl78x1=Re+Spark+job+submission+REST+API On Wed, Feb 10, 2016 at 3:37 PM, Tracy Li wrote: > Hi Spark Experts, > > I am new for spark and we have requirements to support spark job, jar, sql > etc(submit, manage).

Re: Spark with .NET

2016-02-09 Thread Ted Yu
This thread is related: http://search-hadoop.com/m/q3RTtwp4nR1lugin1=+NET+on+Apache+Spark+ On Tue, Feb 9, 2016 at 11:43 AM, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Hello, > > I want to use Spark (preferable Spark SQL) using C#. Anyone has any > pointers to that? > > Thanks

Re: Dataset joinWith condition

2016-02-09 Thread Ted Yu
Please take a look at: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala val ds1 = Seq(1, 2, 3).toDS().as("a") val ds2 = Seq(1, 2).toDS().as("b") checkAnswer( ds1.joinWith(ds2, $"a.value" === $"b.value", "inner"), On Tue, Feb 9, 2016 at 7:07 AM, Raghava Mutharaju
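
A self-contained sketch of that suite snippet, assuming a Spark 1.6 shell with sqlContext available:

```scala
import sqlContext.implicits._

val ds1 = Seq(1, 2, 3).toDS().as("a")
val ds2 = Seq(1, 2).toDS().as("b")
// joinWith returns a Dataset of pairs, one element from each side
val joined = ds1.joinWith(ds2, $"a.value" === $"b.value", "inner")
joined.collect() // Array((1,1), (2,2))
```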

Re: Spark with .NET

2016-02-09 Thread Ted Yu
nd and use the java api in the backend. >> Warm regards >> Arko >> >> >> On Tue, Feb 9, 2016 at 12:05 PM, Ted Yu <yuzhih...@gmail.com> wrote: >> >>> This thread is related: >>> http://search-hadoop.com/m/q3RTtwp4nR1lugin1=+NET+on+Apache+Spa

Re: Spark Increase in Processing Time

2016-02-09 Thread Ted Yu
e RDD cleanup > should be limited in scope. In the storage tab I see 28 bytes retained in > memory (all other persisted data is size 0). > > > > I will try changing the ttl way up and see if that changes this hockey > stick to a later time. > > > > Do you have other suggestion

Re: Spark with .NET

2016-02-09 Thread Ted Yu
> *Ted* – System.Data.DataSetExtensions is a reference that is > automatically added when a C# project is created in Visual Studio. As > Silvio pointed out below, it is a .NET assembly and not really used by > SparkCLR. > > > > *From:* Silvio Fiorito [mailto:silvio.fior...@gran

Re: Spark 1.6.0 HiveContext NPE

2016-02-05 Thread Ted Yu
Was there any other exception(s) in the client log ? Just want to find the cause for this NPE. Thanks On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] wrote: > I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m > getting a NullPointerException from

Re: Too many open files, why changing ulimit not effecting?

2016-02-05 Thread Ted Yu
bq. and *"session required pam_limits.so"*. What was the second file you modified ? Did you make the change on all the nodes ? Please see the verification step in https://easyengine.io/tutorials/linux/increase-open-files-limit/ On Fri, Feb 5, 2016 at 1:42 AM, Mohamed Nadjib MAMI

Re: library dependencies to run spark local mode

2016-02-04 Thread Ted Yu
Which Spark release are you using ? Is there other clue from the logs ? If so, please pastebin. Cheers On Thu, Feb 4, 2016 at 2:49 AM, Valentin Popov wrote: > Hi all, > > I’m trying run spark on local mode, i using such code: > > SparkConf conf = new
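
For reference, a minimal local-mode setup in Scala (app name is arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("local-test").setMaster("local[*]")
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 10).sum()) // quick sanity check -> 55.0
```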

Re: submit spark job with spcified file for driver

2016-02-04 Thread Ted Yu
Please take a look at: core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala You would see '--files' argument On Thu, Feb 4, 2016 at 2:17 PM, alexeyy3 wrote: > Is it possible to specify a file (with key-value properties) when > submitting > spark app
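
A sketch of the flag in use (the main class, file path, and jar name below are hypothetical); --files ships the listed file into the working directory of the driver and executors:

```
# app.properties, com.example.Main and myapp.jar are placeholders
spark-submit \
  --class com.example.Main \
  --files /local/path/app.properties \
  myapp.jar
```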

Re: Recommended storage solution for my setup (~5M items, 10KB pr.)

2016-02-04 Thread Ted Yu
bq. had a hard time setting it up Mind sharing your experience in more detail :-) If you already have a Hadoop cluster, it should be relatively straightforward to set up. Tuning needs extra effort. On Thu, Feb 4, 2016 at 12:58 PM, habitats wrote: > Hello > > I have ~5

Re: SQL Statement on DataFrame

2016-02-04 Thread Ted Yu
Did you mean using bin/sqlline.py to perform the query ? Have you asked on Phoenix mailing list ? Phoenix has phoenix-spark module. Cheers On Thu, Feb 4, 2016 at 7:28 PM, Nishant Aggarwal wrote: > Dear All, > > I am working on a scenario mentioned below. Need your
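
If the phoenix-spark route is taken, a sketch of the DataFrame read path (format and option names follow the phoenix-spark documentation of that era; the table name and zkUrl are placeholders):

```scala
// Placeholders: table name and ZooKeeper quorum URL
val df = sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", "MY_TABLE")
  .option("zkUrl", "zkhost:2181")
  .load()
df.registerTempTable("my_table") // then query via sqlContext.sql(...)
```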

Re: Reading large set of files in Spark

2016-02-04 Thread Ted Yu
For question #2, see the following method of FileSystem : public abstract boolean delete(Path f, boolean recursive) throws IOException; FYI On Thu, Feb 4, 2016 at 10:58 AM, Akhilesh Pathodia < pathodia.akhil...@gmail.com> wrote: > Hi, > > I am using Spark to read large set of files from
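
In Scala, the delete call might look like the following (the path is hypothetical):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("/data/processed/batch-001"), true) // true = recursive
```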

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-04 Thread Ted Yu
Jay: It would be nice if you can patch Spark with below PR and give it a try. Thanks On Wed, Feb 3, 2016 at 6:03 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Created a pull request: > https://github.com/apache/spark/pull/11066 > > FYI > > On Wed, Feb 3, 2016 at 1:

Re: Using jar bundled log4j.xml on worker nodes

2016-02-04 Thread Ted Yu
Have you taken a look at SPARK-11105 ? Cheers On Thu, Feb 4, 2016 at 9:06 AM, Matthias Niehoff < matthias.nieh...@codecentric.de> wrote: > Hello everybody, > > we’ve bundle our log4j.xml into our jar (in the classpath root). > > I’ve added the log4j.xml to the spark-defaults.conf with > >
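
One commonly suggested workaround, pending SPARK-11105, is to point both JVMs at the configuration explicitly in spark-defaults.conf; whether a bare resource name resolves from the jar's classpath root can depend on the deployment, so treat this as a sketch:

```
spark.driver.extraJavaOptions    -Dlog4j.configuration=log4j.xml
spark.executor.extraJavaOptions  -Dlog4j.configuration=log4j.xml
```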

Re: Memory tuning in spark sql

2016-02-04 Thread Ted Yu
ioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadE

Re: Memory tuning in spark sql

2016-02-04 Thread Ted Yu
Can you provide a bit more detail ? values of the parameters you have tuned log snippets from executors snippet of your code Thanks On Thu, Feb 4, 2016 at 9:48 AM, wrote: > Hi Sir/madam, > Greetings of the day. > > I am working on Spark 1.6.0 with AWS EMR(Elastic

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Ted Yu
s > null now, just from upgrading Spark to 1.6.0. > > From: Ted Yu <yuzhih...@gmail.com> > Date: Wednesday, February 3, 2016 at 12:04 PM > To: Jay Shipper <shipper_...@bah.com> > Cc: "user@spark.apache.org" <user@spark.apache.org> > Subject: [E

Re: clear cache using spark sql cli

2016-02-03 Thread Ted Yu
Have you looked at SPARK-5909 Add a clearCache command to Spark SQL's cache manager On Wed, Feb 3, 2016 at 7:16 PM, fightf...@163.com wrote: > Hi, > How could I clear cache (execute sql query without any cache) using spark > sql cli ? > Is there any command available ? >
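
Assuming the release in use includes SPARK-5909, the SQL-level statements below should work from the spark-sql CLI:

```sql
-- drop every cached table
CLEAR CACHE;
-- or drop a single one
UNCACHE TABLE my_table;
```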

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Ted Yu
concept that can be posted with the > bug. > > From: Ted Yu <yuzhih...@gmail.com> > Date: Wednesday, February 3, 2016 at 3:57 PM > To: Jay Shipper <shipper_...@bah.com> > Cc: "user@spark.apache.org" <user@spark.apache.org> > Subject: Re: [External] Re:

Re: Re: clear cache using spark sql cli

2016-02-03 Thread Ted Yu
sqlContext.clearCache() > Is this right ? In spark-sql cli I can only run some sql queries. So I > want to see if there > are any available options to reach this. > > Best, > Sun. > > ------ > fightf...@163.com > > > *From:* Ted Yu <yu

Re: Spark 1.5.2 memory error

2016-02-03 Thread Ted Yu
Feb 3, 2016 at 1:33 PM, Mohammed Guller <moham...@glassbeam.com> >>> wrote: >>> >>>> Nirav, >>>> >>>> Sorry to hear about your experience with Spark; however, sucks is a >>>> very strong word. Many organizations are processing a lot m

Re: Cassandra BEGIN BATCH

2016-02-03 Thread Ted Yu
Seems you can find faster response on Cassandra Connector mailing list. On Wed, Feb 3, 2016 at 1:45 PM, FrankFlaherty wrote: > Cassandra provides "BEGIN BATCH" and "APPLY BATCH" to perform atomic > execution of multiple statements as below: > > BEGIN BATCH > INSERT

Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Ted Yu
Looks like the NPE came from this line: def conf: HiveConf = SessionState.get().getConf Meaning SessionState.get() returned null. On Wed, Feb 3, 2016 at 8:33 AM, Shipper, Jay [USA] wrote: > I’m upgrading an application from Spark 1.4.1 to Spark 1.6.0, and I’m > getting a

Re: Spark 1.5.2 memory error

2016-02-02 Thread Ted Yu
What value do you use for spark.yarn.executor.memoryOverhead ? Please see https://spark.apache.org/docs/latest/running-on-yarn.html for description of the parameter. Which Spark release are you using ? Cheers On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky wrote: > Can you
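
For illustration, the parameter is typically passed at submit time (the 1024 MB value, main class, and jar name are placeholders):

```
# memoryOverhead is in MB; tune it against YARN container-kill messages
spark-submit --master yarn-client \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --class com.example.Main myapp.jar
```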

Re: Error trying to get DF for Hive table stored HBase

2016-02-02 Thread Ted Yu
Looks like this is related: HIVE-12406 FYI On Tue, Feb 2, 2016 at 1:40 PM, Doug Balog wrote: > I’m trying to create a DF for an external Hive table that is in HBase. > I get the a NoSuchMethodError >

Re: Getting the size of a broadcast variable

2016-02-02 Thread Ted Yu
There is a chance that the log message may change in future releases. Log snooping would then break. FYI On Mon, Feb 1, 2016 at 9:55 PM, Takeshi Yamamuro wrote: > Hi, > > Currently, there is no way to check the size except for snooping INFO-logs > in a driver; > > 16/02/02
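
As an alternative to log snooping, SizeEstimator can give a rough driver-side estimate before broadcasting (it is a DeveloperApi, so treat the number as approximate; the sample data below is arbitrary):

```scala
import org.apache.spark.util.SizeEstimator

val table = (1 to 100000).map(i => i -> i.toString).toMap
println(s"estimated size: ${SizeEstimator.estimate(table)} bytes")
val bc = sc.broadcast(table)
```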

Re: Master failover results in running job marked as "WAITING"

2016-02-02 Thread Ted Yu
bq. Failed to connect to master XXX:7077 Is the 'XXX' above the hostname for the new master ? Thanks On Tue, Feb 2, 2016 at 1:48 AM, Anthony Tang wrote: > Hi - > > I'm running Spark 1.5.2 in standalone mode with multiple masters using > zookeeper for failover. The

Re: SPARK_WORKER_INSTANCES deprecated

2016-02-01 Thread Ted Yu
As the message (from SparkConf.scala) showed, you shouldn't use SPARK_WORKER_INSTANCES any more. FYI On Mon, Feb 1, 2016 at 2:19 PM, Lin, Hao wrote: > Can I still use SPARK_WORKER_INSTANCES in conf/spark-env.sh? the > following is what I’ve got after trying to set this

Re: SPARK_WORKER_INSTANCES deprecated

2016-02-01 Thread Ted Yu
://spark.apache.org/docs/1.5.2/spark-standalone.html > > > > > > *From:* Ted Yu [mailto:yuzhih...@gmail.com] > *Sent:* Monday, February 01, 2016 5:45 PM > *To:* Lin, Hao > *Cc:* user > *Subject:* Re: SPARK_WORKER_INSTANCES deprecated > > > >

Re: Spark Standalone cluster job to connect Hbase is Stuck

2016-02-01 Thread Ted Yu
Is the hbase-site.xml on the classpath of the worker nodes ? Which Spark release are you using ? Cheers On Mon, Feb 1, 2016 at 4:25 PM, sudhir patil wrote: > Spark job on Standalone cluster is Stuck, shows no logs after > "util.AkkaUtils: Connecting to
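
One way to expose hbase-site.xml to both the driver and the executors (the /etc/hbase/conf location, class, and jar names are assumptions):

```
# /etc/hbase/conf is assumed to contain hbase-site.xml
spark-submit \
  --driver-class-path /etc/hbase/conf \
  --conf spark.executor.extraClassPath=/etc/hbase/conf \
  --class com.example.HBaseJob myapp.jar
```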

Re: Spark Standalone cluster job to connect Hbase is Stuck

2016-02-01 Thread Ted Yu
.2, as i cannot upgrade cluster. > https://issues.apache.org/jira/browse/SPARK-6918 > > > > On Tue, Feb 2, 2016 at 8:31 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Is the hbase-site.xml on the classpath of the worker nodes ? >> >> Which Spark release are you u

Re: Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Ted Yu
at's another thing: that the Record case class should be outside. I ran > it as spark-submit. > > Thanks, Alex. > > On Mon, Feb 1, 2016 at 6:41 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Running your sample in spark-shell built in master branch, I got: >> >

Re: Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Ted Yu
On Mon, Feb 1, 2016 at 6:03 PM, Alexandr Dzhagriev <dzh...@gmail.com> > wrote: > >> Hi Ted, >> >> That doesn't help neither as one method delegates to another as far as I >> can see: >> >> def collect_list(columnName: String): Column = >> collect_l

Re: Failed to 'collect_set' with dataset in spark 1.6

2016-02-01 Thread Ted Yu
bq. agg(collect_list("b") Have you tried: agg(collect_list($"b") On Mon, Feb 1, 2016 at 8:50 AM, Alexandr Dzhagriev wrote: > Hello, > > I'm trying to run the following example code: > > import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.{SparkContext,
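
A self-contained version of that suggestion; in Spark 1.6, collect_list/collect_set are backed by Hive UDAFs, so a HiveContext (assumed below as hiveContext) is needed:

```scala
import org.apache.spark.sql.functions.collect_list
import hiveContext.implicits._

val df = Seq(("x", 1), ("x", 2), ("y", 3)).toDF("a", "b")
df.groupBy($"a").agg(collect_list($"b")).show()
// x -> [1, 2], y -> [3]
```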

Re: how to covert millisecond time to SQL timeStamp

2016-02-01 Thread Ted Yu
See related thread on using Joda DateTime: http://search-hadoop.com/m/q3RTtSfi342nveex1=RE+NPE+ when+using+Joda+DateTime On Mon, Feb 1, 2016 at 7:44 PM, Kevin Mellott wrote: > I've had pretty good success using Joda-Time >
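
For epoch milliseconds specifically, the plain JDK also works without Joda (the value below is arbitrary):

```scala
val millis = 1454371200000L             // arbitrary epoch milliseconds
val ts = new java.sql.Timestamp(millis) // maps to Spark SQL's TimestampType
```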

Re: Can't view executor logs in web UI on Windows

2016-02-01 Thread Ted Yu
I did a brief search but didn't find relevant JIRA either. You can create a JIRA and submit pull request for the fix. Cheers > On Feb 1, 2016, at 5:13 AM, Mark Pavey wrote: > > I am running Spark on Windows. When I try to view the Executor logs in the UI > I get

Re: Spark Executor retries infinitely

2016-02-01 Thread Ted Yu
I haven't found a config knob for controlling the retry count after a brief search. According to http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html , the default value for -XX:ParallelGCThreads= seems to be 8. This seems to explain why you got the VM initialization error. FYI On Mon, Feb
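
If capping GC threads explicitly, the flag can be passed through the executor Java options (the thread count, class, and jar names below are illustrative):

```
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:ParallelGCThreads=4" \
  --class com.example.Main myapp.jar
```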

Re: ZlibFactor warning

2016-01-29 Thread Ted Yu
Did the stack trace look like the one from: https://issues.apache.org/jira/browse/HADOOP-12638 Cheers > On Jan 27, 2016, at 1:29 AM, Eli Super wrote: > > > Hi > > I'm running spark locally on win 2012 R2 server > > No hadoop installed > > I'm getting following error :

Re: Spark Algorithms as WEB Application

2016-01-29 Thread Ted Yu
Have you looked at: http://wiki.apache.org/tomcat/OutOfMemory Cheers > On Jan 29, 2016, at 2:44 AM, rahulganesh wrote: > > Hi, > I am currently working on a web application which will call the spark mllib > algorithms using JERSEY ( REST API ). The problem that i am

Re: local class incompatible: stream classdesc serialVersionUID

2016-01-29 Thread Ted Yu
bmit to help avoid version > mismatches in future Spark versions, but that doesn't help my current > situation between 1.5.1 and 1.5.2. > > Any other ideas? Thanks. > On Thu, Jan 28, 2016 at 5:06 PM Ted Yu <yuzhih...@gmail.com> wrote: > >> I am not Scala expert. >&

Re: Parquet block size from spark-sql cli

2016-01-28 Thread Ted Yu
Have you tried the following (sc is SparkContext)? sc.hadoopConfiguration.setInt("parquet.block.size", BLOCK_SIZE) On Thu, Jan 28, 2016 at 9:16 AM, ubet wrote: > Can I set the Parquet block size (parquet.block.size) in spark-sql. We are > loading about 80 table partitions
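
Expanded form of that one-liner, with a purely illustrative 256 MB block size:

```scala
sc.hadoopConfiguration.setInt("parquet.block.size", 256 * 1024 * 1024)
```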

Re: local class incompatible: stream classdesc serialVersionUID

2016-01-28 Thread Ted Yu
I am not a Scala expert. RDD extends Serializable but doesn't have a @SerialVersionUID() annotation. This may explain what you described. One approach is to add @SerialVersionUID so that RDDs have stable serial version UIDs. Cheers On Thu, Jan 28, 2016 at 1:38 PM, Jason Plurad
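
A sketch of the annotation on a user class (the UID value is arbitrary, but must stay fixed across builds to preserve wire compatibility):

```scala
// Pin the UID so recompiles don't change the serialized form
@SerialVersionUID(1L)
class Record(val id: Int, val value: String) extends Serializable
```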

Re: streaming in 1.6.0 slower than 1.5.1

2016-01-28 Thread Ted Yu
bq. The total size by class B is 3GB in 1.5.1 and only 60MB in 1.6.0. From the information you posted, it seems the above is backwards. BTW [B is byte[], not class B. FYI On Thu, Jan 28, 2016 at 11:49 AM, Jesse F Chen wrote: > I ran the same streaming application

Re: building spark 1.6.0 fails

2016-01-28 Thread Ted Yu
I tried the following command: build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 -Dhadoop.version=2.7.0 package -DskipTests I didn't encounter the error you mentioned. bq. Using zinc server for incremental compilation Was it possible that zinc was running before you started the

Re: Getting Exceptions/WARN during random runs for same dataset

2016-01-28 Thread Ted Yu
Did the UnsupportedOperationExceptions happen from the executors on all the nodes or only on one node ? Thanks On Thu, Jan 28, 2016 at 5:13 PM, Khusro Siddiqui wrote: > Hi Everyone, > > Environment used: Datastax Enterprise 4.8.3 which is bundled with Spark > 1.4.1 and scala

Re: Spark 1.5.2 - Programmatically launching spark on yarn-client mode

2016-01-28 Thread Ted Yu
Looks like '--properties-file' is no longer supported. Was it possible that Spark 1.3.1 artifact / dependency leaked into your app ? Cheers On Thu, Jan 28, 2016 at 7:36 PM, Nirav Patel wrote: > Hi, we were using spark 1.3.1 and launching our spark jobs on yarn-client >

Re: Maintain state outside rdd

2016-01-27 Thread Ted Yu
Have you looked at this method ? * Zips this RDD with its element indices. The ordering is first based on the partition index ... def zipWithIndex(): RDD[(T, Long)] = withScope { On Wed, Jan 27, 2016 at 6:03 PM, Krishna wrote: > Hi, > > I've a scenario where I need
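
A quick illustration:

```scala
val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.zipWithIndex().collect()
// Array((a,0), (b,1), (c,2)) -- ordering is by partition, then within partition
```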

Re: Maintain state outside rdd

2016-01-27 Thread Ted Yu
e of some > variable during map(..) phase. > I simplified the scenario in my example by making row_index() increment > "incr" by 1 but in reality, the change to "incr" can be anything. > > On Wed, Jan 27, 2016 at 6:25 PM, Ted Yu <yuzhih...@gmail.com> wrote: >

Re: Having issue with Spark SQL JDBC on hive table !!!

2016-01-27 Thread Ted Yu
In the last snippet, temptable is shown by 'show tables' command. Yet you queried tampTable. I believe this just was typo :-) On Wed, Jan 27, 2016 at 7:07 AM, @Sanjiv Singh wrote: > Hi All, > > I have configured Spark to query on hive table. > > Run the Thrift JDBC/ODBC

Re: org.netezza.error.NzSQLException: ERROR: Invalid datatype - TEXT

2016-01-26 Thread Ted Yu
Please take a look at getJDBCType() in: sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala You can register dialect for Netezza as shown in sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala Cheers On Tue, Jan 26, 2016 at 7:26 AM, kali.tumm...@gmail.com <
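
A hypothetical Netezza dialect along those lines (the object name, URL prefix, and VARCHAR length are all assumptions):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Map StringType to VARCHAR since Netezza rejects the default TEXT type
case object NetezzaDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:netezza")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(4000)", Types.VARCHAR))
    case _          => None
  }
}

JdbcDialects.registerDialect(NetezzaDialect)
```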

Re: ctas fails with "No plan for CreateTableAsSelect"

2016-01-26 Thread Ted Yu
Were you using HiveContext or SQLContext ? Can you show the complete stack trace ? Thanks On Tue, Jan 26, 2016 at 8:00 AM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Hi, > > > > I’m running CTAS, and it fails with “Error: java.lang.AssertionError: > assertion failed: No plan for

Re: ctas fails with "No plan for CreateTableAsSelect"

2016-01-26 Thread Ted Yu
On Tue, Jan 26, 2016 at 8:06 AM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > SQL on beeline and connecting to the thriftserver. > > > > Younes > > > > *From:* Ted Yu [mailto:yuzhih...@gmail.com] > *Sent:* January-26-16 11:05 AM > *To:* Yo

Re: NoSuchMethod from transitive dependency jackson-databind in MaxMind GeoIP2

2016-01-26 Thread Ted Yu
I wonder if the following change would solve the problem you described (by shading jackson.core): diff --git a/pom.xml b/pom.xml index fb77506..32a3237 100644 --- a/pom.xml +++ b/pom.xml @@ -2177,6 +2177,7 @@ org.eclipse.jetty:jetty-util

Re: NoSuchMethod from transitive dependency jackson-databind in MaxMind GeoIP2

2016-01-26 Thread Ted Yu
Jan 26, 2016 at 10:09 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> I wonder if the following change would solve the problem you described >> (by shading jackson.core): >> >> diff --git a/pom.xml b/pom.xml >> index fb77506..32a3237 10064

Re: withColumn

2016-01-26 Thread Ted Yu
A brief search among the Spark source code showed no support for referencing column the way shown in your code. Are you trying to do a join ? Cheers On Tue, Jan 26, 2016 at 1:04 PM, naga sharathrayapati < sharathrayap...@gmail.com> wrote: > I was trying to append a Column to a dataframe df2 by

Re: Issues with Long subtraction in an RDD when utilising tailrecursion

2016-01-26 Thread Ted Yu
bq. (successfulAuction.timestampNanos - auction.timestampNanos) < 1000L && Have you included the above condition into consideration when inspecting timestamps of the results ? On Tue, Jan 26, 2016 at 1:10 PM, Nkechi Achara wrote: > > > down votefavorite >

Re: Spark SQL joins taking too long

2016-01-26 Thread Ted Yu
What's the type of shape column ? Can you disclose what SomeUDF does (by showing the code) ? Cheers On Tue, Jan 26, 2016 at 12:41 PM, raghukiran wrote: > Hi, > > I create two tables, one counties with just one row (it actually has 2k > rows, but I used only one) and

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Ted Yu
nt stages I have seen it in. > > Thanks for the help, I was able to do it for 0.1% of my data. I will create > the JIRA. > > Thanks, > Isaac > > On Jan 25, 2016, at 8:51 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Opening a JIRA is fine. >> >> See

Re: 10hrs of Scheduler Delay

2016-01-25 Thread Ted Yu
> Date: 01/24/2016 2:54 PM (GMT-05:00) > To: Renu Yadav <yren...@gmail.com> > Cc: Darren Govoni <dar...@ontrenet.com>, Muthu Jayakumar > <bablo...@gmail.com>, Ted Yu <yuzhih...@gmail.com>, user@spark.apache.org > Subject: Re: 10hrs of Scheduler Delay > > I

Re: how to build spark with out hive

2016-01-25 Thread Ted Yu
Spark 1.5.2 depends on slf4j 1.7.10. Looks like there was another version of slf4j on the classpath. FYI On Mon, Jan 25, 2016 at 12:19 AM, kevin wrote: > HI,all > I need to test hive on spark ,to use spark as the hive's execute > engine. > I download the spark

Re: Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Ted Yu
Have you noticed the following method of HiveContext ? * Returns a new HiveContext as new session, which will have separated SQLConf, UDF/UDAF, * temporary tables and SessionState, but sharing the same CacheManager, IsolatedClientLoader * and Hive client (both of execution and metadata)
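
A sketch of the session-per-job pattern that method enables (Spark 1.6):

```scala
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val session = hiveContext.newSession() // own SQLConf / temp tables / UDFs
session.sql("SET spark.sql.shuffle.partitions=50") // isolated from other sessions
```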

Re: Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Ted Yu
Have you read this thread ? http://search-hadoop.com/m/uOzYttXZcg1M6oKf2/HDFS+cache=RE+hadoop+hdfs+cache+question+do+client+processes+share+cache+ Cheers On Mon, Jan 25, 2016 at 1:23 PM, Jia Zou wrote: > I configured HDFS to cache file in HDFS's cache, like following:

Re: Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Ted Yu
is used, you can accidentally use too much RAM on the host, resulting in OOM in the JVM which is hard to debug. Cheers On Mon, Jan 25, 2016 at 1:39 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Have you read this thread ? > > > http://search-hadoop.com/m/uOzYttXZcg1M6oKf2/HDFS+cache=RE

Re: Spark master takes more time with local[8] than local[1]

2016-01-24 Thread Ted Yu
bq. I'm reading a 4.3 GB file The contents of the file can be held in one executor. Can you try files with much larger size ? Cheers On Sun, Jan 24, 2016 at 12:11 PM, jimitkr wrote: > Hi All, > > I have a machine with the following configuration: > 32 GB RAM > 500 GB HDD

Re: concurrent.RejectedExecutionException

2016-01-23 Thread Ted Yu
This seems related: SPARK-8029 ShuffleMapTasks must be robust to concurrent attempts on the same executor Mind trying out 1.5.3 or later release ? Cheers On Sat, Jan 23, 2016 at 12:51 AM, Yasemin Kaya wrote: > Hi all, > > I'm using spark 1.5 and getting this error. Could

Re: python - list objects in HDFS directory

2016-01-23 Thread Ted Yu
Is 'hadoop' / 'hdfs' command accessible to your python script ? If so, you can call 'hdfs dfs -ls' from python. Cheers On Sat, Jan 23, 2016 at 4:08 AM, Andrew Holway < andrew.hol...@otternetworks.de> wrote: > Hello, > > I would like to make a list of files (parquet or json) in a specific >

Re: Concatenating tables

2016-01-23 Thread Ted Yu
How about this operation : * Returns a new [[DataFrame]] containing union of rows in this frame and another frame. * This is equivalent to `UNION ALL` in SQL. * @group dfops * @since 1.3.0 */ def unionAll(other: DataFrame): DataFrame = withPlan { FYI On Sat, Jan 23, 2016 at
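
A quick sketch:

```scala
import sqlContext.implicits._

val df1 = Seq((1, "a"), (2, "b")).toDF("id", "v")
val df2 = Seq((3, "c")).toDF("id", "v")
df1.unionAll(df2).show() // columns are matched by position, not by name
```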

Re: Disable speculative retry only for specific stages?

2016-01-22 Thread Ted Yu
Looked at: https://spark.apache.org/docs/latest/configuration.html I don't think Spark supports per stage speculation. On Fri, Jan 22, 2016 at 10:15 AM, Adam McElwee wrote: > I've used speculative execution a couple times in the past w/ good > results, but I have one stage in

Re: storing query object

2016-01-22 Thread Ted Yu
There have been optimizations in this area, such as: https://issues.apache.org/jira/browse/SPARK-8125 You can also look at parent issue. Which Spark release are you using ? > On Jan 22, 2016, at 1:08 AM, Gourav Sengupta > wrote: > > > Hi, > > I have a SPARK

Re: storing query object

2016-01-22 Thread Ted Yu
ction of meta-data so that it does not take such a > long time. > > Please advice. > > Regards, > Gourav > > > > > On Fri, Jan 22, 2016 at 10:15 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> There have been optimizations in this area, such as: >

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ted Yu
get this error, > > but if I do : > rdd.call_to_hbase.collect it throws this error. > > On Wed, Jan 20, 2016 at 6:50 PM Ajinkya Kale <kaleajin...@gmail.com> > wrote: > >> Unfortunately I cannot at this moment (not a decision I can make) :( >> >> On Wed

Re: Spark Streaming : requirement failed: numRecords must not be negative

2016-01-22 Thread Ted Yu
Is it possible to reproduce the condition below with test code ? Thanks On Fri, Jan 22, 2016 at 7:31 AM, Afshartous, Nick wrote: > > Hello, > > > We have a streaming job that consistently fails with the trace below. > This is on an AWS EMR 4.2/Spark 1.5.2 cluster. > >

Re: Date / time stuff with spark.

2016-01-22 Thread Ted Yu
Related thread: http://search-hadoop.com/m/q3RTtSfi342nveex1=RE+NPE+when+using+Joda+DateTime FYI On Fri, Jan 22, 2016 at 6:50 AM, Spencer, Alex (Santander) < alex.spen...@santander.co.uk.invalid> wrote: > Hi Andy, > > Sorry this is in Scala but you may be able to do something similar? I use >

Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
Vivek: I searched for 'cassandra gc pause' and found a few hits. e.g. : http://search-hadoop.com/m/qZFqM1c5nrn1Ihwf6=Re+GC+pauses+affecting+entire+cluster+ Keep in mind the effect of GC on shared nodes. FYI On Fri, Jan 22, 2016 at 7:09 PM, Mohammed Guller wrote: > For

Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
-6 and memory usage > from 1 - 4 gb. > > We have budget to use higher CPU or higher memory systems hence was > planning to have them together on more efficient nodes. > > Regards > Vivek > On Sat, Jan 23, 2016 at 7:13 am, Ted Yu <yuzhih...@gmail.com> wrote: > >

Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
s > Vivek > On Sat, Jan 23, 2016 at 7:57 am, Ted Yu <yuzhih...@gmail.com> wrote: > > From your description, putting Cassandra daemon on Spark cluster should > be feasible. > > One aspect to be measured is how much locality can be achieved in this > setup - Cassandra is dis

Re: Spark Cassandra clusters

2016-01-22 Thread Ted Yu
Can you give us a bit more information ? How much memory does each node have ? What's the current heap allocation for Cassandra process and executor ? Spark / Cassandra release you are using Thanks On Fri, Jan 22, 2016 at 5:37 PM, wrote: > Hi All, > What is the

Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Ted Yu
Please also check AppMaster log. Thanks > On Jan 21, 2016, at 3:51 AM, Akhil Das wrote: > > Can you look in the executor logs and see why the sparkcontext is being > shutdown? Similar discussion happened here previously. >

Re: Spark job stops after a while.

2016-01-21 Thread Ted Yu
Looks like jar containing EsHadoopIllegalArgumentException class wasn't in the classpath. Can you double check ? Which Spark version are you using ? Cheers On Thu, Jan 21, 2016 at 6:50 AM, Guillermo Ortiz wrote: > I'm runing a Spark Streaming process and it stops in a

Re: Spark job stops after a while.

2016-01-21 Thread Ted Yu
at in yarn-client although it has the error it doesn't stop the > execution, but I don't know why. > > > > 2016-01-21 15:55 GMT+01:00 Ted Yu <yuzhih...@gmail.com>: > >> Looks like jar containing EsHadoopIllegalArgumentException class wasn't >> in the classpath. >

Re: spark job submisson on yarn-cluster mode failing

2016-01-21 Thread Ted Yu
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(T

No plan for BroadcastHint when attempting broadcastjoin

2016-01-21 Thread Ted Yu
at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run

Re: java.lang.ArrayIndexOutOfBoundsException when attempting broadcastjoin

2016-01-21 Thread Ted Yu
You were using Kryo serialization ? If you switch to Java serialization, your job should run fine. Which Spark release are you using ? Thanks On Thu, Jan 21, 2016 at 6:59 AM, sebastian.piu wrote: > Hi all, > > I'm trying to work out a problem when using Spark
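
To pin Java serialization explicitly (it is the default unless Kryo was configured):

```scala
val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
```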

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
Can you provide a bit more information ? command line for submitting Spark job version of Spark anything interesting from driver / executor logs ? Thanks On Thu, Jan 21, 2016 at 7:35 AM, Sanders, Isaac B wrote: > Hey all, > > I am a CS student in the United States

Re: updateStateByKey not persisting in Spark 1.5.1

2016-01-21 Thread Ted Yu
ooks like extending my batch duration to 7 seconds is a > work-around. I'd like to build a check for the lack of checkpointing in > our integration tests. Is there a way to parse the DAG at runtime? > > On Wed, Jan 20, 2016 at 2:01 PM Ted Yu <yuzhih...@gmail.com> wrote: > >&g

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
is 1.4.1 > > The logs are full of standard fair, nothing like an exception or even > interesting [INFO] lines. > > Here is the script I am using: > https://gist.github.com/isaacsanders/660f480810fbc07d4df2 > > Thanks > Isaac > > On Jan 21, 2016, at 11:03 AM, Ted Yu <yuzhih..

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
t; Hadoop is: HDP 2.3.2.0-2950 > > Here is a gist (pastebin) of my versions en masse and a stacktrace: > https://gist.github.com/isaacsanders/2e59131758469097651b > > Thanks > > On Jan 21, 2016, at 7:44 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > Looks like you wer

Re: 10hrs of Scheduler Delay

2016-01-21 Thread Ted Yu
itouka/spark/dbscan/exploratoryAnalysis/DistanceToNearestNeighborDriver.scala > > - Isaac > > On Jan 21, 2016, at 10:08 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > You may have noticed the following - did this indicate prolonged > computation in your code ? > &g

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-20 Thread Ted Yu
0.98.0 didn't have fix from HBASE-8 Please upgrade your hbase version and try again. If still there is problem, please pastebin the stack trace. Thanks On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale wrote: > > I have posted this on hbase user list but i thought

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-20 Thread Ted Yu
_CLASSPATH didnt work for me. > > On Wed, Jan 20, 2016 at 6:14 PM Ted Yu <yuzhih...@gmail.com> wrote: > >> 0.98.0 didn't have fix from HBASE-8 >> >> Please upgrade your hbase version and try again. >> >> If still there is problem, please pastebin t

Re: Using Spark, SparkR and Ranger, please help.

2016-01-20 Thread Ted Yu
The tail of the stack trace seems to be chopped off. Can you include the whole trace ? Which version of Spark / Hive / Ranger are you using ? Cheers On Wed, Jan 20, 2016 at 9:42 AM, Julien Carme wrote: > Hello, > > I have been able to use Spark with Apache Ranger. I

Re: How to use scala.math.Ordering in java

2016-01-20 Thread Ted Yu
Please take a look at the following files for some examples: sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java Cheers On Wed, Jan 20, 2016 at 1:03 AM, ddav

Re: updateStateByKey not persisting in Spark 1.5.1

2016-01-20 Thread Ted Yu
This is related: SPARK-6847 FYI On Wed, Jan 20, 2016 at 7:55 AM, Brian London wrote: > I'm running a streaming job that has two calls to updateStateByKey. When > run in standalone mode both calls to updateStateByKey behave as expected. > When run on a cluster,
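
For context, updateStateByKey only works with checkpointing enabled; a minimal sketch (the batch interval and path are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
// Without this, updateStateByKey fails at runtime with a checkpoint error
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")
```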

Re: is Hbase Scan really need thorough Get (Hbase+solr+spark)

2016-01-19 Thread Ted Yu
get(List<Get> gets) will call: Object [] r1 = batch((List)gets); where batch() would do: AsyncRequestFuture ars = multiAp.submitAll(pool, tableName, actions, null, results); ars.waitUntilDone(); multiAp is an AsyncProcess. In short, the client would access the region server for the results.

Re: Is there a test like MiniCluster example in Spark just like hadoop ?

2016-01-18 Thread Ted Yu
Please refer to the following suites: yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala core/src/test/scala/org/apache/spark/scheduler/SparkListenerWithClusterSuite.scala Cheers On Mon, Jan 18, 2016 at 2:14 AM, zml张明磊 wrote: > Hello, > > > >

Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-18 Thread Ted Yu
Can you pastebin the master log from before the error showed up ? The initial message was posted for Spark 1.2.0. Which release of Spark / zookeeper do you use ? Thanks On Mon, Jan 18, 2016 at 6:47 AM, doctorx wrote: > Hi, > I am facing the same issue, with the given error >

Re: Calling SparkContext methods in scala Future

2016-01-18 Thread Ted Yu
externalCallTwo map { dataTwo => println("in map") // prints, so it gets here ... val rddOne = sparkContext.parallelize(dataOne) I don't think you should call methods on sparkContext inside a map function; sparkContext lives on the driver side. Cheers On Mon, Jan 18, 2016 at 6:27 AM,
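
A sketch of the driver-side fix: resolve the Future first, then hand the plain data to sparkContext (names follow the snippet above; the timeout is arbitrary):

```scala
import scala.concurrent.Await
import scala.concurrent.duration._

// Block on the driver until the external call finishes, then parallelize
val dataTwo = Await.result(externalCallTwo, 30.seconds)
val rddTwo = sparkContext.parallelize(dataTwo)
```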

Re: Spark SQL create table

2016-01-18 Thread Ted Yu
Have you taken a look at sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ? You can find examples there. On Mon, Jan 18, 2016 at 9:57 AM, raghukiran wrote: > Is creating a table using the SparkSQLContext currently supported? > > Regards, > Raghu > > > >
