Re: Thread-safety of a SparkListener

2016-04-01 Thread Ted Yu
In general, you should implement thread-safety in your code. Which set of events are you interested in ? Cheers On Fri, Apr 1, 2016 at 9:23 AM, Truong Duc Kien wrote: > Hi, > > I need to gather some metrics using a SparkListener. Does the callback > methods need to
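
A minimal sketch of the pattern, assuming the metric is a simple counter: listener callbacks can arrive on more than one thread, so shared state should use atomics or synchronization.

    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Hypothetical metrics listener: counts finished tasks thread-safely.
    class TaskCountListener extends SparkListener {
      private val tasksEnded = new AtomicLong(0L)
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        tasksEnded.incrementAndGet()  // safe even if callbacks overlap
      }
      def count: Long = tasksEnded.get()
    }

    // sc.addSparkListener(new TaskCountListener)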

Re: [SQL] A bug with withColumn?

2016-03-31 Thread Ted Yu
Looks like this is the result of the following check: val shouldReplace = output.exists(f => resolver(f.name, colName)) if (shouldReplace) { where the existing column, text, was replaced. On Thu, Mar 31, 2016 at 12:08 PM, Jacek Laskowski wrote: > Hi, > > Just ran into the
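
In other words, withColumn replaces an existing column when the name matches under the resolver instead of appending a duplicate. A small sketch of the behavior on a toy DataFrame:

    import org.apache.spark.sql.functions.upper

    case class Text(id: Int, text: String)
    val df = sqlContext.createDataFrame(Seq(Text(0, "hello"), Text(1, "world")))

    // "text" already exists, so it is replaced, not duplicated:
    val replaced = df.withColumn("text", upper(df("text")))
    replaced.printSchema()  // still two columns: id, text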

Re: Disk Full on one Worker is leading to Job Stuck and Executor Unresponsive

2016-03-31 Thread Ted Yu
Can you show the stack trace ? The log message came from DiskBlockObjectWriter#revertPartialWritesAndClose(). Unfortunately, the method doesn't throw exception, making it a bit hard for caller to know of the disk full condition. On Thu, Mar 31, 2016 at 11:32 AM, Abhishek Anand

Re: Problem with jackson lib running on spark

2016-03-31 Thread Ted Yu
Please exclude jackson-databind - that was where the AnnotationMap class comes from. On Thu, Mar 31, 2016 at 11:37 AM, Marcelo Oikawa < marcelo.oik...@webradar.com> wrote: > Hi, Alonso. > > As you can see jackson-core is provided by several libraries, try to >> exclude it from spark-core, i

Re: Problem with jackson lib running on spark

2016-03-31 Thread Ted Yu
Spark 1.6.1 uses this version of jackson: 2.4.4 Looks like Tranquility uses different version of jackson. How do you build your jar ? Consider using maven-shade-plugin to resolve the conflict if you use maven. Cheers On Thu, Mar 31, 2016 at 9:50 AM, Marcelo Oikawa
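
The thread recommends maven-shade-plugin; if the jar were built with sbt-assembly instead, the analogous relocation would look roughly like this (a sketch, not tied to Tranquility's actual coordinates):

    // build.sbt (sbt-assembly): relocate Jackson so the app's copy
    // cannot collide with the 2.4.4 that ships with Spark 1.6.1.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.fasterxml.jackson.**" -> "shadedjackson.@1").inAll
    )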

Re: Select per Dataset attribute (Scala) not possible? Why no Seq().as[type] for Datasets?

2016-03-31 Thread Ted Yu
I tried this: scala> final case class Text(id: Int, text: String) warning: there was one unchecked warning; re-run with -unchecked for details defined class Text scala> val ds = Seq(Text(0, "hello"), Text(1, "world")).toDF.as[Text] ds: org.apache.spark.sql.Dataset[Text] = [id: int, text: string]
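
Continuing that session, a per-attribute select needs a typed column in the Spark 1.6 Dataset API; a sketch:

    import sqlContext.implicits._  // already in scope in the shell

    // select one attribute back out as a typed Dataset:
    val texts = ds.select($"text".as[String])
    texts.collect()  // Array(hello, world)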

Re: Unable to Run Spark Streaming Job in Hadoop YARN mode

2016-03-31 Thread Ted Yu
Looking through https://spark.apache.org/docs/latest/configuration.html#spark-streaming , I don't see config specific to YARN. Can you pastebin the exception you saw ? When the job stopped, was there any error ? Thanks On Wed, Mar 30, 2016 at 10:57 PM, Soni spark

Re: Loading multiple packages while starting spark-shell

2016-03-30 Thread Ted Yu
How did you specify the packages ? See the following from https://spark.apache.org/docs/latest/submitting-applications.html : Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates with --packages. On Wed, Mar 30, 2016 at 7:15 AM, Mustafa Elbehery

Re: spark 1.5.2 - value filterByRange is not a member of org.apache.spark.rdd.RDD[(myKey, myData)]

2016-03-30 Thread Ted Yu
Have you tried the following construct ? new OrderedRDDFunctions[K, V, (K, V)](rdd).sortByKey() See core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala On Wed, Mar 30, 2016 at 5:20 AM, Nirav Patel wrote: > Hi, I am trying to use filterByRange feature of
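
filterByRange comes from OrderedRDDFunctions, and the implicit conversion to it only applies when the key type has an Ordering in scope; for a custom key class that usually means defining one. A sketch with a hypothetical MyKey:

    import org.apache.spark.rdd.RDD

    case class MyKey(id: Int)

    // Without an implicit Ordering[MyKey], rddToOrderedRDDFunctions does not
    // kick in and filterByRange is "not a member" of the RDD.
    implicit val myKeyOrdering: Ordering[MyKey] = Ordering.by(_.id)

    val rdd: RDD[(MyKey, String)] =
      sc.parallelize(Seq(MyKey(1) -> "a", MyKey(5) -> "b", MyKey(9) -> "c"))
    val inRange = rdd.sortByKey().filterByRange(MyKey(2), MyKey(8))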

Re: Unable to set cores while submitting Spark job

2016-03-30 Thread Ted Yu
-c CORES, --cores CORES Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker bq. sc.getConf().set() I think you should use this pattern (shown in https://spark.apache.org/docs/latest/spark-standalone.html): val conf = new SparkConf()
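
A sketch of the pattern from the standalone docs: set the property on the SparkConf before the context is created. Calling set() on sc.getConf() afterwards has no effect, since getConf returns a copy.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")  // hypothetical master URL
      .setAppName("CoreLimitedApp")
      .set("spark.cores.max", "4")       // cap the cores this app may claim
    val sc = new SparkContext(conf)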

Re: Unable to execute query on SAPHANA using SPARK

2016-03-29 Thread Ted Yu
As the error said, com.sap.db.jdbc.topology.Host is not serializable. Maybe post question on Sap Hana mailing list (if any) ? On Tue, Mar 29, 2016 at 7:54 AM, reena upadhyay < reena.upadh...@impetus.co.in> wrote: > I am trying to execute query using spark sql on SAP HANA from spark > shell. I

Re: How to reduce the Executor Computing Time.

2016-03-29 Thread Ted Yu
Can you disclose snippet of your code ? Which Spark release do you use ? Thanks > On Mar 29, 2016, at 3:42 AM, Charan Adabala wrote: > > From the below image how can we reduce the computing time for the stages, at > some stages the Executor Computing Time is less than

Re: PySpark saving to MongoDB: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)

2016-03-28 Thread Ted Yu
See this method: lazy val rdd: RDD[T] = { On Mon, Mar 28, 2016 at 6:30 PM, Russell Jurney wrote: > Ok, I'm also unable to save to Elasticsearch using a dataframe's RDD. This > seems related to DataFrames. Is there a way to convert a DataFrame's RDD to > a 'normal'

Re: println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread Ted Yu
Can you describe what gets triggered by triggerAndWait ? Cheers On Mon, Mar 28, 2016 at 1:39 PM, kpeng1 wrote: > Hi All, > > I am currently trying to debug a spark application written in scala. I > have > a main method: > def main(args: Array[String]) { > ... >

Re: org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray error in newly built HBase

2016-03-28 Thread Ted Yu
Dropping dev@. Can you provide a bit more information ? Release of HBase; release of Hadoop. I assume you're running on Linux. Any change in Linux setup before the exception showed up ? On Mon, Mar 28, 2016 at 10:30 AM, beeshma r wrote: > Hi > i am testing with newly build

Re: Aggregate subsequent x row values together.

2016-03-28 Thread Ted Yu
Can you describe your use case a bit more ? Since the row keys are not sorted in your example, there is a chance that you get indeterministic results when you aggregate on groups of two successive rows. Thanks On Mon, Mar 28, 2016 at 9:21 AM, sujeet jog wrote: > Hi, > >

Re: Custom RDD in spark, cannot find custom method

2016-03-28 Thread Ted Yu
tion and add the MyRDD.scala to the >> project then the custom method can be called in the main function and it >> works. >> I misunderstand the usage of custom rdd, the custom rdd does not have to be >> written to the spark project like UnionRDD, CogroupedRDD, and

Re: Custom RDD in spark, cannot find custom method

2016-03-27 Thread Ted Yu
; *4 myrdd: org.apache.spark.rdd.RDD[(Int, String)] = MyRDD[3]* at > customable at > 5 :28 > 6 scala> *myrdd.customMethod(bulk)* > *7 error: value customMethod is not a member of > org.apache.spark.rdd.RDD[(Int, String)]* > > On Mon, Mar 28, 2016 at 12:50 AM, Ted Yu <yu

Re: Custom RDD in spark, cannot find custom method

2016-03-27 Thread Ted Yu
dd.customMethod(bulk)* > *error: value customMethod is not a member of > org.apache.spark.rdd.RDD[(Int, String)]* > > and the customable method in PairRDDFunctions.scala is > > def customable(partitioner: Partitioner): RDD[(K, V)] = self.withScope { > new MyRDD[K, V](self, partition
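
The declared return type there is the base RDD[(K, V)], which statically hides anything MyRDD adds, hence "value customMethod is not a member". A sketch of the fix, keeping the thread's hypothetical MyRDD:

    // Declare the concrete type so callers can see MyRDD's extra methods:
    def customable(partitioner: Partitioner): MyRDD[K, V] = self.withScope {
      new MyRDD[K, V](self, partitioner)
    }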

Re: Custom RDD in spark, cannot find custom method

2016-03-27 Thread Ted Yu
Can you show the full stack trace (or top 10 lines) and the snippet using your MyRDD ? Thanks On Sun, Mar 27, 2016 at 9:22 AM, Tenghuan He wrote: > ​Hi everyone, > > I am creating a custom RDD which extends RDD and add a custom method, > however the custom method

Re: whether a certain piece can be assigned to a specified node by some code in my program.

2016-03-27 Thread Ted Yu
Please take a look at the MyRDD class in: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala There is scaladoc for the class. See how getPreferredLocations() is implemented. Cheers On Sun, Mar 27, 2016 at 2:01 AM, chenyong wrote: > Thank you Ted for

Re: Hive on Spark engine

2016-03-26 Thread Ted Yu
According to: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_HDP_RelNotes/bk_HDP_RelNotes-20151221.pdf Spark 1.5.2 comes out of box. Suggest moving questions on HDP to Hortonworks forum. Cheers On Sat, Mar 26, 2016 at 3:32 PM, Mich Talebzadeh wrote: >

Re: whether a certain piece can be assigned to a specified node by some code in my program.

2016-03-26 Thread Ted Yu
Please take a look at the following method: /** * Get the preferred locations of a partition, taking into account whether the * RDD is checkpointed. */ final def preferredLocations(split: Partition): Seq[String] = { checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
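
A sketch of steering partitions to specific nodes by overriding getPreferredLocations in a custom RDD (host names are hypothetical):

    import scala.reflect.ClassTag
    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // Wrap a parent RDD and pin each partition index to a chosen host.
    class PinnedRDD[T: ClassTag](parent: RDD[T], hosts: Map[Int, String])
        extends RDD[T](parent) {
      override def getPartitions: Array[Partition] = parent.partitions
      override def compute(split: Partition, context: TaskContext): Iterator[T] =
        parent.iterator(split, context)
      override def getPreferredLocations(split: Partition): Seq[String] =
        hosts.get(split.index).toSeq  // e.g. Map(0 -> "node-1", 1 -> "node-2")
    }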

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-26 Thread Ted Yu
%20Scala.pdf > > > -- Forwarded message -- > From: Ted Yu <yuzhih...@gmail.com> > Date: 26 March 2016 at 12:51 > Subject: Re: Any plans to migrate Transformer API to Spark SQL (closer to > DataFrames)? > To: Michał Zieliński <zielinski.mich...@gmail.co

Re: Is this expected in Spark 1.6.1, derby.log file created when spark shell starts

2016-03-26 Thread Ted Yu
Same with master branch. I found derby.log in the following two files: .gitignore:derby.log dev/.rat-excludes:derby.log FYI On Sat, Mar 26, 2016 at 4:09 AM, Mich Talebzadeh wrote: > Having moved to Spark 1.6.1, I have noticed thar whenerver I start a > spark-sql or

Re: Hive table created by Spark seems to end up in default

2016-03-25 Thread Ted Yu
Session management has improved in 1.6.x (see SPARK-10810) Mind giving 1.6.1 a try ? Thanks On Fri, Mar 25, 2016 at 3:48 PM, Mich Talebzadeh wrote: > I have noticed that the only sure way to specify a Hive table from Spark > is to prefix it with database (DBName)

Re: Finding out the time a table was created

2016-03-25 Thread Ted Yu
Jd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 25 March 2016 at 22:40, Ted Yu <yuzhih...@gmail.com> wrote: > >> Looks like database support

Re: Finding out the time a table was created

2016-03-25 Thread Ted Yu
finds and gets the table > info OK > > HTH > > > On Friday, 25 March 2016, 22:32, Ted Yu <yuzhih...@gmail.com> wrote: > > > Which release of Spark do you use, Mich ? > > In master branch, the message is more accurate > (sql/catalyst/src/main/scala/org/apa

Re: Finding out the time a table was created

2016-03-25 Thread Ted Yu
Which release of Spark do you use, Mich ? In master branch, the message is more accurate (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala): override def getMessage: String = s"Table $table not found in database $db" On Fri, Mar 25, 2016 at 3:21

Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread Ted Yu
efinitely solve my problem > > I have one more question .. if i want to launch a spark application in > production environment so is there any other way so multiple users can > submit there job without having hadoop configuration . > > Regards > Prateek > > > On

Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtAvwgE7dEI02 On Fri, Mar 25, 2016 at 10:39 AM, prateek arora wrote: > Hi > > I want to submit spark application from outside of spark clusters . so > please help me to provide a information regarding this. > >

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Ted Yu
This is the original subject of the JIRA: Partition discovery fail if there is a _SUCCESS file in the table's root dir If I remember correctly, there were discussions on how (traditional) partition discovery slowed down Spark jobs. Cheers On Fri, Mar 25, 2016 at 10:15 AM, suresk

Re: This simple UDF is not working!

2016-03-25 Thread Ted Yu
Do you mind showing body of TO_DATE() ? Thanks On Fri, Mar 25, 2016 at 7:38 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Looks like you forgot an import for Date. > > FYI > > On Fri, Mar 25, 2016 at 7:36 AM, Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: >

Re: This simple UDF is not working!

2016-03-25 Thread Ted Yu
Looks like you forgot an import for Date. FYI On Fri, Mar 25, 2016 at 7:36 AM, Mich Talebzadeh wrote: > > > Hi, > > writing a UDF to convert a string into Date > > def ChangeDate(word : String) : Date = { > | return >
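
A sketch of the UDF with the missing import added; java.sql.Date is the type Spark SQL maps to DateType, and the input format here is assumed:

    import java.sql.Date
    import java.text.SimpleDateFormat

    def changeDate(word: String): Date = {
      val parsed = new SimpleDateFormat("dd/MM/yyyy").parse(word)  // assumed format
      new Date(parsed.getTime)
    }
    sqlContext.udf.register("ChangeDate", changeDate _)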

Re: Best way to determine # of workers

2016-03-25 Thread Ted Yu
Here is the doc for defaultParallelism : /** Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD). */ def defaultParallelism: Int = { What if the user changes parallelism ? Cheers On Fri, Mar 25, 2016 at 5:33 AM, manasdebashiskar

Re: What's the benifit of RDD checkpoint against RDD save

2016-03-24 Thread Ted Yu
the chain being > purely transformations, then checkpointing instead of saving still wouldn't > execute any action on the RDD -- it would just mark the point at which > checkpointing should be done when an action is eventually run. > > On Wed, Mar 23, 2016 at 7:38 PM, Ted Yu <yuzhih...@gmail.com&g

Re: Extending Spark REST API

2016-03-24 Thread Ted Yu
bq. getServletHandlers is not intended for public use From MetricsSystem.scala: private[spark] class MetricsSystem private ( Looks like there is no easy way to extend REST API. On Thu, Mar 24, 2016 at 1:09 PM, Sebastian Kochman < sebastian.koch...@outlook.com> wrote: > Hello, > I have a

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Ted Yu
ery...").show() /** works */ > > sqlContext.cacheTable("table") > > sqlContext.sql("...complex query...").show() /** works */ > > sqlContext.sql("...complex query...").show() /** fails */ > > > > On 24.03.2016 13:40, Ted Yu wrote:

Re: apache spark errors

2016-03-24 Thread Ted Yu
lue() > { > return new Jedis("10.101.41.19",6379); > } > }; > > > > and then > > > > .foreachRDD(new VoidFunction<JavaRDD>() { > public void call(JavaRDD rdd) throws Exception { > > for (TopData t: rdd.take(t

Re: apache spark errors

2016-03-24 Thread Ted Yu
s > jedis > 2.8.0 > jar > compile > > > > > > > > How can I look at those tasks? > > > > *Van:* Ted Yu [mailto:yuzhih...@gmail.com] > *Verzonden:* donderdag 24 maart 2016 14:33 > *Aan:* Michel Hubert <mich...@

Re: apache spark errors

2016-03-24 Thread Ted Yu
Which release of Spark are you using ? Have you looked at the tasks whose IDs were printed to see if there was more clue ? Thanks On Thu, Mar 24, 2016 at 6:12 AM, Michel Hubert wrote: > HI, > > > > I constantly get these errors: > > > > 0[Executor task launch worker-15]

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Ted Yu
Can you pastebin the stack trace ? If you can show snippet of your code, that would help give us more clue. Thanks > On Mar 24, 2016, at 2:43 AM, Mohamed Nadjib MAMI wrote: > > Hi all, > I'm running SQL queries (sqlContext.sql()) on Parquet tables and facing a >

Re: Write RDD to Elasticsearch

2016-03-24 Thread Ted Yu
Consider using their forum: https://discuss.elastic.co/c/elasticsearch > On Mar 24, 2016, at 3:09 AM, Jakub Stransky wrote: > > Hi, > > I am trying to write JavaPairRDD into elasticsearch 1.7 using spark 1.2.1 > using elasticsearch-hadoop 2.0.2 > >

Re: What's the benifit of RDD checkpoint against RDD save

2016-03-23 Thread Ted Yu
> before any job" comment pose any restriction in this case since no jobs > have yet been executed on the RDD. > > On Wed, Mar 23, 2016 at 7:18 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> See the doc for checkpoint: >> >>* Mark this RDD

Re: What's the benifit of RDD checkpoint against RDD save

2016-03-23 Thread Ted Yu
See the doc for checkpoint: * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint * directory set with `SparkContext#setCheckpointDir` and all references to its parent * RDDs will be removed. *This function must be called before any job has been* * *
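
A sketch of the required order of operations: set the directory and mark the RDD before the first action, which then materializes the checkpoint.

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical path

    val rdd = sc.textFile("hdfs:///data/input").map(_.toUpperCase)
    rdd.cache()        // avoids recomputing the lineage for the checkpoint write
    rdd.checkpoint()   // must come before any job runs on this RDD
    rdd.count()        // first action triggers the checkpoint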

Re: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Ted Yu
SPARK-13383 is fixed in 2.0 only, as of this moment. Any chance of backporting to branch-1.6 ? Thanks On Wed, Mar 23, 2016 at 4:20 PM, Davies Liu wrote: > On Wed, Mar 23, 2016 at 10:35 AM, Yong Zhang wrote: > > Here is the output: > > > > ==

Re: Spark with Druid

2016-03-23 Thread Ted Yu
Please see: https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani which references https://github.com/SparklineData/spark-druid-olap On Wed, Mar 23, 2016 at 5:59 AM, Raymond Honderdors < raymond.honderd...@sizmek.com> wrote: > Does anyone have a good

Re: Spark Metrics Framework?

2016-03-22 Thread Ted Yu
See related thread: http://search-hadoop.com/m/q3RTtuwg442GBwKh On Tue, Mar 22, 2016 at 12:13 PM, Mike Sukmanowsky < mike.sukmanow...@gmail.com> wrote: > The Source class is private >

Re: Serialization issue with Spark

2016-03-22 Thread Ted Yu
Can you show code snippet and the exception for 'Task is not serializable' ? Please see related JIRA: SPARK-10251 whose pull request contains code for registering classes with Kryo. Cheers On Tue, Mar 22, 2016 at 7:00 AM, Hafsa Asif wrote: > Hello, > I am facing
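
A sketch of registering application classes with Kryo; the class names are hypothetical (SPARK-10251's pull request does the same for Spark's own classes):

    import org.apache.spark.SparkConf

    case class MyEvent(id: Long, payload: String)  // hypothetical app classes
    case class MyKey(id: Long)

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyEvent], classOf[MyKey]))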

Re: Issue while applying filters/conditions in DataFrame in Spark

2016-03-22 Thread Ted Yu
The NullPointerEx came from Spring. Which version of Spring do you use ? Thanks > On Mar 22, 2016, at 6:08 AM, Hafsa Asif wrote: > > yes I know it is because of NullPointerEception, but could not understand > why? > The complete stack trace is : > [2016-03-22

Re: Issue while applying filters/conditions in DataFrame in Spark

2016-03-22 Thread Ted Yu
bq. Caused by: java.lang.NullPointerException Can you show the remaining stack trace ? Thanks On Tue, Mar 22, 2016 at 5:43 AM, Hafsa Asif wrote: > Hello everyone, > I am trying to get benefits of DataFrames (to perform all SQL BASED > operations like 'Where Clause',

Re: Unable to run Python Unit tests

2016-03-21 Thread Ted Yu
Can you tell us the commit hash your workspace is based on ? On Mon, Mar 21, 2016 at 8:05 PM, Gayathri Murali < gayathri.m.sof...@gmail.com> wrote: > Hi All, > > I am trying to run ./python/run-tests on my local master branch. I am > getting the following error. I have run this multiple times

Re: Saving model S3

2016-03-21 Thread Ted Yu
gt; > > > > > 2016-03-21 15:24 GMT+02:00 Ted Yu <yuzhih...@gmail.com>: > >> Was speculative execution enabled ? >> >> Thanks >> >> On Mar 21, 2016, at 6:19 AM, Yasemin Kaya <godo...@gmail.com> wrote: >> >> Hi, >> >>

Re: cluster randomly re-starting jobs

2016-03-21 Thread Ted Yu
Can you provide a bit more information ? Release of Spark and YARN. Have you checked Spark UI / YARN job log to see if there is some clue ? Cheers On Mon, Mar 21, 2016 at 6:21 AM, Roberto Pagliari wrote: > I noticed that sometimes the spark cluster seems to restart

Re: Saving model S3

2016-03-21 Thread Ted Yu
Was speculative execution enabled ? Thanks > On Mar 21, 2016, at 6:19 AM, Yasemin Kaya wrote: > > Hi, > > I am using S3 read data also I want to save my model S3. In reading part > there is no error, but when I save model I am getting this error . I tried to > change the

Re: Building spark submodule source code

2016-03-20 Thread Ted Yu
To speed up the build process, take a look at install_zinc() in build/mvn, around line 83. And the following around line 137: # Now that zinc is ensured to be installed, check its status and, if its # not running or just installed, start it FYI On Sun, Mar 20, 2016 at 7:44 PM, Tenghuan He

Re: reading csv file, operation on column or columns

2016-03-20 Thread Ted Yu
g.apache.spark.sql.DataFrame = [Payment date: string, Net: string, > VAT: string, InvoiceNumber: int] > > > HTH > > > > > > > > > > > > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8P

Re: reading csv file, operation on column or columns

2016-03-20 Thread Ted Yu
Please refer to the following methods of DataFrame: def withColumn(colName: String, col: Column): DataFrame = { def drop(colName: String): DataFrame = { On Sun, Mar 20, 2016 at 2:47 PM, Ashok Kumar wrote: > Gurus, > > I would like to read a csv file into a
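
A sketch of both methods on a CSV-derived DataFrame, assuming the string column "Net" from the follow-up:

    import org.apache.spark.sql.functions.col

    // add a typed copy of a column, then drop the original string column:
    val cleaned = df
      .withColumn("NetAmount", col("Net").cast("double"))
      .drop("Net")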

Re: Flume with Spark Streaming Sink

2016-03-20 Thread Ted Yu
$ jar tvf ./external/flume-sink/target/spark-streaming-flume-sink_2.10-1.6.1.jar | grep SparkFlumeProtocol 841 Thu Mar 03 11:09:36 PST 2016 org/apache/spark/streaming/flume/sink/SparkFlumeProtocol$Callback.class 2363 Thu Mar 03 11:09:36 PST 2016

Re: Limit pyspark.daemon threads

2016-03-20 Thread Ted Yu
I took a look at docs/configuration.md Though I didn't find answer for your first question, I think the following pertains to your second question: spark.python.worker.memory 512m Amount of memory to use per python worker process during aggregation, in the same format as JVM

Re: spark-submit reset JVM

2016-03-20 Thread Ted Yu
Not that I know of. Can you be a little more specific on which JVM(s) you want restarted (assuming spark-submit is used to start a second job) ? Thanks On Sun, Mar 20, 2016 at 6:20 AM, Udo Fholl wrote: > Hi all, > > Is there a way for spark-submit to restart the JVM in

Re: Extra libs in executor classpath

2016-03-19 Thread Ted Yu
For your last point, spark-submit has: if [ -z "${SPARK_HOME}" ]; then export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)" fi Meaning the script would determine the proper SPARK_HOME variable. FYI On Wed, Mar 16, 2016 at 4:22 AM, Леонид Поляков wrote: > Hello, guys! > > >

Re: [Spark-1.5.2]Column renaming with withColumnRenamed has no effect

2016-03-19 Thread Ted Yu
Can you give a bit more detail ? Release of Spark; the symptom of the renamed column not being recognized. Please take a look at "withColumnRenamed" test in: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala On Thu, Mar 17, 2016 at 2:02 AM, Divya Gehlot wrote:

Re: Stop spark application when the job is complete.

2016-03-19 Thread Ted Yu
Can you call sc.stop() after indexing into elastic search ? > On Mar 16, 2016, at 9:17 PM, Imre Nagi wrote: > > Hi, > > I have a spark application for batch processing in standalone cluster. The > job is to query the database and then do some transformation,

Re: Checkpoint of DStream joined with RDD

2016-03-19 Thread Ted Yu
I looked at the places in SparkContext.scala where NewHadoopRDD is constrcuted. It seems the Configuration object shouldn't be null. Which hbase release are you using (so that I can see which line the NPE came from) ? Thanks On Fri, Mar 18, 2016 at 8:05 AM, Lubomir Nerad

Re: Error using collectAsMap() in scala

2016-03-19 Thread Ted Yu
It is defined in: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala On Thu, Mar 17, 2016 at 8:55 PM, Shishir Anshuman wrote: > I am using following code snippet in scala: > > > *val dict: RDD[String] = sc.textFile("path/to/csv/file")* > *val
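
collectAsMap lives in PairRDDFunctions, so it only resolves once the RDD[String] is mapped to key/value tuples. A sketch, assuming "key,value" lines:

    val dict = sc.textFile("path/to/csv/file")
    val pairs = dict.map { line =>
      val parts = line.split(",")
      (parts(0), parts(1))
    }
    val lookup = pairs.collectAsMap()  // now the implicit conversion applies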

Re: [Error] : dynamically union All + adding new column

2016-03-19 Thread Ted Yu
bq. .drop("Col9") Could it be due to the above ? On Wed, Mar 16, 2016 at 7:29 PM, Divya Gehlot wrote: > Hi, > I am dynamically doing union all and adding new column too > > val dfresult = >>

Re: Checkpoint of DStream joined with RDD

2016-03-19 Thread Ted Yu
This is the line where NPE came from: if (conf.get(SCAN) != null) { So Configuration instance was null. On Fri, Mar 18, 2016 at 9:58 AM, Lubomir Nerad <lubomir.ne...@oracle.com> wrote: > The HBase version is 1.0.1.1. > > Thanks, > Lubo > > > On 18.3.2016 17:29,
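
A sketch of building the HBase RDD with a non-null Configuration, the precondition the NPE points at; after checkpoint recovery the RDD and its Configuration may need to be re-created rather than restored:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()  // never pass a null Configuration
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // hypothetical table

    val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])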

Re: [Error] : dynamically union All + adding new column

2016-03-19 Thread Ted Yu
It turned out that Col1 appeared twice in the select :-) > On Mar 16, 2016, at 7:29 PM, Divya Gehlot wrote: > > Hi, > I am dynamically doing union all and adding new column too > >> val dfresult = >>

Re: ClassNotFoundException in RDD.map

2016-03-19 Thread Ted Yu
bq. $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1 Do you mind showing more of your code involving the map() ? On Thu, Mar 17, 2016 at 8:32 AM, Dirceu Semighini Filho < dirceu.semigh...@gmail.com> wrote: > Hello, > I found a strange behavior after executing a prediction with MLIB. > My code

Re: SparkContext.stop() takes too long to complete

2016-03-18 Thread Ted Yu
Which version of hadoop do you use ? bq. Requesting to kill executor(s) 1136 Can you find more information on executor 1136 ? Thanks On Fri, Mar 18, 2016 at 4:16 PM, Nezih Yigitbasi < nyigitb...@netflix.com.invalid> wrote: > Hi Spark experts, > I am using Spark 1.5.2 on YARN with dynamic

Re: DistributedLDAModel missing APIs in org.apache.spark.ml

2016-03-18 Thread Ted Yu
Can you utilize this function of DistributedLDAModel ? override protected def getModel: OldLDAModel = oldDistributedModel cheers On Fri, Mar 18, 2016 at 7:34 AM, cindymc wrote: > I like using the new DataFrame APIs on Spark ML, compared to using RDDs in > the

Re: How to add an accumulator for a Set in Spark

2016-03-15 Thread Ted Yu
Please take a look at: core/src/test/scala/org/apache/spark/AccumulatorSuite.scala FYI On Tue, Mar 15, 2016 at 4:29 PM, SRK wrote: > Hi, > > How do I add an accumulator for a Set in Spark? > > Thanks! > > > > -- > View this message in context: >
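
A sketch using the Spark 1.x AccumulableParam API exercised in that suite; reasonable for small sets, not for high-cardinality data:

    import scala.collection.mutable
    import org.apache.spark.AccumulableParam

    implicit object StringSetParam
        extends AccumulableParam[mutable.Set[String], String] {
      def addAccumulator(acc: mutable.Set[String], elem: String) = acc += elem
      def addInPlace(s1: mutable.Set[String], s2: mutable.Set[String]) = s1 ++= s2
      def zero(initial: mutable.Set[String]) = mutable.Set.empty[String]
    }

    val seen = sc.accumulable(mutable.Set.empty[String])
    sc.parallelize(Seq("a", "b", "a")).foreach(x => seen += x)
    // seen.value == Set("a", "b") on the driver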

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
t; > LinkedIn > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich.wordpress.com > > >> On 15 March 2016 at 23:17, Ted Yu <yuzhih...@gmail.com> wrote: >> 1.0 >> ... >> scala >> >>> On Tue, Mar 15

Re: Error building spark app with Maven

2016-03-15 Thread Ted Yu
1.0 ... scala On Tue, Mar 15, 2016 at 4:14 PM, Mich Talebzadeh wrote: > An observation > > Once compiled with MVN the job submit works as follows: > > + /usr/lib/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --packages > com.databricks:spark-csv_2.11:1.3.0 --class

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Ted Yu
Which version of Spark are you using ? Can you show the code snippet w.r.t. broadcast variable ? Thanks On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar wrote: > Hi, > I have a streaming application that takes data from a kafka topic and uses > mapwithstate. > After

Re: Compress individual RDD

2016-03-15 Thread Ted Yu
Looks like there is no such capability yet. How would you specify which rdd's to compress ? Thanks > On Mar 15, 2016, at 4:03 AM, Nirav Patel wrote: > > Hi, > > I see that there's following spark config to compress an RDD. My guess is it > will compress all RDDs of
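
There is no per-RDD compression flag; spark.rdd.compress is global and only touches serialized blocks. A common workaround is to persist just the chosen RDDs at a serialized storage level, so the flag effectively applies to those alone. A sketch:

    import org.apache.spark.storage.StorageLevel

    val big = sc.textFile("hdfs:///data/big")  // hypothetical input
    big.persist(StorageLevel.MEMORY_ONLY_SER)  // compressed iff spark.rdd.compress=true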

Re: Spark streaming - update configuration while retaining write ahead log data?

2016-03-15 Thread Ted Yu
I did a quick search but haven't found JIRA in this regard. If configuration is separate from checkpoint data, more use cases can be accommodated. > On Mar 15, 2016, at 2:21 AM, Saisai Shao wrote: > > Currently configuration is a part of checkpoint data, and when

Re: Building Spark packages with SBTor Maven

2016-03-15 Thread Ted Yu
There're build jobs for both on Jenkins: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/ https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/ You can choose either one. I use mvn. On Tue, Mar 15, 2016 at 3:42 AM, Mich Talebzadeh

Re: Failing MiMa tests

2016-03-14 Thread Ted Yu
Please refer to JIRAs which were related to MiMa e.g. [SPARK-13834][BUILD] Update sbt and sbt plugins for 2.x. It would be easier for other people to help if you provide link to your PR. Cheers On Mon, Mar 14, 2016 at 7:22 PM, Gayathri Murali < gayathri.m.sof...@gmail.com> wrote: > Hi All, > >

Re: Exceptions when accessing Spark metrics with REST API

2016-03-14 Thread Ted Yu
ject.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) > at > org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) > at > org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) > at

Re: Exceptions when accessing Spark metrics with REST API

2016-03-14 Thread Ted Yu
Which Spark release do you use ? For NoSuchElementException, was there anything else in the stack trace ? Thanks On Mon, Mar 14, 2016 at 12:12 PM, Boric Tan wrote: > Hi there, > > I was trying to access application information with REST API. Looks like > the > top

Re: append rows to dataframe

2016-03-14 Thread Ted Yu
Summarizing an offline message: The following worked for Divya: dffiltered = dffiltered.unionAll(dfresult.filter ... On Mon, Mar 14, 2016 at 5:54 AM, Lohith Samaga M wrote: > If all sql results have same set of columns you could UNION all the > dataframes > > Create
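
A sketch of the loop that emerged, with the dfresult / dfFilterSQLs names from the thread (their definitions are hypothetical here):

    // dfFilterSQLs: Seq[String] of dynamic predicates loaded from HBase
    val cols = Seq("Col1", "Col2", "Col3", "Col4", "Col5")
    var dffiltered = dfresult.filter(dfFilterSQLs.head).select(cols.head, cols.tail: _*)
    for (sql <- dfFilterSQLs.tail) {
      dffiltered = dffiltered.unionAll(
        dfresult.filter(sql).select(cols.head, cols.tail: _*))
    }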

Re: Spark 1.6.1 : SPARK-12089 : java.lang.NegativeArraySizeException

2016-03-13 Thread Ted Yu
Here is related code: final int length = totalSize() + neededSize; if (buffer.length < length) { // This will not happen frequently, because the buffer is re-used. final byte[] tmp = new byte[length * 2]; Looks like length was positive (since it was bigger than buffer.length)

Re: append rows to dataframe

2016-03-13 Thread Ted Yu
dffiltered = dffiltered.unionAll(dfresult.filter(dfFilterSQLs(i)).select("Col1","Col2","Col3","Col4","Col5")) FYI On Sun, Mar 13, 2016 at 8:50 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Have you tried unionAll() method of DataFrame ? &g

Re: append rows to dataframe

2016-03-13 Thread Ted Yu
Have you tried unionAll() method of DataFrame ? On Sun, Mar 13, 2016 at 8:44 PM, Divya Gehlot wrote: > Hi, > > Please bear me for asking such a naive question > I have list of conditions (dynamic sqls) sitting in hbase table . > I need to iterate through those dynamic

Re: Spark Job on YARN accessing Hbase Table

2016-03-13 Thread Ted Yu
The backport would be done under HBASE-14160. FYI On Sun, Mar 13, 2016 at 4:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote: > Ted, > > Is there anything in the works or are there tasks already to do the > back-porting? > > Just curious. > > Thanks, > Ben > &

Re: Spark Job on YARN accessing Hbase Table

2016-03-13 Thread Ted Yu
ies > were introduced? If so, I can pulled that version at time and try again. > > Thanks, > Ben > > On Mar 13, 2016, at 12:42 PM, Ted Yu <yuzhih...@gmail.com> wrote: > > Benjamin: > Since hbase-spark is in its own module, you can pull the whole hbase-spark > subtree into

Re: Spark Job on YARN accessing Hbase Table

2016-03-13 Thread Ted Yu
t; By the way, we are using Spark 1.6 if it matters. > > Thanks, > Ben > > On Feb 10, 2016, at 2:34 AM, Ted Yu <yuzhih...@gmail.com> wrote: > > Have you tried adding hbase client jars to spark.executor.extraClassPath ? > > Cheers > > On Wed, Feb 10, 2016 at 12:17 A

Re: NullPointerException

2016-03-12 Thread Ted Yu
Interesting. If kv._1 was null, shouldn't the NPE have come from getPartition() (line 105) ? Was it possible that records.next() returned null ? On Fri, Mar 11, 2016 at 11:20 PM, Prabhu Joseph wrote: > Looking at ExternalSorter.scala line 192, i suspect some input

Re: NullPointerException

2016-03-11 Thread Ted Yu
Which Spark release do you use ? I wonder if the following may have fixed the problem: SPARK-8029 Robust shuffle writer JIRA is down, cannot check now. On Fri, Mar 11, 2016 at 11:01 PM, Saurabh Guru wrote: > I am seeing the following exception in my Spark Cluster every

Re: Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Ted Yu
KafkaLZ4BlockOutputStream is in kafka-clients jar : $ jar tvf kafka-clients-0.8.2.0.jar | grep KafkaLZ4BlockOutputStream 1609 Wed Jan 28 22:30:36 PST 2015 org/apache/kafka/common/message/KafkaLZ4BlockOutputStream$BD.class 2918 Wed Jan 28 22:30:36 PST 2015

Re: Newbie question - Help with runtime error on augmentString

2016-03-11 Thread Ted Yu
Looks like Scala version mismatch. Are you using 2.11 everywhere ? On Fri, Mar 11, 2016 at 10:33 AM, vasu20 wrote: > Hi > > Any help appreciated on this. I am trying to write a Spark program using > IntelliJ. I get a run time error as soon as new SparkConf() is called from

Re: Does Spark support in-memory shuffling?

2016-03-11 Thread Ted Yu
Please take a look at SPARK-3376 and discussion on https://github.com/apache/spark/pull/5403 FYI On Fri, Mar 11, 2016 at 6:37 AM, Xudong Zheng wrote: > Hi all, > > Does Spark support in-memory shuffling now? If not, is there any > consideration for it? > > Thanks! > > -- >

Re: Doubt on data frame

2016-03-11 Thread Ted Yu
temporary tables are associated with SessionState which is used by SQLContext. Did you keep the session ? Cheers On Fri, Mar 11, 2016 at 5:02 AM, ram kumar wrote: > Hi, > > I registered a dataframe as a table using registerTempTable > and I didn't close the Spark
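
A sketch of the session dependence: the temp table name lives in the SQLContext's session, so a different context cannot resolve it.

    val df = sqlContext.read.json("people.json")  // hypothetical input
    df.registerTempTable("people")
    sqlContext.sql("SELECT count(*) FROM people")  // works: same session

    val other = new org.apache.spark.sql.SQLContext(sc)
    // other.sql("SELECT * FROM people")  // fails: Table not found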

Re: does spark needs dedicated machines to run on

2016-03-10 Thread Ted Yu
thod. Everything is working there. > > *pastebin of log:* http://pastebin.com/0LjTWLfm > > > Thanks > Shams > > On Thu, Mar 10, 2016 at 8:11 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Can you provide a bit more information ? >> >> Release of Spark

Re: does spark needs dedicated machines to run on

2016-03-10 Thread Ted Yu
Can you provide a bit more information ? Release of Spark; command for submitting your app; code snippet of your app; pastebin of log. Thanks On Thu, Mar 10, 2016 at 6:32 AM, Shams ul Haque wrote: > Hi, > > I have developed a spark realtime app and started spark-standalone on

Re: How to obtain JavaHBaseContext to connection SparkStreaming with HBase

2016-03-09 Thread Ted Yu
bq. Question is how to get maven repository As you may have noted, version has SNAPSHOT in it. Please checkout latest code from master branch and build it yourself. 2.0 release is still a few months away - though backport of hbase-spark module should come in 1.3 release. On Wed, Mar 9, 2016 at

Re: spark streaming doesn't pick new files from HDFS

2016-03-09 Thread Ted Yu
bq. drwxr-xr-x - tomcat7 supergroup 0 2016-03-09 23:16 /tmp/swg If I read the above line correctly, the size of the file was 0. On Wed, Mar 9, 2016 at 10:00 AM, srimugunthan dhandapani < srimugunthan.dhandap...@gmail.com> wrote: > Hi all > I am working in cloudera CDH5.6 and

Re: binary file deserialization

2016-03-09 Thread Ted Yu
bq. there is a varying number of items for that record If the combination of items is very large, using case class would be tedious. On Wed, Mar 9, 2016 at 9:57 AM, Saurabh Bajaj wrote: > You can load that binary up as a String RDD, then map over that RDD and > convert
