Re: java.lang.NullPointerException while performing rdd.SaveToCassandra

2016-05-02 Thread Ted Yu
nside the < raw >..< /raw > so Text-only > mail clients prune what’s inside. > Anyway here’s the text again. (Inline) > > > On 02-May-2016, at 23:56, Ted Yu <yuzhih...@gmail.com> wrote: > > > > Maybe you were trying to embed pictures for the error and you

Re: java.lang.NullPointerException while performing rdd.SaveToCassandra

2016-05-02 Thread Ted Yu
Maybe you were trying to embed pictures for the error and your code - but they didn't go through. On Mon, May 2, 2016 at 10:32 AM, meson10 wrote: > Hi, > > I am trying to save a RDD to Cassandra but I am running into the following > error: > > > > The Python code looks

Re: SparkSQL with large result size

2016-05-02 Thread Ted Yu
Please consider decreasing block size. Thanks > On May 1, 2016, at 9:19 PM, Buntu Dev wrote: > > I got a 10g limitation on the executors and operating on parquet dataset with > block size 70M with 200 blocks. I keep hitting the memory limits when doing a > 'select * from

Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1:Couldnot transfer artifact org.apache:apache:pom:14 from/to central(https://repo1.maven.org/maven2):repo1.maven.org: unkn

2016-05-01 Thread Ted Yu
er artifact > org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2): > Connect to repo1.maven.org:443 [repo1.maven.org/23.235.47.209] failed: > Connection timed out and 'parent.relativePath' points at wrong local POM @ > line 22, column 11 -> [Help 2] > [ERROR] > [ER

Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1:Could not transfer artifact org.apache:apache:pom:14 from/to central(https://repo1.maven.org/maven2): repo1.maven.org: un

2016-05-01 Thread Ted Yu
s because fail to download this url: > http://maven.twttr.com/org/apache/apache/14/apache-14.pom > > > -- Original Message ------ > *From:* "Ted Yu";<yuzhih...@gmail.com>; > *Sent:* Sunday, May 1, 2016, 9:27 PM > *To:* "sunday2000"<2314476...

Re: Can not import KafkaProducer in spark streaming job

2016-05-01 Thread Ted Yu
According to examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala : import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord} Can you give the command line you used to submit the job ? Probably classpath issue. On Sun, May 1, 2016 at
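
A minimal sketch of putting the Kafka client classes on the classpath at submit time; the main class, jar paths, and master are hypothetical:

    spark-submit \
      --class com.example.KafkaWordCount \
      --master yarn \
      --jars /path/to/kafka-clients-0.8.2.1.jar \
      my-streaming-app.jar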

Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2): repo1.maven.org:

2016-05-01 Thread Ted Yu
bq. Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1 Looks like you were using Spark 1.6.1 Can you check firewall settings ? I saw similar report from Chinese users. Consider using proxy. On Sun, May 1, 2016 at 4:19 AM, sunday2000 <2314476...@qq.com> wrote: > Hi, > We

Re: Dataframe saves for a large set but throws OOM for a small dataset

2016-04-30 Thread Ted Yu
Can you provide a bit more information: Does the smaller dataset have skew ? Which release of Spark are you using ? How much memory did you specify ? Thanks On Sat, Apr 30, 2016 at 1:17 PM, Brandon White wrote: > Hello, > > I am writing to datasets. One dataset is

Re: [2 BUG REPORT] failed to run make-distribution.sh when a older version maven installed in system and run VersionsSuite test hang

2016-04-28 Thread Ted Yu
For #1, have you seen this JIRA ? [SPARK-14867][BUILD] Remove `--force` option in `build/mvn` On Thu, Apr 28, 2016 at 8:27 PM, Demon King wrote: > BUG 1: > I have installed maven 3.0.2 in system, When I using make-distribution.sh > , it seem not use maven 3.2.2 but use

Re: Could not access Spark webUI on OpenStack VMs

2016-04-28 Thread Ted Yu
What happened when you tried to access port 8080 ? Checking iptables settings is good to do. At my employer, we use OpenStack clusters daily and don't encounter much problem - including UI access. Probably some settings should be tuned. On Thu, Apr 28, 2016 at 5:03 AM, Dan Dong

Re: Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-28 Thread Ted Yu
common/*:/usr/local/project/hadoop/share/hadoop/hdfs:/usr/local/project/hadoop/share/hadoop/hdfs/lib/*:/usr/local/project/hadoop/share/hadoop/hdfs/*:/usr/local/project/hadoop/share/hadoop/yarn/lib/*:/usr/local/project/hadoop/share/hadoop/yarn/*:/usr/local/project/hadoop/share/hadoop/mapreduce/lib/*:/

Re: Save DataFrame to HBase

2016-04-28 Thread Ted Yu
e > on how to save data. There is only one for reading/querying data. Will this > be added when the final version does get released? > > Thanks, > Ben > >> On Apr 21, 2016, at 6:56 AM, Ted Yu <yuzhih...@gmail.com> wrote: >> >> The hbase-spark

Re: what should I do when spark ut hang?

2016-04-27 Thread Ted Yu
Did you have a chance to take jstack when VersionsSuite was running ? You can use the following command to run the test: sbt/sbt "test-only org.apache.spark.sql.hive.client.VersionsSuite" On Wed, Apr 27, 2016 at 9:01 PM, Demon King wrote: > Hi, all: >I compile
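
A sketch of capturing a thread dump while the suite hangs; the pid is whatever jps reports for the test JVM:

    jps -lm                                # find the JVM running VersionsSuite
    jstack <pid> > versions-suite.jstack   # take the stack trace for analysis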

Re: Cant join same dataframe twice ?

2016-04-27 Thread Ted Yu
>>> >>> And I got; org.apache.spark.sql.AnalysisException: Reference 'b' is >>> ambiguous, could be: b#6, b#14.; >>> If same case, this message makes sense and this is clear. >>> >>> Thought? >>> >>> // maropu >>> >

Re: error: value toDS is not a member of Seq[Int] SQL

2016-04-27 Thread Ted Yu
Did you do the import as the first comment shows ? > On Apr 27, 2016, at 2:42 AM, shengshanzhang wrote: > > Hi, > > On spark website, there is code as follows showing how to create > datasets. > > > However when i input this line into

Re: Cant join same dataframe twice ?

2016-04-26 Thread Ted Yu
.spark.sql.AnalysisException: Reference 'b' is > ambiguous, could be: b#6, b#14.; > If same case, this message makes sense and this is clear. > > Thought? > > // maropu > > > > > > > > On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla <pras...@slalom.com

Re: Reading from Amazon S3

2016-04-26 Thread Ted Yu
Looking at the cause of the error, it seems hadoop-aws-xx.jar (corresponding to the version of hadoop you use) was missing in classpath. FYI On Tue, Apr 26, 2016 at 9:06 AM, Jinan Alhajjaj wrote: > Hi All, > I am trying to read a file stored in Amazon S3. > I wrote
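
A minimal sketch of pulling in the S3 filesystem classes at submit time; the hadoop-aws version shown is an assumption and must match the Hadoop build in use:

    # --packages resolves the jar (plus its aws-java-sdk dependency) from Maven Central
    spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.2 my-app.jar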

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Ted Yu
Please take a look at: core/src/main/scala/org/apache/spark/SparkContext.scala * Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`, * * then `rdd` contains * {{{ * (a-hdfs-path/part-0, its content) * (a-hdfs-path/part-1, its content) * ... *
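
A minimal sketch of the quoted API; each element of the returned RDD is a (path, content) pair, and the read is distributed across partitions like any other RDD source:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("whole-text-files"))
    // (filePath, fileContent) pairs; many small files are packed into each partition
    val rdd = sc.wholeTextFiles("hdfs://a-hdfs-path")
    rdd.mapValues(_.length).collect().foreach(println)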

Re: Cant join same dataframe twice ?

2016-04-25 Thread Ted Yu
Can you show us the structure of df2 and df3 ? Thanks On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot wrote: > Hi, > I am using Spark 1.5.2 . > I have a use case where I need to join the same dataframe twice on two > different columns. > I am getting error missing

Re: reduceByKey as Action or Transformation

2016-04-25 Thread Ted Yu
Can you show a snippet of your code which demonstrates what you observed ? Thanks On Mon, Apr 25, 2016 at 8:38 AM, Weiping Qu wrote: > Thanks. > I read that from the specification. > I thought the way people distinguish actions and transformations depends > on whether

Re: next on empty iterator though i used hasNext

2016-04-25 Thread Ted Yu
Can you show more of your code inside the while loop ? Which version of Spark / Kinesis do you use ? Thanks On Mon, Apr 25, 2016 at 4:04 AM, Selvam Raman wrote: > I am reading a data from Kinesis stream (merging shard values with union > stream) to spark streaming. then

Re: Using Aggregate and group by on spark Dataset api

2016-04-24 Thread Ted Yu
Have you taken a look at: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala On Sun, Apr 24, 2016 at 8:18 AM, coder wrote: > JavaRDD prdd = sc.textFile("c:\\fls\\people.txt").map( > new Function() { > public

Re: Spark Streaming Job get killed after running for about 1 hour

2016-04-24 Thread Ted Yu
Which version of Spark are you using ? How did you increase the open file limit ? Which operating system do you use ? Please see Example 6. ulimit Settings on Ubuntu under: http://hbase.apache.org/book.html#basic.prerequisites On Sun, Apr 24, 2016 at 2:34 AM, fanooos
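
A sketch of the Ubuntu ulimit change the HBase guide describes; the user name and limits are illustrative values:

    # /etc/security/limits.conf
    hadoop  -  nofile  32768
    hadoop  -  nproc   32000
    # re-login, then verify with: ulimit -n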

Re: Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-23 Thread Ted Yu
Can you check that the DFSClient Spark uses is the same version as on the server side ? The client and server (NameNode) negotiate a "crypto protocol version" - this is a forward-looking feature. Please note: bq. Client provided: [] Meaning client didn't provide any supported crypto protocol

Re: Using saveAsNewAPIHadoopDataset for Saving custom classes to Hbase

2016-04-22 Thread Ted Yu
Which hbase release are you using ? Below is the write method from hbase 1.1 : public void write(KEY key, Mutation value) throws IOException { if (!(value instanceof Put) && !(value instanceof Delete)) { throw new IOException("Pass a Delete or a Put"); }
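
A minimal sketch of satisfying that check: every value handed to the table output format must be a Put or a Delete. The row key, column family, and qualifier below are assumptions:

    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.util.Bytes

    // Build a Put per record before calling saveAsNewAPIHadoopDataset
    def toPut(rowKey: String, value: String): Put = {
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      put
    }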

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
The class is private : final class OffsetRange private( On Fri, Apr 22, 2016 at 4:08 PM, Mich Talebzadeh wrote: > Ok I decided to forgo that approach and use an existing program of mine > with slight modification. The code is this > > import

Re: How this unit test passed on master trunk?

2016-04-22 Thread Ted Yu
This was added by Xiao through: [SPARK-13320][SQL] Support Star in CreateStruct/CreateArray and Error Handling when DataFrame/DataSet Functions using Star I tried in spark-shell and got: scala> val first = structDf.groupBy($"a").agg(min(struct($"record.*"))).first() first:

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
Marcelo: From yesterday's thread, Mich revealed that he was looking at: https://github.com/agsachin/spark/blob/CEP/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala which references SparkFunSuite. In an earlier thread, Mich was asking about CEP. Just

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
scala:654: > not found: type WindowState > [error] def deep(in: Tick, ew: WindowState): Boolean = { > [error]^ > [error] > /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:660: > not found: type WindowState > [error]

Re: Custom Log4j layout on YARN = ClassNotFoundException

2016-04-22 Thread Ted Yu
There is not much in the body of the email. Can you elaborate on what issue you encountered ? Thanks On Fri, Apr 22, 2016 at 2:27 AM, Rowson, Andrew G. (TR Technology & Ops) < andrew.row...@thomsonreuters.com> wrote: > > > > This e-mail is for the sole use of the

Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
Normally Logging would be included in spark-shell session since spark-core jar is imported by default: scala> import org.apache.spark.internal.Logging import org.apache.spark.internal.Logging See this JIRA: [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging In

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
dOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 21 April 2016 at 20:24, Ted Yu <yuzhih...@gmail.com> wrote: > >> Plug in 1.5.1 for your jars: >> >> $ jar tvf ./core/target/s

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
> 3982 Wed Sep 23 23:34:26 BST 2015 org/apache/spark/SparkFunSuite.class >> >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJ

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > >

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Ted Yu
In KafkaWordCount , the String is sent back and producer.send() is called. I guess if you don't find via solution in your current design, you can consider the above. On Thu, Apr 21, 2016 at 10:04 AM, Alexander Gallego wrote: > Hello, > > I understand that you cannot
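
A common workaround, sketched here with an assumed broker address and topic, is to create the non-serializable producer inside foreachPartition so it never crosses the driver/executor boundary:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new java.util.Properties()
        props.put("bootstrap.servers", "broker:9092")  // assumed address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new org.apache.kafka.clients.producer.KafkaProducer[String, String](props)
        records.foreach { r =>
          producer.send(new org.apache.kafka.clients.producer.ProducerRecord[String, String]("topic", r))
        }
        producer.close()  // one producer per partition, closed after use
      }
    }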

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
e.spark > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmi

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
ome/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar,/home/hduser/jars/scalatest_2.11-2.2.6.jar' >> >> >> scala> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll} >> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll} >> >> >

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
ced an extra leading comma after '--jars' in your email. Not sure if that matters. On Thu, Apr 21, 2016 at 8:39 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Mich: > > $ jar tvf > /home/hbase/.m2/repository/org/scalatest/scalatest_2.11/2.2.6/scalatest_2.11-2.2.6.jar > | grep BeforeA

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
Mich: $ jar tvf /home/hbase/.m2/repository/org/scalatest/scalatest_2.11/2.2.6/scalatest_2.11-2.2.6.jar | grep BeforeAndAfter 4257 Sat Dec 26 14:35:48 PST 2015 org/scalatest/BeforeAndAfter$class.class 2602 Sat Dec 26 14:35:48 PST 2015 org/scalatest/BeforeAndAfter.class 1998 Sat Dec 26

Re: Save DataFrame to HBase

2016-04-21 Thread Ted Yu
The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can do this. On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim wrote: > Has anyone found an easy way to save a DataFrame into HBase? > > Thanks, > Ben > > >

Re: RDD generated from Dataframes

2016-04-21 Thread Ted Yu
In upcoming 2.0 release, the signature for map() has become: def map[U : Encoder](func: T => U): Dataset[U] = withTypedPlan { Note: DataFrame and DataSet are unified in 2.0 FYI On Thu, Apr 21, 2016 at 6:49 AM, Apurva Nandan wrote: > Hello everyone, > > Generally
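
A minimal sketch against the 2.0 API, where Dataset.map needs an implicit Encoder for the result type:

    val spark = org.apache.spark.sql.SparkSession.builder().appName("ds-map").getOrCreate()
    import spark.implicits._  // supplies Encoders for common types

    val ds = Seq(1, 2, 3).toDS()
    val doubled = ds.map(_ * 2)  // Encoder[Int] resolved implicitly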

Re: StructField Translation Error with Spark SQL

2016-04-21 Thread Ted Yu
> > Can't translate null value for field > StructField(density,DecimalType(4,2),true) > On Apr 21, 2016 1:37 AM, "Ted Yu" <yuzhih...@gmail.com> wrote: > >> The weight field is not nullable. >> >> Looks like your source table had null value for this fi

Re: StructField Translation Error with Spark SQL

2016-04-20 Thread Ted Yu
The weight field is not nullable. Looks like your source table had null value for this field. On Wed, Apr 20, 2016 at 4:11 PM, Charles Nnamdi Akalugwu < cprenzb...@gmail.com> wrote: > Hi, > > I am using spark 1.4.1 and trying to copy all rows from a table in one > MySQL Database to a Amazon RDS

Re: Invoking SparkR from Spark shell

2016-04-20 Thread Ted Yu
Please take a look at: https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes On Wed, Apr 20, 2016 at 9:50 AM, Ashok Kumar wrote: > Hi, > > I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R > with Spark. > > Is there a shell

Re: 回复:Spark sql and hive into different result with same sql

2016-04-20 Thread Ted Yu
Do you mind trying out build from master branch ? 1.5.3 is a bit old. On Wed, Apr 20, 2016 at 5:25 AM, FangFang Chen wrote: > I found spark sql lost precision, and handle data as int with some rule. > Following is data got via hive shell and spark sql, with same sql

Re: Why very small work load cause GC overhead limit?

2016-04-19 Thread Ted Yu
Can you tell us the memory parameters you used ? If you can capture jmap before the GC limit was exceeded, that would give us more clue. Thanks > On Apr 19, 2016, at 7:40 PM, "kramer2...@126.com" wrote: > > Hi All > > I use spark doing some calculation. > The situation

Re: Spark streaming batch time displayed is not current system time but it is processing current messages

2016-04-19 Thread Ted Yu
Using http://www.ruddwire.com/handy-code/date-to-millisecond-calculators/#.VxZh3iMrKuo , 1460823008000 is shown to be 'Sat Apr 16 2016 09:10:08 GMT-0700' Can you clarify the 4 day difference ? bq. 'right now April 14th' The date of your email was Apr 16th. On Sat, Apr 16, 2016 at 9:39 AM,

Re: hbaseAdmin tableExists create catalogTracker for every call

2016-04-19 Thread Ted Yu
The CatalogTracker object may not be used by all the methods of HBaseAdmin. Meaning, when HBaseAdmin is constructed, we don't need CatalogTracker. On Tue, Apr 19, 2016 at 6:09 AM, WangYQ wrote: > in hbase 0.98.10, class "HBaseAdmin" > line 303, method

Re: Logging in executors

2016-04-18 Thread Ted Yu
ly not working, at least for logging configuration. > > Thanks, > -carlos. > > On Fri, Apr 15, 2016 at 3:28 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> See this thread: http://search-hadoop.com/m/q3RTtsFrd61q291j1 >> >> On Fri, Apr 15, 2016 at 5:38 AM,

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
of the 1.6.1 artifacts to that S3 bucket, so hopefully everything should be > working now. Let me know if you still encounter any problems with > unarchiving. > > On Sat, Apr 16, 2016 at 3:10 PM Ted Yu <yuzhih...@gmail.com> wrote: > >> Pardon me - there is no tarball for hado

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
I tried changing the URL to > https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz > and I get a NoSuchKey error. > > Should I just go with it even though it says hadoop2.6? > > On Sat, Apr 16, 2016 at 5:37 PM, Ted Yu <yuzhih...@gmail.com> wrote: >

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
Apr 16, 2016 at 2:14 PM, Ted Yu <yuzhih...@gmail.com> wrote: > From the output you posted: > --- > Unpacking Spark > > gzip: stdin: not in gzip format > tar: Child returned status 1 > tar: Error is not recoverable: exiting now > --- > > The artifac

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
From the output you posted: --- Unpacking Spark gzip: stdin: not in gzip format tar: Child returned status 1 tar: Error is not recoverable: exiting now --- The artifact for spark-1.6.1-bin-hadoop2.6 is corrupt. This problem has been reported in other threads. Try

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-16 Thread Ted Yu
Kevin: Can you describe how you got over the Metadata fetch exception ? > On Apr 16, 2016, at 9:41 AM, Kevin Eid wrote: > > One last email to announce that I've fixed all of the issues. Don't hesitate > to contact me if you encounter the same. I'd be happy to help. > >

Re: Apache Flink

2016-04-16 Thread Ted Yu
Looks like this question is more relevant on flink mailing list :-) On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh wrote: > Hi, > > Has anyone used Apache Flink instead of Spark by any chance > > I am interested in its set of libraries for Complex Event Processing.

Re: ERROR [main] client.ConnectionManager$HConnectionImplementation: The node /hbase is not in ZooKeeper.

2016-04-16 Thread Ted Yu
Please send query to user@hbase. This is the default value: zookeeper.znode.parent = /hbase. Looks like the hbase-site.xml accessible on your client didn't have an up-to-date value for zookeeper.znode.parent. Please make sure hbase-site.xml with proper config is on the classpath. On Sat, Apr 16,
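
A sketch of the client-side setting; /hbase is the default, and the value must match what the server uses:

    <!-- hbase-site.xml on the client classpath -->
    <property>
      <name>zookeeper.znode.parent</name>
      <value>/hbase</value>
    </property>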

Re: Logging in executors

2016-04-15 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtsFrd61q291j1 On Fri, Apr 15, 2016 at 5:38 AM, Carlos Rojas Matas wrote: > Hi guys, > > any clue on this? Clearly the > spark.executor.extraJavaOpts=-Dlog4j.configuration is not working on the > executors. > > Thanks, >

Re: How to stop hivecontext

2016-04-15 Thread Ted Yu
You can call the stop() method. > On Apr 15, 2016, at 5:21 AM, ram kumar wrote: > > Hi, > I started hivecontext as, > > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc); > > I want to stop this sql context > > Thanks

Re: When did Spark started supporting ORC and Parquet?

2016-04-14 Thread Ted Yu
For Parquet, please take a look at SPARK-1251 For ORC, not sure. Looking at git history, I found ORC mentioned by SPARK-1368 FYI On Thu, Apr 14, 2016 at 6:53 PM, Edmon Begoli wrote: > I am needing this fact for the research paper I am writing right now. > > When did Spark

Re: Error with --files

2016-04-14 Thread Ted Yu
bq. localtest.txt#appSees.txt Which file did you want to pass ? Thanks On Thu, Apr 14, 2016 at 2:14 PM, Benjamin Zaitlen wrote: > Hi All, > > I'm trying to use the --files option with yarn: > > spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files >>
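
For reference, the src#alias form uploads the local file and exposes it to the application under the alias; a sketch assuming the paths from the question:

    spark-submit --master yarn-cluster \
      --files /home/ubuntu/localtest.txt#appSees.txt \
      /home/ubuntu/test_spark.py
    # on the executors the file is then readable as ./appSees.txt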

Re: Spark Yarn closing sparkContext

2016-04-14 Thread Ted Yu
Can you pastebin the failure message ? Did you happen to take jstack during the close ? Which Hadoop version do you use ? Thanks > On Apr 14, 2016, at 5:53 AM, nihed mbarek wrote: > > Hi, > I have an issue with closing my application context, the process take a long >

Re: Streaming WriteAheadLogBasedBlockHandler disallows parellism via StorageLevel replication factor

2016-04-13 Thread Ted Yu
w.r.t. the effective storage level log, here is the JIRA which introduced it: [SPARK-4671][Streaming]Do not replicate streaming block when WAL is enabled On Wed, Apr 13, 2016 at 7:43 AM, Patrick McGloin wrote: > Hi all, > > If I am using a Custom Receiver with

Re: Logging in executors

2016-04-13 Thread Ted Yu
bq. --conf "spark.executor.extraJavaOptions=-Dlog4j. configuration=env/dev/log4j-driver.properties" I think the above may have a typo : you refer to log4j-driver.properties in both arguments. FYI On Wed, Apr 13, 2016 at 8:09 AM, Carlos Rojas Matas wrote: > Hi guys, > >

Re: Old hostname pops up while running Spark app

2016-04-12 Thread Ted Yu
FYI https://documentation.cpanel.net/display/CKB/How+To+Clear+Your+DNS+Cache#HowToClearYourDNSCache-MacOS ®10.10 https://www.whatsmydns.net/flush-dns.html#linux On Tue, Apr 12, 2016 at 2:44 PM, Bibudh Lahiri wrote: > Hi, > > I am trying to run a piece of code with

Re: JavaRDD with custom class?

2016-04-12 Thread Ted Yu
You can find various examples involving Serializable Java POJO e.g. ./examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java Please pastebin some details on 'Task not serializable error' Thanks On Tue, Apr 12, 2016 at 12:44 PM, Daniel Valdivia wrote:

Re: [spark] build/sbt gen-idea error

2016-04-12 Thread Ted Yu
See https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup On Tue, Apr 12, 2016 at 8:52 AM, ImMr.K <875061...@qq.com> wrote: > But how to import spark repo into idea or eclipse? > > > > -- Original Message ---------

Re: build/sbt gen-idea error

2016-04-12 Thread Ted Yu
gen-idea doesn't seem to be a valid command: [warn] Ignoring load failure: no project loaded. [error] Not a valid command: gen-idea [error] gen-idea On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K <875061...@qq.com> wrote: > Hi, > I have cloned spark and , > cd spark > build/sbt gen-idea > > got the

Re: Is storage resources counted during the scheduling

2016-04-11 Thread Ted Yu
See https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application On Mon, Apr 11, 2016 at 3:15 PM, Jialin Liu wrote: > Hi Spark users/experts, > > I’m wondering how does the Spark scheduler work? > What kind of resources will be considered during the

Re: Read JSON in Dataframe and query

2016-04-11 Thread Ted Yu
Please take a look at sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala Cheers On Mon, Apr 11, 2016 at 12:13 PM, Radhakrishnan Iyer < radhakrishnan.i...@citiustech.com> wrote: > Hi all, > > > > I am new to Spark. > > I have a json in below format : > >
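
A minimal sketch of the 1.x JSON path, assuming one JSON object per line plus a hypothetical input path and field name:

    val df = sqlContext.read.json("hdfs:///data/input.json")
    df.printSchema()  // schema is inferred from the data
    df.registerTempTable("records")
    sqlContext.sql("SELECT * FROM records WHERE someField = 'x'").show()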

Re: Hello !

2016-04-11 Thread Ted Yu
For SparkR, please refer to https://spark.apache.org/docs/latest/sparkr.html bq. on Ubuntu or CentOS Both platforms are supported. On Mon, Apr 11, 2016 at 1:08 PM, wrote: > Dear Experts , > > I am posting this for your information. I am a newbie to spark. > I am

Re: Weird error while serialization

2016-04-10 Thread Ted Yu
map(lambda x : x.rsplit('\t',1)).map(lambda x : > [x[0],getRows(x[1])]).cache()\ > .groupBy(lambda x : x[0].split('\t')[1]).mapValues(lambda x : > list(x)).cache() > > text1.count() > > Thanks and Regards, > Suraj Sheth > > On Sun, Apr 10, 2016 at 1:19 AM, Ted Yu <

Re: Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-10 Thread Ted Yu
llecting only TaskEnd Events. > > I can do the event wise summation for couple of runs and get back to you. > > > > Thanks, > > Jasmine > > > > *From:* Ted Yu [mailto:yuzhih...@gmail.com] > *Sent:* Thursday, April 07, 2016 1:43 PM > *To:* JasmineGeorge >

Re: Datasets combineByKey

2016-04-10 Thread Ted Yu
Haven't found any JIRA w.r.t. combineByKey for Dataset. What's your use case ? Thanks On Sat, Apr 9, 2016 at 7:38 PM, Amit Sela wrote: > Is there (planned ?) a combineByKey support for Dataset ? > Is / Will there be a support for combiner lifting ? > > Thanks, > Amit >
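
For comparison, a sketch of the RDD-level combineByKey the question is modeled on, with assumed key and value types:

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val sums = pairs.combineByKey(
      (v: Int) => v,                  // createCombiner
      (acc: Int, v: Int) => acc + v,  // mergeValue
      (a: Int, b: Int) => a + b)      // mergeCombiners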

Re: Graphframes pattern causing java heap space errors

2016-04-10 Thread Ted Yu
Looks like the exception occurred on driver. Consider increasing the values for the following config: conf.set("spark.driver.memory", "10240m") conf.set("spark.driver.maxResultSize", "2g") Cheers On Sat, Apr 9, 2016 at 9:02 PM, Buntu Dev wrote: > I'm running it via

Re: Weird error while serialization

2016-04-09 Thread Ted Yu
The value was out of the range of integer. Which Spark release are you using ? Can you post snippet of code which can reproduce the error ? Thanks On Sat, Apr 9, 2016 at 12:25 PM, SURAJ SHETH wrote: > I am trying to perform some processing and cache and count the RDD. >

Re: Unable run Spark in YARN mode

2016-04-09 Thread Ted Yu
mahesh : bq. :16: error: not found: value sqlContext Please take a look at: https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext for how the import should be used. Please include version of Spark and the commandline you used in the reply.

Re: DataFrame job fails on parsing error, help?

2016-04-08 Thread Ted Yu
roupBy and a count in > pyspark.sql on a Spark DataFrame. > > Any ideas? > > Nicolas > > On Fri, Apr 8, 2016 at 1:13 PM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Did you encounter similar error on a smaller dataset ? >> >> Which release of Spark are you us

Re: DataFrame job fails on parsing error, help?

2016-04-08 Thread Ted Yu
Did you encounter similar error on a smaller dataset ? Which release of Spark are you using ? Is it possible you have an incompatible snappy version somewhere in your classpath ? Thanks On Fri, Apr 8, 2016 at 12:36 PM, entee wrote: > I'm trying to do a relatively large

Re: How to configure parquet.block.size on Spark 1.6

2016-04-08 Thread Ted Yu
I searched 1.6.1 code base but didn't find how this can be configured (within Spark). On Fri, Apr 8, 2016 at 9:01 AM, nihed mbarek wrote: > Hi > How to configure parquet.block.size on Spark 1.6 ? > > Thank you > Nihed MBAREK > > > -- > > M'BAREK Med Nihed, > Fedora
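
One commonly suggested workaround is to set the key on the underlying Hadoop configuration; a sketch under the assumption that the Parquet output format picks it up from there, not verified against 1.6:

    // 128 MB row groups
    sc.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)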

Re: can not join dataset with itself

2016-04-08 Thread Ted Yu
Looks like you're using Spark 1.6.x What error(s) did you get for the first two joins ? Thanks On Fri, Apr 8, 2016 at 3:53 AM, JH P wrote: > Hi. I want a dataset join with itself. So i tried below codes. > > 1. newGnsDS.joinWith(newGnsDS, $"dataType”) > > 2.

Re: Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-07 Thread Ted Yu
Which Spark release are you using ? Have you registered to all the events provided by SparkListener ? If so, can you do event-wise summation of execution time ? Thanks On Thu, Apr 7, 2016 at 11:03 AM, JasmineGeorge wrote: > We are running a batch job with the following

Re: building kafka project on intellij Help is much appreciated

2016-04-07 Thread Ted Yu
This is the version of Kafka Spark depends on: [INFO] +- org.apache.kafka:kafka_2.10:jar:0.8.2.1:compile On Thu, Apr 7, 2016 at 9:14 AM, Haroon Rasheed wrote: > Try removing libraryDependencies += "org.apache.kafka" %% "kafka" % "1.6.0" > compile. I guess the internal

Re: how to query the number of running executors?

2016-04-06 Thread Ted Yu
Have you looked at SparkListener ? /** * Called when the driver registers a new executor. */ def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit /** * Called when the driver removes an executor. */ def onExecutorRemoved(executorRemoved:
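
A minimal sketch of tracking live executors with those callbacks:

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

    class ExecutorCounter extends SparkListener {
      val live = new AtomicInteger(0)
      override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit = live.incrementAndGet()
      override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit = live.decrementAndGet()
    }
    // register on the driver: sc.addSparkListener(new ExecutorCounter)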

Re: [Yarn] Spark AMs dead lock

2016-04-06 Thread Ted Yu
Which hadoop release are you using ? bq. yarn cluster with 2GB RAM I assume 2GB is per node. Isn't this too low for your use case ? Cheers On Wed, Apr 6, 2016 at 9:19 AM, Peter Rudenko wrote: > Hi i have a situation, say i have a yarn cluster with 2GB RAM. I'm >

Re: dataframe sorting and find the index of the maximum element

2016-04-05 Thread Ted Yu
The error was due to REPL expecting an integer (index to the Array) whereas "MAX(count)" was a String. What do you want to achieve ? On Tue, Apr 5, 2016 at 4:17 AM, Angel Angel wrote: > Hello, > > i am writing one spark application i which i need the index of the
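
A sketch of one way to get the row holding the maximum, sorting instead of indexing the aggregate by name; the column name is an assumption:

    import org.apache.spark.sql.functions._
    val top = df.orderBy(desc("count")).first()  // Row with the largest count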

Re: dataframe sorting and find the index of the maximum element

2016-04-05 Thread Ted Yu
Did you define idxmax() method yourself ? Thanks On Tue, Apr 5, 2016 at 4:17 AM, Angel Angel wrote: > Hello, > > i am writing one spark application i which i need the index of the maximum > element. > > My table has one column only and i want the index of the maximum

Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-04 Thread Ted Yu
bq. I'm on version 2.10 for spark The above is Scala version. Can you give us the Spark version ? Thanks On Mon, Apr 4, 2016 at 2:36 PM, mpawashe wrote: > Hi all, > > I am using Spark Streaming API (I'm on version 2.10 for spark and > streaming), and I am running into a

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Ted Yu
bq. the modifications do not touch the scheduler If the changes can be ported over to 1.6.1, do you mind reproducing the issue there ? I ask because master branch changes very fast. It would be good to narrow the scope where the behavior you observed started showing. On Mon, Apr 4, 2016 at 6:12

Re: multiple splits fails

2016-04-03 Thread Ted Yu
>* > > > > http://talebzadehmich.wordpress.com > > > > On 3 April 2016 at 18:05, Ted Yu <yuzhih...@gmail.com> wrote: > >> Mich: >> See the following method of DStream: >> >>* Print the first num elements of each RDD gen

Re: multiple splits fails

2016-04-03 Thread Ted Yu
ot a member of >> org.apache.spark.streaming.dstream.DStream[(String, Int)] >> val v = lines.filter(_.contains("ASE 15")).filter(_ >> contains("UPDATE INDEX STATISTICS")).flatMap(line => >> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ + >>

Re: multiple splits fails

2016-04-03 Thread Ted Yu
refer to the content of the stream here? > > Thanks > > > > > > > > > > > // > // Now want to do some analysis on the same text file > // > > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id

Re: multiple splits fails

2016-04-03 Thread Ted Yu
bq. split"\t," splits the filter by carriage return Minor correction: "\t" denotes tab character. On Sun, Apr 3, 2016 at 7:24 AM, Eliran Bivas wrote: > Hi Mich, > > 1. The first underscore in your filter call is refering to a line in the > file (as textFile() results in a

Re: Multiple lookups; consolidate result and run further aggregations

2016-04-02 Thread Ted Yu
Looking at the implementation for lookup in PairRDDFunctions, I think your understanding is correct. On Sat, Apr 2, 2016 at 3:16 AM, Nirav Patel wrote: > I will start by question: Is spark lookup function on pair rdd is a driver > action. ie result is returned to driver?

Re: Scala: Perform Unit Testing in spark

2016-04-02 Thread Ted Yu
ranshu...@gmail.com> wrote: > When I added *"org.apache.spark" % "spark-core_2.10" % "1.6.0", *it > should include spark-core_2.10-1.6.1-tests.jar. > Why do I need to use the jar file explicitly? > > And how do I use the jars for compiling with *

Re: Problem with jackson lib running on spark

2016-04-01 Thread Ted Yu
Thanks for sharing the workaround. Probably send a PR on tranquilizer github :-) On Fri, Apr 1, 2016 at 12:50 PM, Marcelo Oikawa wrote: > Hi, list. > > Just to close the thread. Unfortunately, I didnt solve the jackson lib > problem but I did a workaround that

Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Ted Yu
ot;spark-core_2.10" % >> "1.6.0", "org.apache.spark" % "spark-mllib_2.10" % "1.6.0" )* > > > > > On Sat, Apr 2, 2016 at 2:21 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> Assuming your code is written in Scala, I would suggest u

Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Ted Yu
Assuming your code is written in Scala, I would suggest using ScalaTest. Please take a look at the XXSuite.scala files under mllib/ On Fri, Apr 1, 2016 at 1:31 PM, Shishir Anshuman wrote: > Hello, > > I have a code written in scala using Mllib. I want to perform unit
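
A minimal sketch of the suggested ScalaTest layout, assuming a local-mode SparkContext per suite:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    class MyMllibSuite extends FunSuite with BeforeAndAfterAll {
      @transient var sc: SparkContext = _

      override def beforeAll(): Unit = {
        sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
      }
      override def afterAll(): Unit = if (sc != null) sc.stop()

      test("sum of 1 to 3") {
        assert(sc.parallelize(1 to 3).sum() === 6.0)
      }
    }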

Re: Where to set properties for the retainedJobs/Stages?

2016-04-01 Thread Ted Yu
so do I have to set them for the history-server? The daemon? The workers? > > And what if I use the java API instead of spark-submit for the jobs? > > I guess that the spark-defaults.conf are obsolete for the java API? > > > Am 2016-04-01 18:58, schrieb Ted Yu: > >&g

Re: OutOfMemory with wide (289 column) dataframe

2016-04-01 Thread Ted Yu
bq. This was a big help! The email (maybe only addressed to you) didn't come with your latest reply. Do you mind sharing it ? Thanks On Fri, Apr 1, 2016 at 11:37 AM, ludflu wrote: > This was a big help! For the benefit of my fellow travelers running spark > on > EMR: > > I

Re: Where to set properties for the retainedJobs/Stages?

2016-04-01 Thread Ted Yu
You can set them in spark-defaults.conf See also https://spark.apache.org/docs/latest/configuration.html#spark-ui On Fri, Apr 1, 2016 at 8:26 AM, Max Schmidt wrote: > Can somebody tell me the interaction between the properties: > > spark.ui.retainedJobs >
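
A sketch of the corresponding entries in conf/spark-defaults.conf; the values are illustrative:

    spark.ui.retainedJobs    1000
    spark.ui.retainedStages  1000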
