Re: java.lang.NullPointerException while performing rdd.SaveToCassandra

2016-05-02 Thread Ted Yu
Adding back user@spark.

Since the top of the stack trace is in DataStax classes, I suggest posting on
their mailing list.
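
For reference, a minimal sketch of the equivalent write through the Scala
spark-cassandra-connector API is below (the contact point is a placeholder,
and the test.dummy layout with an int key and a text value is assumed from the
sample row later in this thread). Listing the columns explicitly sometimes
surfaces mapping problems that otherwise appear as opaque NPEs during row
conversion.

import com.datastax.spark.connector._            // brings saveToCassandra into scope
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("save-to-cassandra-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder contact point
val sc = new SparkContext(conf)

// Rows matching the sample [{'key': 3, 'value': 'foobar'}]
val rows = sc.parallelize(Seq((3, "foobar"), (4, "baz")))

// Tuples map positionally onto the listed columns.
rows.saveToCassandra("test", "dummy", SomeColumns("key", "value"))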

On Mon, May 2, 2016 at 11:29 AM, Piyush Verma <piy...@piyushverma.net>
wrote:

> Hmm weird. They show up on the Web interface.
>
> Wait, got it. It's wrapped up inside the < raw >..< /raw > markers, so text-only
> mail clients prune what's inside.
> Anyway, here's the text again. (Inline)
>
> > On 02-May-2016, at 23:56, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > Maybe you were trying to embed pictures for the error and your code -
> but they didn't go through.
> >
> > On Mon, May 2, 2016 at 10:32 AM, meson10 <sp...@piyushverma.net> wrote:
> > Hi,
> >
> > I am trying to save a RDD to Cassandra but I am running into the
> following
> > error:
>
> [{'key': 3, 'value': 'foobar'}]
>
> [Stage 9:>  (0 +
> 2) / 2]
> [Stage 9:=> (1 +
> 1) / 2]WARN  2016-05-02 17:23:55,240
> org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 9.0 (TID
> 11, 10.0.6.200): java.lang.NullPointerException
> at com.datastax.bdp.spark.python.RDDPythonFunctions.com
> $datastax$bdp$spark$python$RDDPythonFunctions$$toCassandraRow(RDDPythonFunctions.scala:57)
> at
> com.datastax.bdp.spark.python.RDDPythonFunctions$$anonfun$toCassandraRows$1.apply(RDDPythonFunctions.scala:73)
> at
> com.datastax.bdp.spark.python.RDDPythonFunctions$$anonfun$toCassandraRows$1.apply(RDDPythonFunctions.scala:73)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at
> com.datastax.spark.connector.util.CountingIterator.next(CountingIterator.scala:16)
> at
> com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
> at
> com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at
> com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
> at
> com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:155)
> at
> com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:139)
> at
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
> at
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
> at
> com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
> at
> com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
> at
> com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:139)
> at
> com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
> at
> com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> ERROR 2016-05-02 17:23:55,406 org.apache.spark.scheduler.TaskSetManager:
> Task 1 in stage 9.0 failed 4 times; aborting job
> Traceback (most recent call last):
>   File "/home/ubuntu/test-spark.py", line 50, in 
> main()
>   File "/home/ubuntu/test-spark.py", line 47, in main
> runner.run()
>   File "/home/ubuntu/spark_common.py", line 62, in run
> self.save_logs_to_cassandra()
>   File "/home/ubuntu/spark_common.py", line 142, in save_logs_to_cassandra
> rdd.saveToCassandra(keyspace, tablename)
>   File "/usr/share/dse/spark/python/lib/pyspark.zip/pyspark/rdd.py", line
> 2313, in saveToCassandra
>   File
> "/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File
> "/usr/share/dse/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o149.saveToCassandra.
> : org.apache.spark.SparkExcep

Re: java.lang.NullPointerException while performing rdd.SaveToCassandra

2016-05-02 Thread Ted Yu
Maybe you were trying to embed pictures for the error and your code - but
they didn't go through.

On Mon, May 2, 2016 at 10:32 AM, meson10  wrote:

> Hi,
>
> I am trying to save a RDD to Cassandra but I am running into the following
> error:
>
>
>
> The Python code looks like this:
>
>
> I am using DSE 4.8.6 which runs Spark 1.4.2
>
> I ran through a bunch of existing posts on this mailing lists and have
> already performed the following routines:
>
>  * Ensured that there is no redundant Cassandra .jar lying around,
> interfering with the process.
>  * Wiped clean and reinstalled DSE to ensure that.
>  * Tried loading data from Cassandra to ensure that Spark <-> Cassandra
> communication is working. I used print
> self.context.cassandraTable(keyspace='test', table='dummy').collect() to
> validate that.
>  * Ensured there are no null values in the dataset being written.
>  * Verified the keyspace and the table exist in Cassandra using cassandra@cqlsh>
> SELECT * from test.dummy;
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-NullPointerException-while-performing-rdd-SaveToCassandra-tp26862.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkSQL with large result size

2016-05-02 Thread Ted Yu
Please consider decreasing block size. 

Thanks

> On May 1, 2016, at 9:19 PM, Buntu Dev  wrote:
> 
> I have a 10g memory limit on the executors and am operating on a Parquet dataset with
> block size 70M and 200 blocks. I keep hitting the memory limits when doing a
> 'select * from t1 order by c1 limit 1000000' (i.e., 1M rows). It works if I limit to,
> say, 100k. What are the options for saving a large result set without running into
> memory issues?
> 
> Thanks!
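
One way to sidestep collecting a million sorted rows onto the driver is to
keep the result distributed and write it back out; a minimal sketch, assuming
t1 is registered as a table and the output path is illustrative:

// The sorted/limited result stays distributed; writing it avoids the
// driver-side memory pressure of collect()/show().
val result = sqlContext.sql("SELECT * FROM t1 ORDER BY c1 LIMIT 1000000")
result.write.parquet("hdfs:///tmp/t1_top_1m")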

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1:Couldnot transfer artifact org.apache:apache:pom:14 from/to central(https://repo1.maven.org/maven2):repo1.maven.org: unkn

2016-05-01 Thread Ted Yu
ct org.apache.spark:spark-parent_2.10:1.6.1
> (/data/spark/spark-1.6.1/pom.xml) has 1 error
> [ERROR] Non-resolvable parent POM for
> org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
> Connect to repo1.maven.org:443 [repo1.maven.org/23.235.47.209] failed:
> Connection timed out and 'parent.relativePath' points at wrong local POM @
> line 22, column 11 -> [Help 2]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
> [ERROR] [Help 2]
> http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException
>
>
> -- Original Message --
> *From:* "Ted Yu";<yuzhih...@gmail.com>;
> *Sent:* May 1, 2016 (Sunday), 9:50 PM
> *To:* "sunday2000"<2314476...@qq.com>;
> *Cc:* "user"<user@spark.apache.org>;
> *Subject:* Re: Why Non-resolvable parent POM for
> org.apache.spark:spark-parent_2.10:1.6.1:Couldnot transfer artifact
> org.apache:apache:pom:14 from/to central(
> https://repo1.maven.org/maven2):repo1.maven.org: unknown error
>
> FYI
>
> Accessing the link below gave me 'Page does not exist'
>
> I am in California.
>
> I checked the dependency tree of 1.6.1 - I didn't see such dependence.
>
> Can you pastebin related maven output ?
>
> Thanks
>
> On Sun, May 1, 2016 at 6:32 AM, sunday2000 <2314476...@qq.com> wrote:
>
>> Seems it is because fail to download this url:
>> http://maven.twttr.com/org/apache/apache/14/apache-14.pom
>>
>>
>> -- Original Message --
>> *From:* "Ted Yu";<yuzhih...@gmail.com>;
>> *Sent:* May 1, 2016 (Sunday), 9:27 PM
>> *To:* "sunday2000"<2314476...@qq.com>;
>> *Cc:* "user"<user@spark.apache.org>;
>> *Subject:* Re: Why Non-resolvable parent POM for
>> org.apache.spark:spark-parent_2.10:1.6.1:Could not transfer artifact
>> org.apache:apache:pom:14 from/to central(https://repo1.maven.org/maven2):
>> repo1.maven.org: unknown error
>>
>> bq. Non-resolvable parent POM for org.apache.spark:spark-parent_
>> 2.10:1.6.1
>>
>> Looks like you were using Spark 1.6.1
>>
>> Can you check firewall settings ?
>>
>> I saw similar report from Chinese users.
>>
>> Consider using proxy.
>>
>> On Sun, May 1, 2016 at 4:19 AM, sunday2000 <2314476...@qq.com> wrote:
>>
>>> Hi,
>>>   We are compiling Spark 1.6.0 on a Linux server and are getting this
>>> error message. Could you tell us how to solve it? Thanks.
>>>
>>> [INFO] Scanning for projects...
>>> Downloading:
>>> https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom
>>> Downloading:
>>> https://repository.apache.org/content/repositories/releases/org/apache/apache/14/apache-14.pom
>>> Downloading:
>>> https://repository.jboss.org/nexus/content/repositories/releases/org/apache/apache/14/apache-14.pom
>>> Downloading:
>>> https://repo.eclipse.org/content/repositories/paho-releases/org/apache/apache/14/apache-14.pom
>>> Downloading:
>>> https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/apache/14/apache-14.pom
>>> Downloading:
>>> https://oss.sonatype.org/content/repositories/orgspark-project-1113/org/apache/apache/14/apache-14.pom
>>> Downloading:
>>> http://repository.mapr.com/maven/org/apache/apache/14/apache-14.pom
>>> Downloading: http://maven.twttr.com/org/apache/apache/14/apache-14.pom
>>> [ERROR] [ERROR] Some problems were encountered while processing the POMs:
>>> [FATAL] Non-resolvable parent POM for
>>> org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact
>>> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
>>> repo1.maven.org: unknown error and 'parent.relativePath' points at
>>> wrong local POM @ line 22, column 11
>>>  @
>>> [ERROR] The build could not read 1 project -> [Help 1]
>>> [ERROR]
>>> [ERROR]   The project org.apache.spark:spark-parent_2.10:1.6.1
>>> (/data/spark/spark-1.6.1/pom.xml) has 1 error
>>> [ERROR] Non-resolvable parent POM for
>>> org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact
>>> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
>>> repo1.maven.org: unknown error and 'parent.relativePath' points at
>>> wrong local POM @ line 22, column 11: Unknown host repo1.maven.org:
>>> unknown error -> [Help 2]
>>> [ERROR]
>>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
>>> -e switch.
>>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>>> [ERROR]
>>> [ERROR] For more information about the errors and possible solutions,
>>> please read the following articles:
>>> [ERROR] [Help 1]
>>> http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
>>> [ERROR] [Help 2]
>>> http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException
>>>
>>
>>
>


Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1:Could not transfer artifact org.apache:apache:pom:14 from/to central(https://repo1.maven.org/maven2): repo1.maven.org: un

2016-05-01 Thread Ted Yu
FYI

Accessing the link below gave me 'Page does not exist'

I am in California.

I checked the dependency tree of 1.6.1 - I didn't see such dependence.

Can you pastebin related maven output ?

Thanks

On Sun, May 1, 2016 at 6:32 AM, sunday2000 <2314476...@qq.com> wrote:

> Seems it is because fail to download this url:
> http://maven.twttr.com/org/apache/apache/14/apache-14.pom
>
>
> -- Original Message --
> *From:* "Ted Yu";<yuzhih...@gmail.com>;
> *Sent:* May 1, 2016 (Sunday), 9:27 PM
> *To:* "sunday2000"<2314476...@qq.com>;
> *Cc:* "user"<user@spark.apache.org>;
> *Subject:* Re: Why Non-resolvable parent POM for
> org.apache.spark:spark-parent_2.10:1.6.1:Could not transfer artifact
> org.apache:apache:pom:14 from/to central(https://repo1.maven.org/maven2):
> repo1.maven.org: unknown error
>
> bq. Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1
>
> Looks like you were using Spark 1.6.1
>
> Can you check firewall settings ?
>
> I saw similar report from Chinese users.
>
> Consider using proxy.
>
> On Sun, May 1, 2016 at 4:19 AM, sunday2000 <2314476...@qq.com> wrote:
>
>> Hi,
>>   We are compiling Spark 1.6.0 on a Linux server and are getting this
>> error message. Could you tell us how to solve it? Thanks.
>>
>> [INFO] Scanning for projects...
>> Downloading:
>> https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom
>> Downloading:
>> https://repository.apache.org/content/repositories/releases/org/apache/apache/14/apache-14.pom
>> Downloading:
>> https://repository.jboss.org/nexus/content/repositories/releases/org/apache/apache/14/apache-14.pom
>> Downloading:
>> https://repo.eclipse.org/content/repositories/paho-releases/org/apache/apache/14/apache-14.pom
>> Downloading:
>> https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/apache/14/apache-14.pom
>> Downloading:
>> https://oss.sonatype.org/content/repositories/orgspark-project-1113/org/apache/apache/14/apache-14.pom
>> Downloading:
>> http://repository.mapr.com/maven/org/apache/apache/14/apache-14.pom
>> Downloading: http://maven.twttr.com/org/apache/apache/14/apache-14.pom
>> [ERROR] [ERROR] Some problems were encountered while processing the POMs:
>> [FATAL] Non-resolvable parent POM for
>> org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact
>> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
>> repo1.maven.org: unknown error and 'parent.relativePath' points at wrong
>> local POM @ line 22, column 11
>>  @
>> [ERROR] The build could not read 1 project -> [Help 1]
>> [ERROR]
>> [ERROR]   The project org.apache.spark:spark-parent_2.10:1.6.1
>> (/data/spark/spark-1.6.1/pom.xml) has 1 error
>> [ERROR] Non-resolvable parent POM for
>> org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact
>> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
>> repo1.maven.org: unknown error and 'parent.relativePath' points at wrong
>> local POM @ line 22, column 11: Unknown host repo1.maven.org: unknown
>> error -> [Help 2]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
>> -e switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions,
>> please read the following articles:
>> [ERROR] [Help 1]
>> http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
>> [ERROR] [Help 2]
>> http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException
>>
>
>


Re: Can not import KafkaProducer in spark streaming job

2016-05-01 Thread Ted Yu
According to
examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala
:

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig,
ProducerRecord}

Can you give the command line you used to submit the job ?

Probably classpath issue.
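
For comparison, the producer setup that the referenced KafkaWordCount example
relies on looks roughly like the sketch below (the broker address and topic
are hypothetical). Whichever jar provides org.apache.kafka.clients.producer
has to be on the classpath of the submitted job itself, not only in the
interactive shells.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

// Minimal new-producer configuration; the serializers are mandatory settings.
val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("some-topic", "hello"))
producer.close()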

On Sun, May 1, 2016 at 5:11 AM, fanooos  wrote:

> I have a very strange problem.
>
> I wrote a Spark Streaming job that monitors an HDFS directory, reads the
> newly
> added files, and sends the contents to Kafka.
>
> The job is written in python and you can got the code from this link
>
> http://pastebin.com/mpKkMkph
>
> When submitting the job, I got this error:
>
> *ImportError: cannot import name KafkaProducer*
>
> As you see, the error is very simple but the problem is that I could import
> the KafkaProducer from both python and pyspark shells without any problem.
>
> I tried to reboot the machine but the situation remains the same.
>
> What do you think the problem is?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-import-KafkaProducer-in-spark-streaming-job-tp26857.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Why Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2): repo1.maven.org:

2016-05-01 Thread Ted Yu
bq. Non-resolvable parent POM for org.apache.spark:spark-parent_2.10:1.6.1

Looks like you were using Spark 1.6.1

Can you check firewall settings ?

I saw similar report from Chinese users.

Consider using proxy.

On Sun, May 1, 2016 at 4:19 AM, sunday2000 <2314476...@qq.com> wrote:

> Hi,
>   We are compiling Spark 1.6.0 on a Linux server and are getting this error
> message. Could you tell us how to solve it? Thanks.
>
> [INFO] Scanning for projects...
> Downloading:
> https://repo1.maven.org/maven2/org/apache/apache/14/apache-14.pom
> Downloading:
> https://repository.apache.org/content/repositories/releases/org/apache/apache/14/apache-14.pom
> Downloading:
> https://repository.jboss.org/nexus/content/repositories/releases/org/apache/apache/14/apache-14.pom
> Downloading:
> https://repo.eclipse.org/content/repositories/paho-releases/org/apache/apache/14/apache-14.pom
> Downloading:
> https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/apache/14/apache-14.pom
> Downloading:
> https://oss.sonatype.org/content/repositories/orgspark-project-1113/org/apache/apache/14/apache-14.pom
> Downloading:
> http://repository.mapr.com/maven/org/apache/apache/14/apache-14.pom
> Downloading: http://maven.twttr.com/org/apache/apache/14/apache-14.pom
> [ERROR] [ERROR] Some problems were encountered while processing the POMs:
> [FATAL] Non-resolvable parent POM for
> org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
> repo1.maven.org: unknown error and 'parent.relativePath' points at wrong
> local POM @ line 22, column 11
>  @
> [ERROR] The build could not read 1 project -> [Help 1]
> [ERROR]
> [ERROR]   The project org.apache.spark:spark-parent_2.10:1.6.1
> (/data/spark/spark-1.6.1/pom.xml) has 1 error
> [ERROR] Non-resolvable parent POM for
> org.apache.spark:spark-parent_2.10:1.6.1: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
> repo1.maven.org: unknown error and 'parent.relativePath' points at wrong
> local POM @ line 22, column 11: Unknown host repo1.maven.org: unknown
> error -> [Help 2]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/ProjectBuildingException
> [ERROR] [Help 2]
> http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException
>


Re: Dataframe saves for a large set but throws OOM for a small dataset

2016-04-30 Thread Ted Yu
Can you provide a bit more information:

Does the smaller dataset have skew ?

Which release of Spark are you using ?

How much memory did you specify ?

Thanks

On Sat, Apr 30, 2016 at 1:17 PM, Brandon White 
wrote:

> Hello,
>
> I am writing two datasets. One dataset is 2x larger than the other. Both
> datasets are written to Parquet the exact same way using
>
> df.write.mode("Overwrite").parquet(outputFolder)
>
> The smaller dataset OOMs while the larger dataset writes perfectly fine.
> Here is the stack trace: Any ideas what is going on here and how I can fix
> it?
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2367)
> at
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
> at
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
> at java.lang.StringBuilder.append(StringBuilder.java:132)
> at scala.StringContext.standardInterpolator(StringContext.scala:123)
> at scala.StringContext.s(StringContext.scala:90)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toString(SQLContext.scala:947)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)
>


Re: [2 BUG REPORT] failed to run make-distribution.sh when a older version maven installed in system and run VersionsSuite test hang

2016-04-28 Thread Ted Yu
For #1, have you seen this JIRA ?

[SPARK-14867][BUILD] Remove `--force` option in `build/mvn`

On Thu, Apr 28, 2016 at 8:27 PM, Demon King  wrote:

> BUG 1:
> I have installed Maven 3.0.2 on the system. When I use make-distribution.sh,
> it does not seem to use Maven 3.2.2 but uses /usr/local/bin/mvn to build Spark. So
> I added the --force option in make-distribution.sh like this:
>
> line 130:
> VERSION=$("$MVN" --force help:evaluate -Dexpression=project.version $@
> 2>/dev/null | grep -v "INFO" | tail -n 1)
> SCALA_VERSION=$("$MVN" --force help:evaluate
> -Dexpression=scala.binary.version $@ 2>/dev/null\
> | grep -v "INFO"\
> | tail -n 1)
> SPARK_HADOOP_VERSION=$("$MVN" --force help:evaluate
> -Dexpression=hadoop.version $@ 2>/dev/null\
> | grep -v "INFO"\
> | tail -n 1)
> SPARK_HIVE=$("$MVN" --force help:evaluate
> -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
> | grep -v "INFO"\
> | fgrep --count "hive";\
> # Reset exit status to 0, otherwise the script stops here if the last
> grep finds nothing\
> # because we use "set -o pipefail"
> echo -n)
>
> line 170:
> BUILD_COMMAND=("$MVN" --force clean package -DskipTests $@)
>
> That forces Spark to use build/mvn and solves the problem.
>
> BUG 2:
>
> When I run the unit test VersionsSuite, it hangs for one night or
> more. I used jstack and lsof and found that it tries to send an HTTP request. That
> does not seem like a good idea when running tests on an unreliable network.
>
> I used jstack and finally found out the reason:
>
>   java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> - locked <0x0007440224d8> (a java.io.BufferedInputStream)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1323)
> - locked <0x000744022530> (a
> sun.net.www.protocol.http.HttpURLConnection)
> at
> java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
> ...
>
> and I use lsof:
>
> java32082 user 247u  IPv4 527001934   TCP 8.8.8.8:33233 (LISTEN)
> java32082 user  267u  IPv4 527001979   TCP 8.8.8.8:52301 (LISTEN)
> java32082 user  316u  IPv4 527001999   TCP *:51993 (LISTEN)
> java32082 user  521u  IPv4 527111590   TCP 8.8.8.8:53286
> ->butan141.server4you.de:http (ESTABLISHED)
>
> This test suite tries to connect to butan141.server4you.de. The process will
> hang when the network is unreliable.
>
>
>


Re: Could not access Spark webUI on OpenStack VMs

2016-04-28 Thread Ted Yu
What happened when you tried to access port 8080 ?

Checking iptables settings is good to do.

At my employer, we use OpenStack clusters daily and don't encounter many
problems - including with UI access.
Probably some settings should be tuned.

On Thu, Apr 28, 2016 at 5:03 AM, Dan Dong  wrote:

> Hi, all,
>   I'm having problem to access the web UI of my Spark cluster. The cluster
> is composed of a few virtual machines running on a OpenStack platform. The
> VMs are launched from CentOS7.0 server image available from official site.
> The Spark itself runs well and master and worker process are all up and
> running, and run SparkPi example also get expected result. So, the question
> is, how to debug such a problem? Should it be a native problem of the
> CentOS image as it is not a desktop version, so some graphic packages might
> be missing in the VM?
> Or it is a iptables settings problem comes from OpenStack, as Openstack
> configures complex network inside it and it might block certain
> communication?
> Does anybody find similar problems? Any hints will be appreciated. Thanks!
>
> Cheers,
> Dong
>
>


Re: Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-28 Thread Ted Yu
Interesting.

The phoenix dependency wasn't shown in the classpath of your previous email.

On Thu, Apr 28, 2016 at 4:12 AM, pierre lacave <pie...@lacave.me> wrote:

> Narrowed it down to some version incompatibility with Phoenix 4.7.
>
> Including $SPARK_HOME/lib/phoenix-4.7.0-HBase-1.1-client-spark.jar in
> extraClassPath is what triggers the issue above.
>
> I'll have a go at adding the individual dependencies as opposed to this
> fat jar and see how it goes.
>
> Thanks
>
>
> On Thu, Apr 28, 2016 at 10:52 AM, pierre lacave <pie...@lacave.me> wrote:
>
>> Thanks Ted,
>>
>> I am actually using the hadoop free version of spark
>> (spark-1.5.0-bin-without-hadoop) over hadoop 2.6.1, so could very well be
>> related indeed.
>>
>> I have configured spark-env.sh with export
>> SPARK_DIST_CLASSPATH=$($HADOOP_PREFIX/bin/hadoop classpath), which is the
>> only version of hadoop on the system (2.6.1) and able to interface with
>> hdfs (on no secured zones)
>>
>> Interestingly running this in the repl works fine
>>
>> // Create a simple DataFrame, stored into a partition directory
>> val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
>> df1.write.parquet("/securedzone/test")
>>
>>
>> but if packaged as an app and run in local or yarn client/cluster mode,
>> it fails with the error described.
>>
>> I am not including anything Hadoop-specific, so I am not sure where the difference
>> in DFSClient could come from.
>>
>> [info] Loading project definition from
>> /Users/zoidberg/Documents/demo/x/trunk/src/jobs/project
>> [info] Set current project to root (in build
>> file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/)
>> [info] Updating
>> {file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/}common...
>> [info] com.demo.project:root_2.10:0.2.3 [S]
>> [info] com.demo.project:common_2.10:0.2.3 [S]
>> [info]   +-joda-time:joda-time:2.8.2
>> [info]
>> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
>> [info] Done updating.
>> [info] Updating
>> {file:/Users/zoidberg/Documents/demo/x/trunk/src/jobs/}extract...
>> [info] Resolving org.fusesource.jansi#jansi;1.4 ...
>> [info] Done updating.
>> [info] com.demo.project:extract_2.10:0.2.3 [S]
>> [info]   +-com.demo.project:common_2.10:0.2.3 [S]
>> [info]   | +-joda-time:joda-time:2.8.2
>> [info]   |
>> [info]   +-com.databricks:spark-csv_2.10:1.3.0 [S]
>> [info] +-com.univocity:univocity-parsers:1.5.1
>> [info] +-org.apache.commons:commons-csv:1.1
>> [info]
>> [success] Total time: 9 s, completed 28-Apr-2016 10:40:25
>>
>>
>> I am assuming I do not need to rebuild Spark to use it with Hadoop 2.6.1,
>> and that the Spark build with user-provided Hadoop would let me do that.
>>
>>
>> $HADOOP_PREFIX/bin/hadoop classpath expands to:
>>
>>
>> /usr/local/project/hadoop/conf:/usr/local/project/hadoop/share/hadoop/common/lib/*:/usr/local/project/hadoop/share/hadoop/common/*:/usr/local/project/hadoop/share/hadoop/hdfs:/usr/local/project/hadoop/share/hadoop/hdfs/lib/*:/usr/local/project/hadoop/share/hadoop/hdfs/*:/usr/local/project/hadoop/share/hadoop/yarn/lib/*:/usr/local/project/hadoop/share/hadoop/yarn/*:/usr/local/project/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/project/hadoop/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar
>>
>> Thanks
>>
>>
>> On Sun, Apr 24, 2016 at 2:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Can you check that the DFSClient Spark uses is the same version as on
>>> the server side ?
>>>
>>> The client and server (NameNode) negotiate a "crypto protocol version" -
>>> this is a forward-looking feature.
>>> Please note:
>>>
>>> bq. Client provided: []
>>>
>>> Meaning client didn't provide any supported crypto protocol version.
>>>
>>> Cheers
>>>
>>> On Wed, Apr 20, 2016 at 3:27 AM, pierre lacave <pie...@lacave.me> wrote:
>>>
>>>> Hi
>>>>
>>>>
>>>> I am trying to use Spark to write to a protected zone in HDFS. I am able
>>>> to create and list files using the hdfs client, but when writing via Spark I
>>>> get this exception.
>>>>
>>>> I could not find any mention of CryptoProtocolVersion in the spark doc.
>>>>
>>>>
>>>> Any idea what could have gone wrong?
>>>>
>>>>
>>>> spark (1.5.0), hadoop (2.6.1)
>>>>
>>>>

Re: Save DataFrame to HBase

2016-04-28 Thread Ted Yu
The HBase 2.0 release would likely come after the Spark 2.0 release.

There are other features being developed in HBase 2.0, and
I am not sure when HBase 2.0 will be released.

The refguide is incomplete.
Zhan has assigned the doc JIRA to himself. The documentation will be done
after fixing bugs in the hbase-spark module.

Cheers

> On Apr 27, 2016, at 10:31 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> 
> Hi Ted,
> 
> Do you know when the release will be? I also see some documentation for usage 
> of the hbase-spark module at the hbase website. But, I don’t see an example 
> on how to save data. There is only one for reading/querying data. Will this 
> be added when the final version does get released?
> 
> Thanks,
> Ben
> 
>> On Apr 21, 2016, at 6:56 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> 
>> The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can 
>> do this.
>> 
>>> On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> Has anyone found an easy way to save a DataFrame into HBase?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
> 


Re: what should I do when spark ut hang?

2016-04-27 Thread Ted Yu
Did you have a chance to take jstack when VersionsSuite was running ?

You can use the following command to run the test:

sbt/sbt "test-only org.apache.spark.sql.hive.client.VersionsSuite"

On Wed, Apr 27, 2016 at 9:01 PM, Demon King  wrote:

> Hi, all:
> I compiled spark-1.6.1 on Red Hat 5.7 (I have installed
> spark-1.6.0-cdh5.7.0, hive-1.1.0+cdh5.7.0 and hadoop-2.6.0+cdh5.7.0 on this
> machine). My compile command is:
>
> build/mvn --force -Psparkr -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.6.0-cdh5.7.0 -Phive -Phive-thriftserver
>
>  When I use this command to run the unit tests, the tests hang for a whole
> night (nearly 10 hours):
>
>  build/mvn --force -Psparkr -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.6.0-cdh5.7.0 -Phive -Phive-thriftserver  test
>
>  That is the last log:
>
> Discovery completed in 25 seconds, 803 milliseconds.
> Run starting. Expected test count is: 1716
> StatisticsSuite:
> - parse analyze commands
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> log4j:WARN No appenders could be found for logger (hive.log).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> Warning: fs.defaultFS is not set when running "chgrp" command.
> Warning: fs.defaultFS is not set when running "chmod" command.
> - analyze MetastoreRelations
> - estimates the size of a test MetastoreRelation
> - auto converts to broadcast hash join, by size estimate of a relation
> - auto converts to broadcast left semi join, by size estimate of a relation
> VersionsSuite:
>
>
>  I
> modified
> ./sql/hive/target/scala-2.10/test-classes/org/apache/spark/sql/hive/client/VersionsSuite.class
> and added a print statement at the beginning of each function, but no message
> was printed. What should I do now to solve this problem?
>
>  Thank you!
>
>


Re: Cant join same dataframe twice ?

2016-04-27 Thread Ted Yu
I wonder if Spark can provide better support for this case.

The following schema is not user friendly (shown previously):

StructField(b,IntegerType,false), StructField(b,IntegerType,false)

Except for 'select *', there is no way for the user to query either of the two
fields.
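
One workaround, besides the renaming suggested below, is to alias the two
DataFrames and qualify the columns; a minimal sketch reusing the example data
(assumes spark-shell, i.e. sqlContext.implicits._ and
org.apache.spark.sql.functions.col are in scope):

val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")

// Aliasing keeps both 'b' columns addressable after the join.
val joined = df1.as("l").join(df2.as("r"), col("l.a") === col("r.a"))
  .select(col("l.a"), col("l.b").as("l_b"), col("r.b").as("r_b"))

joined.printSchema()   // no duplicate, unqualified 'b' left in the schema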

On Tue, Apr 26, 2016 at 10:17 PM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Based on my example, how about renaming columns?
>
> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
> val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"),
> df2("b").as("2-b"))
> val df4 = df3.join(df2, df3("2-b") === df2("b"))
>
> // maropu
>
> On Wed, Apr 27, 2016 at 1:58 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>> Correct, Takeshi.
>> I am facing the same issue.
>>
>> How to avoid the ambiguity ?
>>
>>
>> On 27 April 2016 at 11:54, Takeshi Yamamuro <linguin@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I tried;
>>> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>>> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>>> val df3 = df1.join(df2, "a")
>>> val df4 = df3.join(df2, "b")
>>>
>>> And I got: org.apache.spark.sql.AnalysisException: Reference 'b' is
>>> ambiguous, could be: b#6, b#14.;
>>> If it is the same case, this message makes sense and the cause is clear.
>>>
>>> Thoughts?
>>>
>>> // maropu
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla <pras...@slalom.com>
>>> wrote:
>>>
>>>> Also, check the column names of df1 ( after joining df2 and df3 ).
>>>>
>>>> Prasad.
>>>>
>>>> From: Ted Yu
>>>> Date: Monday, April 25, 2016 at 8:35 PM
>>>> To: Divya Gehlot
>>>> Cc: "user @spark"
>>>> Subject: Re: Cant join same dataframe twice ?
>>>>
>>>> Can you show us the structure of df2 and df3 ?
>>>>
>>>> Thanks
>>>>
>>>> On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot <divya.htco...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> I am using Spark 1.5.2 .
>>>>> I have a use case where I need to join the same dataframe twice on two
>>>>> different columns.
>>>>> I am getting a missing-columns error.
>>>>>
>>>>> For instance,
>>>>> val df1 = df2.join(df3,"Column1")
>>>>> The line below throws the missing-columns error:
>>>>> val df4 = df1.join(df3,"Column2")
>>>>> Is this a bug or a valid scenario?
>>>>> Is the bug or valid scenario ?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Divya
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: error: value toDS is not a member of Seq[Int] SQL

2016-04-27 Thread Ted Yu
Did you do the import as the first comment shows ?
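
In case it helps, the usual missing piece is the implicits import; a minimal
spark-shell sketch (Spark 1.6):

// toDS() is added to Seq via the SQLContext implicits, so the import is required.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val ds = Seq(1, 2, 3).toDS()
ds.show()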

> On Apr 27, 2016, at 2:42 AM, shengshanzhang  wrote:
> 
> Hi,
> 
>   On the Spark website, there is code as follows showing how to create
> datasets.
>  
> 
>   However, when I input this line into spark-shell, I get an error.
> Can anyone tell me why, and how to fix this?
> 
> scala> val ds = Seq(1, 2, 3).toDS()
> :35: error: value toDS is not a member of Seq[Int]
>val ds = Seq(1, 2, 3).toDS()
> 
> 
> 
>   Thank you a lot!


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Cant join same dataframe twice ?

2016-04-26 Thread Ted Yu
The ambiguity came from:

scala> df3.schema
res0: org.apache.spark.sql.types.StructType =
StructType(StructField(a,IntegerType,false),
StructField(b,IntegerType,false), StructField(b,IntegerType,false))

On Tue, Apr 26, 2016 at 8:54 PM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Hi,
>
> I tried;
> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
> val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
> val df3 = df1.join(df2, "a")
> val df4 = df3.join(df2, "b")
>
> And I got: org.apache.spark.sql.AnalysisException: Reference 'b' is
> ambiguous, could be: b#6, b#14.;
> If it is the same case, this message makes sense and the cause is clear.
>
> Thoughts?
>
> // maropu
>
>
>
>
>
>
>
> On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla <pras...@slalom.com>
> wrote:
>
>> Also, check the column names of df1 ( after joining df2 and df3 ).
>>
>> Prasad.
>>
>> From: Ted Yu
>> Date: Monday, April 25, 2016 at 8:35 PM
>> To: Divya Gehlot
>> Cc: "user @spark"
>> Subject: Re: Cant join same dataframe twice ?
>>
>> Can you show us the structure of df2 and df3 ?
>>
>> Thanks
>>
>> On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot <divya.htco...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> I am using Spark 1.5.2 .
>>> I have a use case where I need to join the same dataframe twice on two
>>> different columns.
>>> I am getting a missing-columns error.
>>>
>>> For instance,
>>> val df1 = df2.join(df3,"Column1")
>>> The line below throws the missing-columns error:
>>> val df4 = df1.join(df3,"Column2")
>>> Is this a bug or a valid scenario?
>>> Is the bug or valid scenario ?
>>>
>>>
>>>
>>>
>>> Thanks,
>>> Divya
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Reading from Amazon S3

2016-04-26 Thread Ted Yu
Looking at the cause of the error, it seems hadoop-aws-xx.jar
(corresponding to the version of Hadoop you use) was missing from the classpath.

FYI
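
If the job is built with sbt, one way to pull the missing pieces in is to add
the hadoop-aws artifact (which brings the S3 filesystem classes and the AWS
SDK transitively) to the build; the versions below are placeholders and should
match the cluster's Hadoop version. The same two coordinates work as Maven
dependencies for a Java build.

// build.sbt fragment (sketch; adjust versions to your environment)
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"  % "1.6.1" % "provided",
  "org.apache.hadoop" %  "hadoop-aws" % "2.6.0"
)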

On Tue, Apr 26, 2016 at 9:06 AM, Jinan Alhajjaj 
wrote:

> Hi All,
> I am trying to read a file stored in Amazon S3.
> I wrote this code:
>
> import java.util.List;
>
> import java.util.Scanner;
>
> import org.apache.spark.SparkConf;
>
> import org.apache.spark.api.java.JavaRDD;
>
> import org.apache.spark.api.java.JavaSparkContext;
>
> import org.apache.spark.api.java.function.Function;
>
> import org.apache.spark.sql.DataFrame;
>
> import org.apache.spark.sql.Row;
>
> import org.apache.spark.sql.SQLContext;
>
> public class WordAnalysis {
>
> public static void main(String[] args) {
>
> int startYear=0;
>
> int endyear=0;
>
> Scanner input = new Scanner(System.in);
>
> System.out.println("Please, Enter 1 if you want 1599-2008 or enter 2
> for specific range: ");
>
> int choice=input.nextInt();
>
>
>
> if(choice==1)
>
> {
>
> startYear=1500;
>
> endyear=2008;
>
> }
>
> if(choice==2)
>
> {
>
> System.out.print("please,Enter the start year : ");
>
> startYear = input.nextInt();
>
> System.out.print("please,Enter the end year : ");
>
> endyear = input.nextInt();
>
> }
>
> SparkConf conf = new SparkConf().setAppName("jinantry").setMaster("local"
> );
>
> JavaSparkContext spark = new JavaSparkContext(conf);
>
> SQLContext sqlContext = new org.apache.spark.sql.SQLContext(spark);
>
> JavaRDD<Items> ngram = spark.textFile(
> "s3n://google-books-ngram/1gram/googlebooks-eng-all-1gram-20120701-x.gz‏")
>
> .map(new Function<String, Items>() {
>
> public Items call(String line) throws Exception {
>
> String[] parts = line.split("\t");
>
> Items item = new Items();
>
> if (parts.length == 4) {
>
> item.setWord(parts[0]);
>
> item.setYear(Integer.parseInt(parts[1]));
>
> item.setCount(Integer.parseInt(parts[2]));
>
> item.setVolume(Integer.parseInt(parts[3]));
>
> return item;
>
> } else {
>
> item.setWord(" ");
>
> item.setYear(Integer.parseInt(" "));
>
> item.setCount(Integer.parseInt(" "));
>
> item.setVolume(Integer.parseInt(" "));
>
> return item;
>
> }
>
> }
>
> });
>
> DataFrame schemangram = sqlContext.createDataFrame(ngram, Items.class);
>
> schemangram.registerTempTable("ngram");
>
> String sql="SELECT word,SUM(count) FROM ngram where year >="+startYear+"
> AND year<="+endyear+" And word LIKE '%_NOUN' GROUP BY word ORDER BY
> SUM(count) DESC";
>
> DataFrame matchyear = sqlContext.sql(sql);
>
> List<Row> words = matchyear.collectAsList();
>
> int i=1;
>
> for (Row scholar : words) {
>
> System.out.println(scholar);
>
> if(i==10)
>
> break;
>
> i++;
>
>   }
>
>
> }
>
>
> }
>
>
> When I run it, this error appears:
>
> Exception in thread "main"
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute,
> tree:
>
> Exchange rangepartitioning(aggOrder#5L DESC,200), None
>
> +- ConvertToSafe
>
>+- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0 as
> bigint)),mode=Final,isDistinct=false)], output=[word#2,_c1#4L,aggOrder#5L])
>
>   +- TungstenExchange hashpartitioning(word#2,200), None
>
>  +- TungstenAggregate(key=[word#2], functions=[(sum(cast(count#0
> as bigint)),mode=Partial,isDistinct=false)], output=[word#2,sum#8L])
>
> +- Project [word#2,count#0]
>
>+- Filter (((year#3 >= 1500) && (year#3 <= 1600)) && word#2
> LIKE %_NOUN)
>
>   +- Scan ExistingRDD[count#0,volume#1,word#2,year#3]
>
>
> at
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>
> at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>
> at
> org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>
> at org.apache.spark.sql.execution.Sort.doExecute(Sort.scala:64)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
>
> at
> 

Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Ted Yu
Please take a look at:
core/src/main/scala/org/apache/spark/SparkContext.scala

   * Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
   *
   *  then `rdd` contains
   * {{{
   *   (a-hdfs-path/part-0, its content)
   *   (a-hdfs-path/part-1, its content)
   *   ...
   *   (a-hdfs-path/part-n, its content)
   * }}}
...
  * @param minPartitions A suggestion value of the minimal splitting number
for input data.

  def wholeTextFiles(
  path: String,
  minPartitions: Int = defaultMinPartitions): RDD[(String, String)] =
withScope {
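
In other words, it returns a distributed pair RDD of (path, content), and the
files are read by the executors rather than funneled through the driver;
minPartitions only suggests how the files are grouped into partitions. A small
sketch (the bucket path is hypothetical, and s3n needs the usual credentials
and jars configured):

val files = sc.wholeTextFiles("s3n://my-bucket/input/", minPartitions = 32)

// Each element is (fileName, fileContent); note each file is read whole,
// so this suits many small files rather than a few huge ones.
val sizes = files.mapValues(_.length)
sizes.take(5).foreach(println)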

On Tue, Apr 26, 2016 at 7:43 AM, Vadim Vararu 
wrote:

> Hi guys,
>
> I'm trying to read many files from S3 using
> JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
> manner? Please give me a link to the place in the documentation where this is
> specified.
>
> Thanks, Vadim.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Cant join same dataframe twice ?

2016-04-25 Thread Ted Yu
Can you show us the structure of df2 and df3 ?

Thanks

On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot 
wrote:

> Hi,
> I am using Spark 1.5.2 .
> I have a use case where I need to join the same dataframe twice on two
> different columns.
> I am getting a missing-columns error.
>
> For instance,
> val df1 = df2.join(df3,"Column1")
> The line below throws the missing-columns error:
> val df4 = df1.join(df3,"Column2")
> Is this a bug or a valid scenario?
> Is the bug or valid scenario ?
>
>
>
>
> Thanks,
> Divya
>


Re: reduceByKey as Action or Transformation

2016-04-25 Thread Ted Yu
Can you show a snippet of your code which demonstrates what you observed ?

Thanks
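
For what it's worth, the distinction is easy to see with a tiny experiment:
in the sketch below nothing is executed until collect() is called, which
suggests the execution you observed was triggered by some action later in
your flow rather than by reduceByKey itself.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Transformation only: this builds lineage, no job is submitted yet.
val summed = pairs.reduceByKey(_ + _)

// The action is what actually triggers execution (visible as a job in the UI).
summed.collect().foreach(println)   // e.g. (a,4), (b,2)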

On Mon, Apr 25, 2016 at 8:38 AM, Weiping Qu  wrote:

> Thanks.
> I read that from the specification.
> I thought the way people distinguish actions and transformations depends
> on whether they are lazily executed or not.
> As far as I saw from my code, the reduceByKey was executed without
> any operations from the Action category.
> Please correct me if I am wrong.
>
> Thanks,
> Regards,
> Weiping
>
> On 25.04.2016 17:20, Chadha Pooja wrote:
>
>> Reduce By Key is a Transformation
>>
>> http://spark.apache.org/docs/latest/programming-guide.html#transformations
>>
>> Thanks
>>
>> _
>>
>> Pooja Chadha
>> Senior Architect
>> THE BOSTON CONSULTING GROUP
>> Mobile +1 617 794 3862
>>
>>
>> _
>>
>>
>>
>> -Original Message-
>> From: Weiping Qu [mailto:q...@informatik.uni-kl.de]
>> Sent: Monday, April 25, 2016 11:05 AM
>> To: u...@spark.incubator.apache.org
>> Subject: reduceByKey as Action or Transformation
>>
>> Hi,
>>
>> I'd just like to verify whether reduceByKey is a transformation or
>> an action.
>> As written in the RDD paper, a Spark flow is only triggered once
>> an action is reached.
>> I tried it and saw that my flow was executed once there was a
>> reduceByKey, while it is categorized as a transformation in the Spark 1.6.1
>> documentation.
>>
>> Thanks and Regards,
>> Weiping
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>> __
>> The Boston Consulting Group, Inc.
>>   This e-mail message may contain confidential and/or privileged
>> information.
>> If you are not an addressee or otherwise authorized to receive this
>> message,
>> you should not use, copy, disclose or take any action based on this
>> e-mail or
>> any information contained in the message. If you have received this
>> material
>> in error, please advise the sender immediately by reply e-mail and delete
>> this
>> message. Thank you.
>>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: next on empty iterator though i used hasNext

2016-04-25 Thread Ted Yu
Can you show more of your code inside the while loop ?

Which version of Spark / Kinesis do you use ?

Thanks
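
A common cause of this error is calling next() more than once per hasNext()
check, or consuming the partition iterator twice (for example once to count
and once to process). A minimal sketch of a safe foreachPartition pattern,
with the connection handling elided:

val rdd = sc.parallelize(Seq("a", "b", "c"))   // stand-in for the mapped Kinesis RDD

rdd.foreachPartition { iter =>
  // obtain one DB connection per partition here (elided)
  while (iter.hasNext) {
    val value = iter.next()   // exactly one next() per hasNext()
    println(value)            // stand-in for the DB write
  }
  // close / return the connection here (elided)
}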

On Mon, Apr 25, 2016 at 4:04 AM, Selvam Raman  wrote:

> I am reading data from a Kinesis stream (merging shard values with a union
> stream) into Spark Streaming, then using the following code to push the data
> to a DB.
> ​
>
> splitCSV.foreachRDD(new VoidFunction2<JavaRDD<String[]>, Time>()
> {
>
> private static final long serialVersionUID = 1L;
>
> public void call(JavaRDD<String[]> rdd, Time time) throws Exception
> {
> JavaRDD<SFieldBean> varMapRDD = rdd.map(new Function<String[], SFieldBean>()
> {
> private static final long serialVersionUID = 1L;
>
> public SFieldBean call(String[] values) throws Exception
> {
> .
> );
>
> varMapRDD.foreachPartition(new VoidFunction<Iterator<SFieldBean>>()
> {
> private static final long serialVersionUID = 1L;
> MySQLConnectionHelper.getConnection("urlinfo");
> @Override
> public void call(Iterator<SFieldBean> iterValues) throws Exception
> {
> 
> while(iterValues.hasNext())
> {
>
> }
> }
>
> Though I am using hasNext, it throws the following error
> ​
> Caused by: java.util.NoSuchElementException: next on empty iterator
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
> at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
> at
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
> at
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at
> scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:30)
> at com.test.selva.KinesisToSpark$2$2.call(KinesisToSpark.java:319)
> at com.test.selva.KinesisToSpark$2$2.call(KinesisToSpark.java:288)
> at
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:225)
> at
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:225)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> ... 3 more
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: Using Aggregate and group by on spark Dataset api

2016-04-24 Thread Ted Yu
Have you taken a look at:

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
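
For the specific question (sum of age and salary per name), the simplest route
on 1.6 is often to drop back to the untyped DataFrame aggregation; a minimal
sketch assuming the Person bean exposes name, age and sal as in your code:

import org.apache.spark.sql.functions.sum

// Dataset -> DataFrame keeps the data but exposes groupBy/agg by column name.
val totals = dss.toDF()
  .groupBy("name")
  .agg(sum("age").as("total_age"), sum("sal").as("total_sal"))

totals.show()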

On Sun, Apr 24, 2016 at 8:18 AM, coder  wrote:

> JavaRDD<Person> prdd = sc.textFile("c:\\fls\\people.txt").map(
>   new Function<String, Person>() {
> public Person call(String line) throws Exception {
>   String[] parts = line.split(",");
>   Person person = new Person();
>   person.setName(parts[0]);
>   person.setAge(Integer.parseInt(parts[1].trim()));
>  person.setSal(Integer.parseInt(parts[2].trim()));
>   return person;
> }
>   });
>
> RDD<Person> personRDD = prdd.toRDD(prdd);
> Dataset<Person> dss = sqlContext.createDataset(personRDD,
> Encoders.bean(Person.class));
> GroupedDataset dq = dss.groupBy(new Column("name"));
>
> I have to calculate the sum of age and salary grouped by name on the dataset.
> Please help me figure out how to query the dataset. I tried using GroupedDataset but don't
> know how to proceed with it.
>
> I cannot find much help for using the Dataset API.
> Please help.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-Aggregate-and-group-by-on-spark-Dataset-api-tp26824.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark Streaming Job get killed after running for about 1 hour

2016-04-24 Thread Ted Yu
Which version of Spark are you using ?

How did you increase the open file limit ?

Which operating system do you use ?

Please see Example 6. ulimit Settings on Ubuntu under:
http://hbase.apache.org/book.html#basic.prerequisites

On Sun, Apr 24, 2016 at 2:34 AM, fanooos  wrote:

> I have a Spark Streaming job that reads a tweet stream from Gnip and writes it
> to Kafka.
>
> Spark and kafka are running on the same cluster.
>
> My cluster consists of 5 nodes. Kafka-b01 ... Kafka-b05
>
> The Spark master is running on Kafka-b05.
>
> Here is how we submit the spark job
>
> nohup sh $SPARK_HOME/bin/spark-submit --total-executor-cores 5 --class
> org.css.java.gnipStreaming.GnipSparkStreamer --master
> spark://kafka-b05:7077
> GnipStreamContainer.jar powertrack
> kafka-b01.css.org,kafka-b02.css.org,kafka-b03.css.org,kafka-b04.css.org,
> kafka-b05.css.org
> gnip_live_stream 2 &
>
> After about 1 hour the Spark job gets killed.
>
> The logs in the nohup file show the following exception:
>
> /org.apache.spark.storage.BlockFetchException: Failed to fetch block from 2
> locations. Most recent failure cause:
> at
>
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
> at
>
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
> at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
> at
> org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
> at
> org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
> at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:48)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: io.netty.channel.ChannelException: Unable to create Channel from
> class class io.netty.channel.socket.nio.NioSocketChannel
> at
>
> io.netty.bootstrap.AbstractBootstrap$BootstrapChannelFactory.newChannel(AbstractBootstrap.java:455)
> at
>
> io.netty.bootstrap.AbstractBootstrap.initAndRegister(AbstractBootstrap.java:306)
> at io.netty.bootstrap.Bootstrap.doConnect(Bootstrap.java:134)
> at io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:116)
> at
>
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:211)
> at
>
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
> at
>
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
> at
>
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at
>
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
> at
>
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:99)
> at
>
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
> at
>
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
> ... 15 more
> Caused by: io.netty.channel.ChannelException: Failed to open a socket.
> at
>
> io.netty.channel.socket.nio.NioSocketChannel.newSocket(NioSocketChannel.java:62)
> at
>
> io.netty.channel.socket.nio.NioSocketChannel.(NioSocketChannel.java:72)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at java.lang.Class.newInstance(Class.java:442)
> at
>
> io.netty.bootstrap.AbstractBootstrap$BootstrapChannelFactory.newChannel(AbstractBootstrap.java:453)
> ... 26 more
> Caused by: java.net.SocketException: Too many open files
> at sun.nio.ch.Net.socket0(Native Method)
> at sun.nio.ch.Net.socket(Net.java:411)
> at sun.nio.ch.Net.socket(Net.java:404)
> at 

Re: Spark writing to secure zone throws : UnknownCryptoProtocolVersionException

2016-04-23 Thread Ted Yu
Can you check that the DFSClient Spark uses is the same version as on the
server side ?

The client and server (NameNode) negotiate a "crypto protocol version" -
this is a forward-looking feature.
Please note:

bq. Client provided: []

Meaning client didn't provide any supported crypto protocol version.
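
One quick way to compare the two sides (just a sketch; VersionInfo ships with hadoop-common) is to print the Hadoop version the Spark driver actually loads and check it against the 2.6.1 NameNode:

```scala
// Sketch: print the Hadoop client version loaded by Spark, to compare with the NameNode (2.6.1 here)
import org.apache.hadoop.util.VersionInfo

println(s"Hadoop client version: ${VersionInfo.getVersion}")
```

If the client side reports something much older, that would explain the empty list of supported crypto protocol versions.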

Cheers

On Wed, Apr 20, 2016 at 3:27 AM, pierre lacave  wrote:

> Hi
>
>
> I am trying to use spark to write to a protected zone in hdfs, I am able to 
> create and list file using the hdfs client but when writing via Spark I get 
> this exception.
>
> I could not find any mention of CryptoProtocolVersion in the spark doc.
>
>
> Any idea what could have gone wrong?
>
>
> spark (1.5.0), hadoop (2.6.1)
>
>
> Thanks
>
>
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.UnknownCryptoProtocolVersionException):
>  No crypto protocol versions provided by the client are supported. Client 
> provided: [] NameNode supports: [CryptoProtocolVersion{description='Unknown', 
> version=1, unknownValue=null}, CryptoProtocolVersion{description='Encryption 
> zones', version=2, unknownValue=null}]
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.chooseProtocolVersion(FSNamesystem.java:2468)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2600)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2520)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:579)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:394)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034)
>
>   at org.apache.hadoop.ipc.Client.call(Client.java:1411)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1364)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>   at com.sun.proxy.$Proxy13.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:264)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy14.create(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1612)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1488)
>   at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1413)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:387)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:383)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:383)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:327)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
>   at 
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>   at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at 

Re: Using saveAsNewAPIHadoopDataset for Saving custom classes to Hbase

2016-04-22 Thread Ted Yu
Which hbase release are you using ?

Below is the write method from hbase 1.1 :

public void write(KEY key, Mutation value)
    throws IOException {
  if (!(value instanceof Put) && !(value instanceof Delete)) {
    throw new IOException("Pass a Delete or a Put");
  }
  mutator.mutate(value);

Mutation is an hbase class:

public abstract class Mutation extends OperationWithAttributes
    implements Row, CellScannable, HeapSize {

If you can show the skeleton of CustomClass, that would give us more of a clue.

From the exception, it looks like CustomClass doesn't extend Mutation.

A Mutation object can modify multiple columns.
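
In other words, the value side of the pair RDD has to already be a Put (or Delete); TableOutputFormat will not convert a custom class for you. A hypothetical sketch of the conversion (the column family "cf" and qualifier "col" are invented, getRowKey is assumed to return a String, and addColumn is the 1.x method name):

```scala
// Hypothetical sketch: wrap each CustomClass in a Put before saving.
// "cf"/"col" are made-up names; getRowKey is assumed to return a String.
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes

val putRdd = rdd.map { it =>
  val put = new Put(Bytes.toBytes(getRowKey(it)))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(it.toString))
  (new ImmutableBytesWritable(put.getRow), put)
}
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
putRdd.saveAsNewAPIHadoopDataset(job.getConfiguration)
```

How CustomClass gets flattened into columns (one column per field, or a single serialized blob) is entirely up to the mapping function; HBase only sees the Puts.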

Cheers

On Fri, Apr 22, 2016 at 8:14 PM, Nkechi Achara 
wrote:

> Hi All,
>
> I am having a few issues saving my data to HBase.
>
> I have created a pairRDD for my custom class using the following:
>
> val rdd1 = rdd.map { it =>
>   (getRowKey(it), it)
> }
>
> val job = Job.getInstance(hConf)
> val jobConf = job.getConfiguration
> jobConf.set(TableOutputFormat.OUTPUT_TABLE, "tableName")
> job.setOutputFormatClass(classOf[TableOutputFormat[CustomClass]])
>
> rdd1.saveAsNewAPIHadoopDataset(jobConf)
>
> When I run it, I receive the error:
>
> java.lang.ClassCastException: com.test.CustomClass cannot be cast to
> org.apache.hadoop.hbase.client.Mutation
> at
> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:85)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1036)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1034)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1034)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1205)
>
> Has anyone got a concrete example of how to use this function?
> Also, does anyone know what it will actually save to HBase? Will it just
> be a single column for the CustomClass?
>
> Thanks,
>
> Keech
>


Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
The class is private :

final class OffsetRange private(
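
Since the constructor is private, the factory methods on the companion object are the intended entry point (if I remember the spark-streaming-kafka API correctly, it exposes OffsetRange.create/apply); a sketch:

```scala
// Sketch: build OffsetRange via the companion object instead of `new`
// (assumes OffsetRange.create(topic, partition, fromOffset, untilOffset) is available)
import org.apache.spark.streaming.kafka.OffsetRange

val offsetRanges = Array(OffsetRange.create("newtopic", 0, 110L, 220L))
```

KafkaUtils.createRDD takes an Array[OffsetRange], so the Array wrapper should also match that signature.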

On Fri, Apr 22, 2016 at 4:08 PM, Mich Talebzadeh 
wrote:

> Ok, I decided to forgo that approach and use an existing program of mine
> with a slight modification. The code is this:
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.functions._
> import _root_.kafka.serializer.StringDecoder
> import org.apache.spark.streaming._
> import org.apache.spark.streaming.kafka.KafkaUtils
> import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}
> //
> object CEP_assembly {
>   def main(args: Array[String]) {
>   val conf = new SparkConf().
>setAppName("CEP_assembly").
>setMaster("local[2]").
>set("spark.driver.allowMultipleContexts", "true").
>set("spark.hadoop.validateOutputSpecs", "false")
>   val sc = new SparkContext(conf)
>   // Create sqlContext based on HiveContext
>   val sqlContext = new HiveContext(sc)
>   import sqlContext.implicits._
>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   println ("\nStarted at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
> val ssc = new StreamingContext(conf, Seconds(1))
> ssc.checkpoint("checkpoint")
> val kafkaParams = Map[String, String]("bootstrap.servers" ->
> "rhes564:9092", "schema.registry.url" -> "http://rhes564:8081;,
> "zookeeper.connect" -> "rhes564:2181", "group.id" -> "StreamTest" )
> val topics = Set("newtopic", "newtopic")
> val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder,
> StringDecoder](ssc, kafkaParams, topics)
> dstream.cache()
> val lines = dstream.map(_._2)
> val showResults = lines.filter(_.contains("statement cache")).flatMap(line
> => line.split("\n,")).map(word => (word, 1)).reduceByKey(_ + _)
> // Define the offset ranges to read in the batch job
> val offsetRanges = new OffsetRange("newtopic", 0, 110, 220)
> // Create the RDD based on the offset ranges
> val rdd = KafkaUtils.createRDD[String, String, StringDecoder,
> StringDecoder](sc, kafkaParams, offsetRanges)
> ssc.start()
> ssc.awaitTermination()
> //ssc.stop()
>   println ("\nFinished at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
>   }
> }
>
>
> With sbt
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"  %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1" %
> "provided"
> libraryDependencies += "junit" % "junit" % "4.12"
> libraryDependencies += "org.scala-sbt" % "test-interface" % "1.0"
> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" %
> "1.6.1"
> libraryDependencies += "org.scalactic" %% "scalactic" % "2.2.6"
> libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6"
> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.5.1"
> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.5.1" %
> "test"
> libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" %
> "1.6.1"
>
>
> However, I am getting the following error
>
> [info] Loading project definition from
> /data6/hduser/scala/CEP_assembly/project
> [info] Set current project to CEP_assembly (in build
> file:/data6/hduser/scala/CEP_assembly/)
> [info] Compiling 1 Scala source to
> /data6/hduser/scala/CEP_assembly/target/scala-2.10/classes...
> [error]
> /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:37:
> constructor OffsetRange in class OffsetRange cannot be accessed in object
> CEP_assembly
> [error] val offsetRanges = new OffsetRange("newtopic", 0, 110, 220)
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 April 2016 at 18:41, Marcelo Vanzin  wrote:
>
>> On Fri, Apr 22, 2016 at 10:38 AM, Mich Talebzadeh
>>  wrote:
>> > I am trying to test Spark with CEP and I have been shown a sample here
>> >
>> https://github.com/agsachin/spark/blob/CEP/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala#L532
>>
>> I'm not familiar with CEP, but that's a Spark unit test, so if you're
>> trying to run it outside of the context of Spark unit tests (as it
>> seems you're trying to do), you're going to run into a world of
>> trouble. I'd suggest a different approach where 

Re: How this unit test passed on master trunk?

2016-04-22 Thread Ted Yu
This was added by Xiao through:

[SPARK-13320][SQL] Support Star in CreateStruct/CreateArray and Error
Handling when DataFrame/DataSet Functions using Star

I tried in spark-shell and got:

scala> val first =
structDf.groupBy($"a").agg(min(struct($"record.*"))).first()
first: org.apache.spark.sql.Row = [1,[1,1]]

BTW
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.7/715/consoleFull
shows this test passing.
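
Note that first() on a grouped aggregate has no guaranteed ordering, so which group ([1,[1,1]] or [3,[3,1]]) comes back first can vary between environments. A sketch of how to make such a check deterministic (the orderBy is my addition, not part of the actual test):

```scala
// Sketch: sort explicitly so that first() is deterministic; orderBy is an addition, not the test's code
val firstRow = structDf.groupBy($"a")
  .agg(min(struct($"record.*")).as("m"))
  .orderBy($"a".desc)
  .first()
// with the testData2 rows below this yields [3,[3,1]]
```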

On Fri, Apr 22, 2016 at 11:23 AM, Yong Zhang  wrote:

> Hi,
>
> I was trying to find out why this unit test can pass in Spark code.
>
> in
>
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
>
> for this unit test:
>
>   test("Star Expansion - CreateStruct and CreateArray") {
> val structDf = testData2.select("a", "b").as("record")
> // CreateStruct and CreateArray in aggregateExpressions
> *assert(structDf.groupBy($"a").agg(min(struct($"record.*"))).first() == 
> Row(3, Row(3, 1)))*
> assert(structDf.groupBy($"a").agg(min(array($"record.*"))).first() == 
> Row(3, Seq(3, 1)))
>
> // CreateStruct and CreateArray in project list (unresolved alias)
> assert(structDf.select(struct($"record.*")).first() == Row(Row(1, 1)))
> assert(structDf.select(array($"record.*")).first().getAs[Seq[Int]](0) === 
> Seq(1, 1))
>
> // CreateStruct and CreateArray in project list (alias)
> assert(structDf.select(struct($"record.*").as("a")).first() == Row(Row(1, 
> 1)))
> 
> assert(structDf.select(array($"record.*").as("a")).first().getAs[Seq[Int]](0) 
> === Seq(1, 1))
>   }
>
> From my understanding, the data returned in this case should be Row(1, Row(1,
> 1)), as that will be the min of the struct.
>
> In fact, if I run the spark-shell on my laptop, I get the result I
> expected:
>
>
> ./bin/spark-shell
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
>       /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> case class TestData2(a: Int, b: Int)
> defined class TestData2
>
> scala> val testData2DF = sqlContext.sparkContext.parallelize(TestData2(1,1) 
> :: TestData2(1,2) :: TestData2(2,1) :: TestData2(2,2) :: TestData2(3,1) :: 
> TestData2(3,2) :: Nil, 2).toDF()
>
> scala> val structDF = testData2DF.select("a","b").as("record")
>
> scala> structDF.groupBy($"a").agg(min(struct($"record.*"))).first()
> res0: org.apache.spark.sql.Row = [1,[1,1]]
>
> scala> structDF.show
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  1|
> |  1|  2|
> |  2|  1|
> |  2|  2|
> |  3|  1|
> |  3|  2|
> +---+---+
>
> So from my Spark, which I built from master, I cannot get Row[3,[1,1]] back
> in this case. Why does the unit test assert that Row[3,[1,1]] should be the
> first, and why does it pass, when I cannot reproduce that in my spark-shell? I am
> trying to understand how to interpret the meaning of
> "agg(min(struct($"record.*")))".
>
>
> Thanks
>
> Yong
>
>


Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
Marcelo:
From yesterday's thread, Mich revealed that he was looking at:

https://github.com/agsachin/spark/blob/CEP/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala

which references SparkFunSuite.

In an earlier thread, Mich was asking about CEP.

Just to give you some background.

On Fri, Apr 22, 2016 at 10:31 AM, Marcelo Vanzin 
wrote:

> Sorry, I've been looking at this thread and the related ones and one
> thing I still don't understand is: why are you trying to use internal
> Spark classes like Logging and SparkFunSuite in your code?
>
> Unless you're writing code that lives inside Spark, you really
> shouldn't be trying to reference them. First reason being that they're
> "private[spark]" and even if they're available, the compiler won't let
> you.
>
> On Fri, Apr 22, 2016 at 12:21 AM, Mich Talebzadeh
>  wrote:
> >
> > Hi,
> >
> > Anyone know which jar file has  import org.apache.spark.internal.Logging?
> >
> > I tried spark-core_2.10-1.5.1.jar
> >
> > but does not seem to work
> >
> > scala> import org.apache.spark.internal.Logging
> >
> > :57: error: object internal is not a member of package
> > org.apache.spark
> >  import org.apache.spark.internal.Logging
> >
> > Thanks
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
eep(in: Tick, ew: WindowState): Boolean = {
> [error]^
> [error]
> /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:592:
> not found: type WindowState
> [error] val predicateMapping: Map[String, (Tick, WindowState) =>
> Boolean] =
> [error]  ^
> [error]
> /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:595:
> value patternMatchByKeyAndWindow is not a member of
> org.apache.spark.streaming.dstream.DStream[(String, Tick)]
> [error] val matches = kvTicks.patternMatchByKeyAndWindow("rise drop
> [rise ]+ deep".r,
> [error]   ^
> [error]
> /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:646:
> not found: type WindowState
> [error] def rise(in: Tick, ew: WindowState): Boolean = {
> [error]^
> [error]
> /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:650:
> not found: type WindowState
> [error] def drop(in: Tick, ew: WindowState): Boolean = {
> [error]^
> [error]
> /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:654:
> not found: type WindowState
> [error] def deep(in: Tick, ew: WindowState): Boolean = {
> [error]^
> [error]
> /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:660:
> not found: type WindowState
> [error] val predicateMapping: Map[String, (Tick, WindowState) =>
> Boolean] =
> [error]  ^
> [error]
> /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:663:
> value patternMatchByWindow is not a member of
> org.apache.spark.streaming.dstream.DStream[(Long, Tick)]
> [error] val matches = kvTicks.patternMatchByWindow("rise drop [rise ]+
> deep".r,
> [error]   ^
> [error] 22 errors found
> [error] (compile:compileIncremental) Compilation failed
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 April 2016 at 14:53, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Normally Logging would be included in spark-shell session since
>> spark-core jar is imported by default:
>>
>> scala> import org.apache.spark.internal.Logging
>> import org.apache.spark.internal.Logging
>>
>> See this JIRA:
>>
>> [SPARK-13928] Move org.apache.spark.Logging into
>> org.apache.spark.internal.Logging
>>
>> In 1.6.x release, Logging was at org.apache.spark.Logging
>>
>> FYI
>>
>> On Fri, Apr 22, 2016 at 12:21 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> Anyone know which jar file has  import org.apache.spark.internal.Logging?
>>>
>>> I tried *spark-core_2.10-1.5.1.jar *
>>>
>>> but does not seem to work
>>>
>>> scala> import org.apache.spark.internal.Logging
>>>
>>> :57: error: object internal is not a member of package
>>> org.apache.spark
>>>  import org.apache.spark.internal.Logging
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>
>>
>


Re: Custom Log4j layout on YARN = ClassNotFoundException

2016-04-22 Thread Ted Yu
There is not much in the body of the email.

Can you elaborate on what issue you encountered?

Thanks

On Fri, Apr 22, 2016 at 2:27 AM, Rowson, Andrew G. (TR Technology & Ops) <
andrew.row...@thomsonreuters.com> wrote:

>
> 
>
> This e-mail is for the sole use of the intended recipient and contains
> information that may be privileged and/or confidential. If you are not an
> intended recipient, please notify the sender by return e-mail and delete
> this e-mail and any attachments. Certain required legal entity disclosures
> can be accessed on our website.<
> http://site.thomsonreuters.com/site/disclosures/>
>
>
> -- Forwarded message --
> From: "Rowson, Andrew G. (TR Technology & Ops)" <
> andrew.row...@thomsonreuters.com>
> To: "user@spark.apache.org" 
> Cc:
> Date: Fri, 22 Apr 2016 10:27:53 +0100
> Subject: Custom Log4j layout on YARN = ClassNotFoundException
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Which jar file has import org.apache.spark.internal.Logging

2016-04-22 Thread Ted Yu
Normally Logging would be included in spark-shell session since spark-core
jar is imported by default:

scala> import org.apache.spark.internal.Logging
import org.apache.spark.internal.Logging

See this JIRA:

[SPARK-13928] Move org.apache.spark.Logging into
org.apache.spark.internal.Logging

In 1.6.x release, Logging was at org.apache.spark.Logging

FYI

On Fri, Apr 22, 2016 at 12:21 AM, Mich Talebzadeh  wrote:

>
> Hi,
>
> Anyone know which jar file has  import org.apache.spark.internal.Logging?
>
> I tried *spark-core_2.10-1.5.1.jar *
>
> but does not seem to work
>
> scala> import org.apache.spark.internal.Logging
>
> :57: error: object internal is not a member of package
> org.apache.spark
>  import org.apache.spark.internal.Logging
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
For you, it should be spark-core_2.10-1.5.1.jar

Please replace version of Spark in my example with the version you use.

On Thu, Apr 21, 2016 at 1:23 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Ted
>
> I cannot see spark-core_2.11-2.0.0-SNAPSHOT.jar  under
>
> https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/
>
> Sorry where are these artefacts please?
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 April 2016 at 20:24, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Plug in 1.5.1 for your jars:
>>
>> $ jar tvf ./core/target/spark-core_2.11-2.0.0-SNAPSHOT.jar | grep Logging
>> ...
>>   1781 Thu Apr 21 08:19:34 PDT 2016
>> org/apache/spark/internal/Logging$.class
>>
>> jar tvf
>> external/kafka/target/spark-streaming-kafka_2.11-2.0.0-SNAPSHOT.jar | grep
>> LeaderOffset
>> ...
>>   3310 Thu Apr 21 08:38:40 PDT 2016
>> org/apache/spark/streaming/kafka/KafkaCluster$LeaderOffset.class
>>
>> On Thu, Apr 21, 2016 at 11:52 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> These two are giving me griefs:
>>>
>>> scala> import org.apache.spark.internal.Logging
>>> :26: error: object internal is not a member of package
>>> org.apache.spark
>>>  import org.apache.spark.internal.Logging
>>>
>>> scala> import org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset
>>> :29: error: object KafkaCluster in package kafka cannot be
>>> accessed in package org.apache.spark.streaming.kafka
>>>  import
>>> org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 21 April 2016 at 18:29, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Thanks
>>>>
>>>> jar tvf spark-core_2.10-1.5.1-tests.jar | grep SparkFunSuite
>>>>   1787 Wed Sep 23 23:34:26 BST 2015
>>>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class
>>>>   1780 Wed Sep 23 23:34:26 BST 2015
>>>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$2.class
>>>>   3982 Wed Sep 23 23:34:26 BST 2015 org/apache/spark/SparkFunSuite.class
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 21 April 2016 at 18:21, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> Please replace version number for the release you are using :
>>>>>
>>>>> spark-core_2.10-1.5.1-tests.jar
>>>>>
>>>>> On Thu, Apr 21, 2016 at 10:18 AM, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> I don't seem to be able to locate
>>>>>> spark-core_2.11-2.0.0-SNAPSHOT-tests.jar file :(
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn * 
>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 21 April 2016 at 17:45, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> It is in core-XX-

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
Plug in 1.5.1 for your jars:

$ jar tvf ./core/target/spark-core_2.11-2.0.0-SNAPSHOT.jar | grep Logging
...
  1781 Thu Apr 21 08:19:34 PDT 2016 org/apache/spark/internal/Logging$.class

jar tvf external/kafka/target/spark-streaming-kafka_2.11-2.0.0-SNAPSHOT.jar
| grep LeaderOffset
...
  3310 Thu Apr 21 08:38:40 PDT 2016
org/apache/spark/streaming/kafka/KafkaCluster$LeaderOffset.class

On Thu, Apr 21, 2016 at 11:52 AM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> These two are giving me griefs:
>
> scala> import org.apache.spark.internal.Logging
> :26: error: object internal is not a member of package
> org.apache.spark
>  import org.apache.spark.internal.Logging
>
> scala> import org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset
> :29: error: object KafkaCluster in package kafka cannot be
> accessed in package org.apache.spark.streaming.kafka
>  import org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 April 2016 at 18:29, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>>
>> Thanks
>>
>> jar tvf spark-core_2.10-1.5.1-tests.jar | grep SparkFunSuite
>>   1787 Wed Sep 23 23:34:26 BST 2015
>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class
>>   1780 Wed Sep 23 23:34:26 BST 2015
>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$2.class
>>   3982 Wed Sep 23 23:34:26 BST 2015 org/apache/spark/SparkFunSuite.class
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 April 2016 at 18:21, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Please replace version number for the release you are using :
>>>
>>> spark-core_2.10-1.5.1-tests.jar
>>>
>>> On Thu, Apr 21, 2016 at 10:18 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> I don't seem to be able to locate
>>>> spark-core_2.11-2.0.0-SNAPSHOT-tests.jar file :(
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 21 April 2016 at 17:45, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> It is in core-XX-tests jar:
>>>>>
>>>>> $ jar tvf ./core/target/spark-core_2.11-2.0.0-SNAPSHOT-tests.jar |
>>>>> grep SparkFunSuite
>>>>>   1830 Thu Apr 21 08:19:14 PDT 2016
>>>>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class
>>>>>   1823 Thu Apr 21 08:19:14 PDT 2016
>>>>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$2.class
>>>>>   6232 Thu Apr 21 08:19:14 PDT 2016
>>>>> org/apache/spark/SparkFunSuite.class
>>>>>
>>>>> On Thu, Apr 21, 2016 at 9:39 AM, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> like war of attrition :)
>>>>>>
>>>>>> now I get with sbt
>>>>>>
>>>>>> object SparkFunSuite is not a member of package org.apache.spark
>>>>>>
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn * 
>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 21 April 2016 at 17:22, Ted Yu

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
Please replace version number for the release you are using :

spark-core_2.10-1.5.1-tests.jar

On Thu, Apr 21, 2016 at 10:18 AM, Mich Talebzadeh <mich.talebza...@gmail.com
> wrote:

> I don't seem to be able to locate spark-core_2.11-2.0.0-SNAPSHOT-tests.jar
> file :(
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 April 2016 at 17:45, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> It is in core-XX-tests jar:
>>
>> $ jar tvf ./core/target/spark-core_2.11-2.0.0-SNAPSHOT-tests.jar | grep
>> SparkFunSuite
>>   1830 Thu Apr 21 08:19:14 PDT 2016
>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class
>>   1823 Thu Apr 21 08:19:14 PDT 2016
>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$2.class
>>   6232 Thu Apr 21 08:19:14 PDT 2016 org/apache/spark/SparkFunSuite.class
>>
>> On Thu, Apr 21, 2016 at 9:39 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> like war of attrition :)
>>>
>>> now I get with sbt
>>>
>>> object SparkFunSuite is not a member of package org.apache.spark
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 21 April 2016 at 17:22, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Have you tried the following ?
>>>>
>>>> libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6"
>>>>
>>>> On Thu, Apr 21, 2016 at 9:19 AM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Unfortunately this sbt dependency is not working
>>>>>
>>>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" %
>>>>> "provided"
>>>>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"  %
>>>>> "provided"
>>>>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1" %
>>>>> "provided"
>>>>> libraryDependencies += "junit" % "junit" % "4.12"
>>>>> libraryDependencies += "org.scala-sbt" % "test-interface" % "1.0"
>>>>> libraryDependencies += "org.apache.spark" %% "spark-streaming" %
>>>>> "1.6.1" % "provided"
>>>>> libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" %
>>>>> "1.6.1"
>>>>> libraryDependencies += "org.scalactic" %% "scalactic" % "2.2.6"
>>>>> *libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" %
>>>>> "test"*
>>>>>
>>>>> Getting error
>>>>>
>>>>> [info] Compiling 1 Scala source to
>>>>> /data6/hduser/scala/CEP_assembly/target/scala-2.10/classes...
>>>>> [error]
>>>>> /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:32:
>>>>> object scalatest is not a member of package org
>>>>> [error] import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>>>>> [error]^
>>>>>
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 21 April 2016 at 16:49, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>> wrote:
>>>>>
>>&

Re: Spark 1.6.1. How to prevent serialization of KafkaProducer

2016-04-21 Thread Ted Yu
In KafkaWordCount, the String is sent back and producer.send() is called.

I guess if you don't find a viable solution with your current design, you can
consider the above.
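
For reference, another pattern that usually sidesteps the serialization problem entirely is to hold one producer per executor JVM in a lazily initialized object, so nothing has to travel with the closure. A sketch (broker address and serializer settings are placeholders; it assumes kafkaOut is a DStream of (String, String) pairs, as in the quoted code below):

```scala
// Sketch: one KafkaProducer per executor JVM, created lazily on first use; never serialized
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerHolder {
  lazy val producer: KafkaProducer[String, String] = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // placeholder
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    new KafkaProducer[String, String](props)
  }
}

kafkaOut.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach { case (k, v) =>
      ProducerHolder.producer.send(new ProducerRecord(outputTopic, k, v))
    }
  }
}
```

A shutdown hook (as in the KafkaSink attempt below) can still be added inside the object to close the producer when the executor JVM exits.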

On Thu, Apr 21, 2016 at 10:04 AM, Alexander Gallego 
wrote:

> Hello,
>
> I understand that you cannot serialize Kafka Producer.
>
> So I've tried:
>
> (as suggested here
> https://forums.databricks.com/questions/369/how-do-i-handle-a-task-not-serializable-exception.html
> )
>
>  - Make the class Serializable - not possible
>
>  - Declare the instance only within the lambda function passed in map.
>
> via:
>
> // as suggested by the docs
>
>
> ```scala
>
>  kafkaOut.foreachRDD(rdd => {
>    rdd.foreachPartition(partition => {
>      val producer = new KafkaProducer(..)
>      partition.foreach { record =>
>        producer.send(new ProducerRecord(outputTopic, record._1, record._2))
>      }
>      producer.close()
>    })
>  }) // foreachRDD
>
>
> ```
>
> - Make the NotSerializable object as a static and create it once per
> machine.
>
> via:
>
>
> ```scala
>
>
> object KafkaSink {
>   @volatile private var instance: Broadcast[KafkaProducer[String, String]]
> = null
>   def getInstance(brokers: String, sc: SparkContext):
> Broadcast[KafkaProducer[String, String]] = {
> if (instance == null) {
>   synchronized {
> println("Creating new kafka producer")
> val props = new java.util.Properties()
> ...
> instance = sc.broadcast(new KafkaProducer[String, String](props))
> sys.addShutdownHook {
>   instance.value.close()
> }
>   }
> }
> instance
>   }
> }
>
>
> ```
>
>
>
>  - Call rdd.forEachPartition and create the NotSerializable object in
> there like this:
>
> Same as above.
>
>
> - Mark the instance @transient
>
> Same thing, just make it a class variable via:
>
>
> ```
> @transient var producer: KafkaProducer[String,String] = null
> def getInstance() = {
>if( producer == null ) {
>producer = new KafkaProducer()
>}
>producer
> }
>
> ```
>
>
> However, I get serialization problems with all of these options.
>
>
> Thanks for your help.
>
> - Alex
>
>


Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
It is in core-XX-tests jar:

$ jar tvf ./core/target/spark-core_2.11-2.0.0-SNAPSHOT-tests.jar | grep
SparkFunSuite
  1830 Thu Apr 21 08:19:14 PDT 2016
org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class
  1823 Thu Apr 21 08:19:14 PDT 2016
org/apache/spark/SparkFunSuite$$anonfun$withFixture$2.class
  6232 Thu Apr 21 08:19:14 PDT 2016 org/apache/spark/SparkFunSuite.class

On Thu, Apr 21, 2016 at 9:39 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> like war of attrition :)
>
> now I get with sbt
>
> object SparkFunSuite is not a member of package org.apache.spark
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 April 2016 at 17:22, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Have you tried the following ?
>>
>> libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6"
>>
>> On Thu, Apr 21, 2016 at 9:19 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Unfortunately this sbt dependency is not working
>>>
>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" %
>>> "provided"
>>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"  %
>>> "provided"
>>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1" %
>>> "provided"
>>> libraryDependencies += "junit" % "junit" % "4.12"
>>> libraryDependencies += "org.scala-sbt" % "test-interface" % "1.0"
>>> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1"
>>> % "provided"
>>> libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" %
>>> "1.6.1"
>>> libraryDependencies += "org.scalactic" %% "scalactic" % "2.2.6"
>>> *libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" %
>>> "test"*
>>>
>>> Getting error
>>>
>>> [info] Compiling 1 Scala source to
>>> /data6/hduser/scala/CEP_assembly/target/scala-2.10/classes...
>>> [error]
>>> /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:32:
>>> object scalatest is not a member of package org
>>> [error] import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>>> [error]^
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 21 April 2016 at 16:49, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Ted. It was a typo in my alias and it is sorted now
>>>>
>>>> slong='rlwrap spark-shell --master spark://50.140.197.217:7077 --jars
>>>> /home/hduser/jars/ojdbc6.jar,/home/hduser/jars/jconn4.jar,/home/hduser/jars/junit-4.12.jar,/home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar,/home/hduser/jars/scalatest_2.11-2.2.6.jar'
>>>>
>>>>
>>>> scala> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>>>> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>>>>
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 21 April 2016 at 16:44, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> I tried on refreshed copy of master branch:
>>>>>
>>>>> $ bin/spark-shell --jars
>>>>> /home/hbase/.m2/repository/org/scalatest/sca

Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
Have you tried the following ?

libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6"

On Thu, Apr 21, 2016 at 9:19 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Unfortunately this sbt dependency is not working
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"  %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1" %
> "provided"
> libraryDependencies += "junit" % "junit" % "4.12"
> libraryDependencies += "org.scala-sbt" % "test-interface" % "1.0"
> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.6.1" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" %
> "1.6.1"
> libraryDependencies += "org.scalactic" %% "scalactic" % "2.2.6"
> *libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" % "test"*
>
> Getting error
>
> [info] Compiling 1 Scala source to
> /data6/hduser/scala/CEP_assembly/target/scala-2.10/classes...
> [error]
> /data6/hduser/scala/CEP_assembly/src/main/scala/myPackage/CEP_assemly.scala:32:
> object scalatest is not a member of package org
> [error] import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
> [error]^
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 April 2016 at 16:49, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Thanks Ted. It was a typo in my alias and it is sorted now
>>
>> slong='rlwrap spark-shell --master spark://50.140.197.217:7077 --jars
>> /home/hduser/jars/ojdbc6.jar,/home/hduser/jars/jconn4.jar,/home/hduser/jars/junit-4.12.jar,/home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar,/home/hduser/jars/scalatest_2.11-2.2.6.jar'
>>
>>
>> scala> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 21 April 2016 at 16:44, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> I tried on refreshed copy of master branch:
>>>
>>> $ bin/spark-shell --jars
>>> /home/hbase/.m2/repository/org/scalatest/scalatest_2.11/2.2.6/scalatest_2.11-2.2.6.jar
>>> ...
>>> scala> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>>> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>>>
>>> BTW I noticed an extra leading comma after '--jars' in your email.
>>> Not sure if that matters.
>>>
>>> On Thu, Apr 21, 2016 at 8:39 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Mich:
>>>>
>>>> $ jar tvf
>>>> /home/hbase/.m2/repository/org/scalatest/scalatest_2.11/2.2.6/scalatest_2.11-2.2.6.jar
>>>> | grep BeforeAndAfter
>>>>   4257 Sat Dec 26 14:35:48 PST 2015
>>>> org/scalatest/BeforeAndAfter$class.class
>>>>   2602 Sat Dec 26 14:35:48 PST 2015 org/scalatest/BeforeAndAfter.class
>>>>   1998 Sat Dec 26 14:35:48 PST 2015
>>>> org/scalatest/BeforeAndAfterAll$$anonfun$run$1.class
>>>>
>>>> FYI
>>>>
>>>> On Thu, Apr 21, 2016 at 8:35 AM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> to Mario, Alonso, Luciano, user
>>>>> Hi,
>>>>>
>>>>> Following example in
>>>>>
>>>>>
>>>>> https://github.com/agsachin/spark/blob/CEP/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala#L532
>>>>>
>>>>> Does anyone know which jar file this belongs to?
>>>>>
>>>>> I use *scalatest_2.11-2.2.6.jar *in my spark-shell
>>>>>
>>>>>  spark-shell --master spark://50.140.197.217:7077 --jars
>>>>> ,/home/hduser/jars/junit-4.12.jar,/home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar,
>>>>> /*home/hduser/jars/scalatest_2.11-2.2.6.jar'*
>>>>>
>>>>> scala> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>>>>> :28: error: object scalatest is not a member of package org
>>>>>  import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>>>>>
>>>>> Thanks
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
I tried on refreshed copy of master branch:

$ bin/spark-shell --jars
/home/hbase/.m2/repository/org/scalatest/scalatest_2.11/2.2.6/scalatest_2.11-2.2.6.jar
...
scala> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}

BTW I noticed an extra leading comma after '--jars' in your email.
Not sure if that matters.

On Thu, Apr 21, 2016 at 8:39 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Mich:
>
> $ jar tvf
> /home/hbase/.m2/repository/org/scalatest/scalatest_2.11/2.2.6/scalatest_2.11-2.2.6.jar
> | grep BeforeAndAfter
>   4257 Sat Dec 26 14:35:48 PST 2015
> org/scalatest/BeforeAndAfter$class.class
>   2602 Sat Dec 26 14:35:48 PST 2015 org/scalatest/BeforeAndAfter.class
>   1998 Sat Dec 26 14:35:48 PST 2015
> org/scalatest/BeforeAndAfterAll$$anonfun$run$1.class
>
> FYI
>
> On Thu, Apr 21, 2016 at 8:35 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> to Mario, Alonso, Luciano, user
>> Hi,
>>
>> Following example in
>>
>>
>> https://github.com/agsachin/spark/blob/CEP/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala#L532
>>
>> Does anyone know which jar file this belongs to?
>>
>> I use *scalatest_2.11-2.2.6.jar *in my spark-shell
>>
>>  spark-shell --master spark://50.140.197.217:7077 --jars
>> ,/home/hduser/jars/junit-4.12.jar,/home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar,
>> /*home/hduser/jars/scalatest_2.11-2.2.6.jar'*
>>
>> scala> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>> :28: error: object scalatest is not a member of package org
>>  import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>


Re: Issue with Spark shell and scalatest

2016-04-21 Thread Ted Yu
Mich:

$ jar tvf
/home/hbase/.m2/repository/org/scalatest/scalatest_2.11/2.2.6/scalatest_2.11-2.2.6.jar
| grep BeforeAndAfter
  4257 Sat Dec 26 14:35:48 PST 2015 org/scalatest/BeforeAndAfter$class.class
  2602 Sat Dec 26 14:35:48 PST 2015 org/scalatest/BeforeAndAfter.class
  1998 Sat Dec 26 14:35:48 PST 2015
org/scalatest/BeforeAndAfterAll$$anonfun$run$1.class

FYI

On Thu, Apr 21, 2016 at 8:35 AM, Mich Talebzadeh 
wrote:

> to Mario, Alonso, Luciano, user
> Hi,
>
> Following example in
>
>
> https://github.com/agsachin/spark/blob/CEP/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala#L532
>
> Does anyone know which jar file this belongs to?
>
> I use *scalatest_2.11-2.2.6.jar *in my spark-shell
>
>  spark-shell --master spark://50.140.197.217:7077 --jars
> ,/home/hduser/jars/junit-4.12.jar,/home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar,
> /*home/hduser/jars/scalatest_2.11-2.2.6.jar'*
>
> scala> import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
> :28: error: object scalatest is not a member of package org
>  import org.scalatest.{BeforeAndAfter, BeforeAndAfterAll}
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: Save DataFrame to HBase

2016-04-21 Thread Ted Yu
The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can
do this.
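
A rough sketch of its RDD-level API (method names from memory, so treat as approximate; the table name "t1", column family "cf", and column layout are made up):

```scala
// Rough sketch of hbase-spark's bulkPut (treat names as approximate); "t1"/"cf"/"c1" are invented
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.Row

val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
hbaseContext.bulkPut[Row](df.rdd, TableName.valueOf("t1"), (row: Row) => {
  val put = new Put(Bytes.toBytes(row.getString(0)))   // first column used as the row key
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("c1"), Bytes.toBytes(row.getString(1)))
  put
})
```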

On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim  wrote:

> Has anyone found an easy way to save a DataFrame into HBase?
>
> Thanks,
> Ben
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: RDD generated from Dataframes

2016-04-21 Thread Ted Yu
In upcoming 2.0 release, the signature for map() has become:

  def map[U : Encoder](func: T => U): Dataset[U] = withTypedPlan {

Note: DataFrame and DataSet are unified in 2.0
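
So on 2.0 a typed map keeps you inside the Dataset/DataFrame world instead of dropping to an RDD; a small sketch (assumes a SparkSession named spark):

```scala
// Sketch (Spark 2.0): map on a Dataset goes through an Encoder, no detour via RDD needed
import spark.implicits._

case class Person(name: String, age: Int)

val ds = Seq(Person("alice", 29), Person("bob", 31)).toDS()   // Dataset[Person]
val older = ds.map(p => p.copy(age = p.age + 1))              // still a Dataset[Person]
```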

FYI

On Thu, Apr 21, 2016 at 6:49 AM, Apurva Nandan  wrote:

> Hello everyone,
>
> Generally speaking, I guess it's well known that dataframes are much
> faster than RDDs when it comes to performance.
> My question is how you go about transforming a
> dataframe using map.
> I mean, the dataframe then gets converted into an RDD, so do you then
> convert this RDD back to a new dataframe for better performance?
> Further, if you have a process which involves a series of transformations,
> i.e. from one RDD to another, do you keep converting each RDD to a
> dataframe first, every time?
>
> It's also possible that I might be missing something here, please share
> your experiences.
>
>
> Thanks and Regards,
> Apurva
>


Re: StructField Translation Error with Spark SQL

2016-04-21 Thread Ted Yu
You meant for fields which are nullable.

Can you pastebin the complete stack trace ?

Try 1.6.1 when you have a chance.

Thanks

On Wed, Apr 20, 2016 at 10:20 PM, Charles Nnamdi Akalugwu <
cprenzb...@gmail.com> wrote:

> I get the same error for fields which are not null unfortunately.
>
> Can't translate null value for field
> StructField(density,DecimalType(4,2),true)
> On Apr 21, 2016 1:37 AM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>
>> The weight field is not nullable.
>>
>> Looks like your source table had null value for this field.
>>
>> On Wed, Apr 20, 2016 at 4:11 PM, Charles Nnamdi Akalugwu <
>> cprenzb...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am using Spark 1.4.1 and trying to copy all rows from a table in one
>>> MySQL database to an Amazon RDS table using Spark SQL.
>>>
>>> Some columns in the source table are defined as DECIMAL type and are
>>> nullable. Others are not.  When I run my spark job,
>>>
>>> val writeData = sqlContext.read.format("jdbc").option("url",
>>>>> sourceUrl).option("driver", "com.mysql.jdbc.Driver").option("dbtable",
>>>>> table).option("user", sourceUsername).option("password",
>>>>> sourcePassword).load()
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> writeData.write.format("com.databricks.spark.redshift").option("url",
>>>>> String.format(targetUrl, targetUsername, 
>>>>> targetPassword)).option("dbtable",
>>>>> table).option("tempdir", redshiftTempDir+table).mode("append").save()
>>>>
>>>>
>>> it fails with the following exception
>>>
>>> Can't translate null value for field
>>>> StructField(weight,DecimalType(5,2),false)
>>>
>>>
>>> Any insights about this exception would be very helpful.
>>>
>>
>>


Re: StructField Translation Error with Spark SQL

2016-04-20 Thread Ted Yu
The weight field is not nullable.

Looks like your source table had null value for this field.
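
If the target column really has to stay non-nullable, one workaround sketch is to replace or drop the nulls before the write (the 0.0 default is arbitrary; pick whatever makes sense for the data):

```scala
// Sketch: clean up nulls in the source DataFrame before writing
val filled  = writeData.na.fill(Map("weight" -> 0.0))   // substitute a default value
val dropped = writeData.na.drop(Seq("weight"))          // or drop rows where weight is null
```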

On Wed, Apr 20, 2016 at 4:11 PM, Charles Nnamdi Akalugwu <
cprenzb...@gmail.com> wrote:

> Hi,
>
> I am using Spark 1.4.1 and trying to copy all rows from a table in one
> MySQL database to an Amazon RDS table using Spark SQL.
>
> Some columns in the source table are defined as DECIMAL type and are
> nullable. Others are not.  When I run my spark job,
>
> val writeData = sqlContext.read.format("jdbc").option("url",
>>> sourceUrl).option("driver", "com.mysql.jdbc.Driver").option("dbtable",
>>> table).option("user", sourceUsername).option("password",
>>> sourcePassword).load()
>>
>>
>>
>>
>>
>> writeData.write.format("com.databricks.spark.redshift").option("url",
>>> String.format(targetUrl, targetUsername, targetPassword)).option("dbtable",
>>> table).option("tempdir", redshiftTempDir+table).mode("append").save()
>>
>>
> it fails with the following exception
>
> Can't translate null value for field
>> StructField(weight,DecimalType(5,2),false)
>
>
> Any insights about this exception would be very helpful.
>


Re: Invoking SparkR from Spark shell

2016-04-20 Thread Ted Yu
Please take a look at:
https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes

On Wed, Apr 20, 2016 at 9:50 AM, Ashok Kumar 
wrote:

> Hi,
>
> I have Spark 1.6.1 but I do not know how to invoke SparkR so I can use R
> with Spark.
>
> Is there a shell similar to spark-shell that supports R besides Scala,
> please?
>
>
> Thanks
>


Re: Re: Spark sql and hive into different result with same sql

2016-04-20 Thread Ted Yu
Do you mind trying out a build from the master branch?

1.5.3 is a bit old.
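
As a quick diagnostic (just a sketch, not a fix), it may be worth checking whether an explicit cast keeps the precision on that build:

```scala
// Diagnostic sketch only: see whether forcing the decimal type changes the sum
sqlContext.sql("SELECT SUM(CAST(column AS DECIMAL(38,18))) AS s FROM table").show(false)
```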

On Wed, Apr 20, 2016 at 5:25 AM, FangFang Chen 
wrote:

> I found Spark SQL loses precision and handles the data as int by some rule.
> Following is the data obtained via the hive shell and Spark SQL, with the same
> sql against the same hive table:
> Hive:
> 0.4
> 0.5
> 1.8
> 0.4
> 0.49
> 1.5
> Spark sql:
> 1
> 2
> 2
> It seems the handling rule is: when the fractional part is < 0.5, round to 0; when
> the fractional part is >= 0.5, round to 1.
>
> Is this a bug or some configuration thing? Please give some suggestions.
> Thanks
>
> Sent from NetEase Mail Master
> On 2016-04-20 18:45, FangFang Chen wrote:
>
> The output is:
> Spark SQL: 6828127
> Hive: 6980574.1269
>
> Sent from NetEase Mail Master
> On 2016-04-20 18:06, FangFang Chen wrote:
>
> Hi all,
> Please give some suggestions. Thanks
>
> With the following same sql, Spark SQL and Hive give different results. The sql
> sums decimal(38,18) columns:
> Select sum(column) from table;
> column is defined as decimal(38,18).
>
> Spark version:1.5.3
> Hive version:2.0.0
>
> Sent from NetEase Mail Master
>
>
>
>
>
>
>


Re: Why very small work load cause GC overhead limit?

2016-04-19 Thread Ted Yu
Can you tell us the memory parameters you used ?

If you can capture jmap output before the GC limit is exceeded, that would give us
more of a clue.

Thanks

> On Apr 19, 2016, at 7:40 PM, "kramer2...@126.com"  wrote:
> 
> Hi All
> 
> I use Spark to do some calculations.
> The situation is:
> 1. New files come into a folder periodically.
> 2. I turn the new files into a data frame and insert it into a previous data
> frame.
> 
> The code is like below :
> 
> 
># Get the file list in the HDFS directory
>client = InsecureClient('http://10.79.148.184:50070')
>file_list = client.list('/test')
> 
>df_total = None
>counter = 0
>for file in file_list:
>counter += 1
> 
># turn each file (CSV format) into data frame
>lines = sc.textFile("/test/%s" % file)
>parts = lines.map(lambda l: l.split(","))
>rows = parts.map(lambda p: Row(router=p[0], interface=int(p[1]),
> protocol=p[7],bit=int(p[10])))
>df = sqlContext.createDataFrame(rows)
> 
># do some transform on the data frame
>df_protocol =
> df.groupBy(['protocol']).agg(func.sum('bit').alias('bit'))
> 
># add the current data frame to previous data frame set
>if not df_total:
>df_total = df_protocol
>else:
>df_total = df_total.unionAll(df_protocol)
> 
># cache the df_total
>df_total.cache()
>if counter % 5 == 0:
>df_total.rdd.checkpoint()
> 
># get the df_total information
>df_total.show()
> 
> 
> I know that as time goes on, df_total could get big. But actually, before
> that time comes, the above code already raises an exception.
> 
> At about 30 loops, the code throws a GC overhead limit exceeded
> exception. The files are very small, so even at 300 loops the data size would only
> be about a few MB. I do not know why it throws a GC error.
> 
> The exception detail is below :
> 
>16/04/19 15:54:55 ERROR Utils: Uncaught exception in thread
> task-result-getter-2
>java.lang.OutOfMemoryError: GC overhead limit exceeded
>at
> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:328)
>at scala.collection.immutable.HashMap.updated(HashMap.scala:54)
>at
> scala.collection.immutable.HashMap$SerializationProxy.readObject(HashMap.scala:516)
>at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>at java.lang.reflect.Method.invoke(Method.java:606)
>at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>at 
> java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
>at
> org.apache.spark.executor.TaskMetrics$$anonfun$readObject$1.apply$mcV$sp(TaskMetrics.scala:220)
>at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>at 
> org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>at java.lang.reflect.Method.invoke(Method.java:606)
>at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>at
> org.apache.spark.scheduler.DirectTaskResult$$anonfun$readExternal$1.apply$mcV$sp(TaskResult.scala:79)
>at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>at
> org.apache.spark.scheduler.DirectTaskResult.readExternal(TaskResult.scala:62)
>at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
>Exception in thread "task-result-getter-2" java.lang.OutOfMemoryError: GC
> overhead limit exceeded
>at
> 
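For reference, a minimal PySpark 1.x sketch of one way to keep this kind of loop
bounded: checkpoint the accumulated DataFrame every few iterations and unpersist
the previous cached copy, so neither the lineage nor the set of cached frames keeps
growing. This is an illustration of the general pattern, not a confirmed fix for
this particular OOM; the paths, column layout, and checkpoint interval are
assumptions, not taken from the thread.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row
    import pyspark.sql.functions as func

    sc = SparkContext(appName="protocol-bits")
    sqlContext = SQLContext(sc)
    sc.setCheckpointDir("hdfs:///tmp/protocol_bits_checkpoint")  # required before checkpoint()

    file_list = ["file1.csv", "file2.csv"]   # stand-in for the HDFS directory listing
    df_total = None

    for counter, name in enumerate(file_list, start=1):
        rows = (sc.textFile("/test/%s" % name)
                  .map(lambda l: l.split(","))
                  .map(lambda p: Row(protocol=p[7], bit=int(p[10]))))
        df_protocol = (sqlContext.createDataFrame(rows)
                                 .groupBy("protocol")
                                 .agg(func.sum("bit").alias("bit")))

        df_total = df_protocol if df_total is None else df_total.unionAll(df_protocol)

        if counter % 5 == 0:
            rdd_ck = df_total.rdd        # RDD[Row] view of the accumulated frame
            rdd_ck.checkpoint()
            rdd_ck.count()               # checkpoint() is lazy; an action materializes it
            old = df_total
            df_total = sqlContext.createDataFrame(rdd_ck, old.schema)
            df_total.cache()
            old.unpersist()              # harmless if the old frame was never cached

    df_total.show()
    sc.stop()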

Re: Spark streaming batch time displayed is not current system time but it is processing current messages

2016-04-19 Thread Ted Yu
Using
http://www.ruddwire.com/handy-code/date-to-millisecond-calculators/#.VxZh3iMrKuo
, 1460823008000 is shown to be 'Sat Apr 16 2016 09:10:08 GMT-0700'

Can you clarify the 4 day difference ?

bq. 'right now April 14th'

The date of your email was Apr 16th.

On Sat, Apr 16, 2016 at 9:39 AM, Hemalatha A <
hemalatha.amru...@googlemail.com> wrote:

> Can anyone help me in debugging  this issue please.
>
>
> On Thu, Apr 14, 2016 at 12:24 PM, Hemalatha A <
> hemalatha.amru...@googlemail.com> wrote:
>
>> Hi,
>>
>> I am facing a problem in Spark streaming.
>>
>
>
> Time: 1460823006000 ms
> ---
>
> ---
> Time: 1460823008000 ms
> ---
>
>
>
>
>> The time displayed in the Spark Streaming console above is 4 days prior,
>> i.e., April 10th, which is not the current system time of the cluster, but
>> the job is processing current messages that are being pushed right now, April 14th.
>>
>> Can anyone please advise what time Spark Streaming displays? Also, when
>> there is a scheduling delay of, say, 8 hours, what time does Spark
>> display - the current time or hours behind?
>>
>> --
>>
>>
>> Regards
>> Hemalatha
>>
>
>
>
> --
>
>
> Regards
> Hemalatha
>
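As a small aside, the times printed between the dashed lines above are epoch
milliseconds, so they can be checked locally without a web converter; a one-line
Python check (UTC output shown as a comment):

    from datetime import datetime

    print(datetime.utcfromtimestamp(1460823008000 / 1000.0))
    # 2016-04-16 16:10:08  (i.e. Sat Apr 16 2016 09:10:08 GMT-0700, as noted above)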


Re: hbaseAdmin tableExists create catalogTracker for every call

2016-04-19 Thread Ted Yu
The CatalogTracker object may not be used by all the methods of HBaseAdmin.

Meaning, when HBaseAdmin is constructed, we don't need CatalogTracker.

On Tue, Apr 19, 2016 at 6:09 AM, WangYQ  wrote:

> In HBase 0.98.10, class "HBaseAdmin", line 303, method "tableExists" will
> create a CatalogTracker for every call.
>
>
> We could let an HBaseAdmin object reuse one CatalogTracker object, to reduce
> object creation, ZooKeeper connections, and so on.
>
>
>
>


Re: Logging in executors

2016-04-18 Thread Ted Yu
Looking through this thread, I don't see the Spark version you use.

Can you tell us the Spark release ?

Thanks

On Mon, Apr 18, 2016 at 6:32 AM, Carlos Rojas Matas <cma...@despegar.com>
wrote:

> Thanks Ted, I already checked it but it's not the same. I'm working with
> standalone Spark; the examples refer to HDFS paths, therefore I assume the
> Hadoop 2 Resource Manager is used. I've tried all possible flavours. The
> only one that worked was changing the spark-defaults.conf on every machine.
> I'll go with this for now, but the extra Java opts for the executor are
> definitely not working, at least for logging configuration.
>
> Thanks,
> -carlos.
>
> On Fri, Apr 15, 2016 at 3:28 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> See this thread: http://search-hadoop.com/m/q3RTtsFrd61q291j1
>>
>> On Fri, Apr 15, 2016 at 5:38 AM, Carlos Rojas Matas <cma...@despegar.com>
>> wrote:
>>
>>> Hi guys,
>>>
>>> any clue on this? Clearly the
>>> spark.executor.extraJavaOpts=-Dlog4j.configuration is not working on the
>>> executors.
>>>
>>> Thanks,
>>> -carlos.
>>>
>>> On Wed, Apr 13, 2016 at 2:48 PM, Carlos Rojas Matas <cma...@despegar.com
>>> > wrote:
>>>
>>>> Hi Yong,
>>>>
>>>> thanks for your response. As I said in my first email, I've tried both
>>>> the reference to the classpath resource (env/dev/log4j-executor.properties)
>>>> as the file:// protocol. Also, the driver logging is working fine and I'm
>>>> using the same kind of reference.
>>>>
>>>> Below the content of my classpath:
>>>>
>>>> [image: Inline image 1]
>>>>
>>>> Plus this is the content of the exploded fat jar assembled with sbt
>>>> assembly plugin:
>>>>
>>>> [image: Inline image 2]
>>>>
>>>>
>>>> This folder is at the root level of the classpath.
>>>>
>>>> Thanks,
>>>> -carlos.
>>>>
>>>> On Wed, Apr 13, 2016 at 2:35 PM, Yong Zhang <java8...@hotmail.com>
>>>> wrote:
>>>>
>>>>> Is the env/dev/log4j-executor.properties file within your jar file? Is
>>>>> the path matching with what you specified as
>>>>> env/dev/log4j-executor.properties?
>>>>>
>>>>> If you read the log4j document here:
>>>>> https://logging.apache.org/log4j/1.2/manual.html
>>>>>
>>>>> When you specify the log4j.configuration=my_custom.properties, you
>>>>> have 2 option:
>>>>>
>>>>> 1) the my_custom.properties has to be in the jar (or in the
>>>>> classpath). In your case, since you specify the package path, you need to
>>>>> make sure they are matched in your jar file
>>>>> 2) use like log4j.configuration=file:///tmp/my_custom.properties. In
>>>>> this way, you need to make sure file my_custom.properties exists in /tmp
>>>>> folder on ALL of your worker nodes.
>>>>>
>>>>> Yong
>>>>>
>>>>> --
>>>>> Date: Wed, 13 Apr 2016 14:18:24 -0300
>>>>> Subject: Re: Logging in executors
>>>>> From: cma...@despegar.com
>>>>> To: yuzhih...@gmail.com
>>>>> CC: user@spark.apache.org
>>>>>
>>>>>
>>>>> Thanks for your response Ted. You're right, there was a typo. I
>>>>> changed it, now I'm executing:
>>>>>
>>>>> bin/spark-submit --master spark://localhost:7077 --conf
>>>>> "spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
>>>>> --conf
>>>>> "spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-executor.properties"
>>>>> --class
>>>>>
>>>>> The content of this file is:
>>>>>
>>>>> # Set everything to be logged to the console
>>>>> log4j.rootCategory=INFO, FILE
>>>>> log4j.appender.console=org.apache.log4j.ConsoleAppender
>>>>> log4j.appender.console.target=System.err
>>>>> log4j.appender.console.layout=org.apache.log4j.PatternLayout
>>>>> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss}
>>>>> %p %c{1}: %m%n
>>>>>
>>>>> log4j.appender.FILE=org.apache.log4j.RollingFileAppender
>>>>> log4j.a

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
Thanks Josh.

I downloaded spark-1.6.1-bin-hadoop2.3.tgz
and spark-1.6.1-bin-hadoop2.4.tgz which expand without error.

On Sat, Apr 16, 2016 at 4:54 PM, Josh Rosen <joshro...@databricks.com>
wrote:

> Using a different machine / toolchain, I've downloaded and re-uploaded all
> of the 1.6.1 artifacts to that S3 bucket, so hopefully everything should be
> working now. Let me know if you still encounter any problems with
> unarchiving.
>
> On Sat, Apr 16, 2016 at 3:10 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Pardon me - there is no tarball for hadoop 2.7
>>
>> I downloaded
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>> and successfully expanded it.
>>
>> FYI
>>
>> On Sat, Apr 16, 2016 at 2:52 PM, Jon Gregg <jonrgr...@gmail.com> wrote:
>>
>>> That link points to hadoop2.6.tgz.  I tried changing the URL to
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz
>>> and I get a NoSuchKey error.
>>>
>>> Should I just go with it even though it says hadoop2.6?
>>>
>>> On Sat, Apr 16, 2016 at 5:37 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> BTW this was the original thread:
>>>>
>>>> http://search-hadoop.com/m/q3RTt0Oxul0W6Ak
>>>>
>>>> The link for spark-1.6.1-bin-hadoop2.7 is
>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz
>>>> <https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz>
>>>>
>>>> On Sat, Apr 16, 2016 at 2:14 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> From the output you posted:
>>>>> ---
>>>>> Unpacking Spark
>>>>>
>>>>> gzip: stdin: not in gzip format
>>>>> tar: Child returned status 1
>>>>> tar: Error is not recoverable: exiting now
>>>>> ---
>>>>>
>>>>> The artifact for spark-1.6.1-bin-hadoop2.6 is corrupt.
>>>>>
>>>>> This problem has been reported in other threads.
>>>>>
>>>>> Try spark-1.6.1-bin-hadoop2.7 - the artifact should be good.
>>>>>
>>>>> On Sat, Apr 16, 2016 at 2:09 PM, YaoPau <jonrgr...@gmail.com> wrote:
>>>>>
>>>>>> I launched a cluster with: "./spark-ec2 --key-pair my_pem
>>>>>> --identity-file
>>>>>> ../../ssh/my_pem.pem launch jg_spark2" and I got the "Spark standalone
>>>>>> cluster started at
>>>>>> http://ec2-54-88-249-255.compute-1.amazonaws.com:8080;
>>>>>> and "Done!" success confirmations at the end.  I confirmed on EC2
>>>>>> that 1
>>>>>> Master and 1 Slave were both launched and passed their status checks.
>>>>>>
>>>>>> But none of the Spark commands seem to work (spark-shell, pyspark,
>>>>>> etc), and
>>>>>> port 8080 isn't being used.  The full output from launching the
>>>>>> cluster is
>>>>>> below.  Any ideas what the issue is?
>>>>>>
>>>>>> >>>>>>>>>>>>>>>>>>>>>>
>>>>>> launch
>>>>>>
>>>>>> jg_spark2/Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/plugin.py:40:
>>>>>> PendingDeprecationWarning: the imp module is deprecated in favour of
>>>>>> importlib; see the module's documentation for alternative uses
>>>>>>   import imp
>>>>>>
>>>>>> /Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/provider.py:197:
>>>>>> ResourceWarning: unclosed file <_io.TextIOWrapper
>>>>>> name='/Users/jg/.aws/credentials' mode='r' encoding='UTF-8'>
>>>>>>   self.shared_credentials.load_from_path(shared_path)
>>>>>> Setting up security groups...
>>>>>> Creating security group jg_spark2-master
>>>>>> Creating security group jg_spark2-slaves
>>>>>> Searching for existing cluster jg_spark2 in region us-east-1...
>>>>>> Spark AMI: ami-5bb18832
>>>>>> Launching instances...
>>>>>> Launched 1 slave in us-east-1a, regid = r-e7d97944
>>>>>> Launched master in us-east-1a, regid = r-d3d87870
>>>>>> Waiting for AWS to propagate inst

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
Pardon me - there is no tarball for hadoop 2.7

I downloaded
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
and successfully expanded it.

FYI

On Sat, Apr 16, 2016 at 2:52 PM, Jon Gregg <jonrgr...@gmail.com> wrote:

> That link points to hadoop2.6.tgz.  I tried changing the URL to
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz
> and I get a NoSuchKey error.
>
> Should I just go with it even though it says hadoop2.6?
>
> On Sat, Apr 16, 2016 at 5:37 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> BTW this was the original thread:
>>
>> http://search-hadoop.com/m/q3RTt0Oxul0W6Ak
>>
>> The link for spark-1.6.1-bin-hadoop2.7 is
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz
>> <https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz>
>>
>> On Sat, Apr 16, 2016 at 2:14 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> From the output you posted:
>>> ---
>>> Unpacking Spark
>>>
>>> gzip: stdin: not in gzip format
>>> tar: Child returned status 1
>>> tar: Error is not recoverable: exiting now
>>> ---
>>>
>>> The artifact for spark-1.6.1-bin-hadoop2.6 is corrupt.
>>>
>>> This problem has been reported in other threads.
>>>
>>> Try spark-1.6.1-bin-hadoop2.7 - the artifact should be good.
>>>
>>> On Sat, Apr 16, 2016 at 2:09 PM, YaoPau <jonrgr...@gmail.com> wrote:
>>>
>>>> I launched a cluster with: "./spark-ec2 --key-pair my_pem
>>>> --identity-file
>>>> ../../ssh/my_pem.pem launch jg_spark2" and I got the "Spark standalone
>>>> cluster started at
>>>> http://ec2-54-88-249-255.compute-1.amazonaws.com:8080;
>>>> and "Done!" success confirmations at the end.  I confirmed on EC2 that 1
>>>> Master and 1 Slave were both launched and passed their status checks.
>>>>
>>>> But none of the Spark commands seem to work (spark-shell, pyspark,
>>>> etc), and
>>>> port 8080 isn't being used.  The full output from launching the cluster
>>>> is
>>>> below.  Any ideas what the issue is?
>>>>
>>>> >>>>>>>>>>>>>>>>>>>>>>
>>>> launch
>>>>
>>>> jg_spark2/Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/plugin.py:40:
>>>> PendingDeprecationWarning: the imp module is deprecated in favour of
>>>> importlib; see the module's documentation for alternative uses
>>>>   import imp
>>>>
>>>> /Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/provider.py:197:
>>>> ResourceWarning: unclosed file <_io.TextIOWrapper
>>>> name='/Users/jg/.aws/credentials' mode='r' encoding='UTF-8'>
>>>>   self.shared_credentials.load_from_path(shared_path)
>>>> Setting up security groups...
>>>> Creating security group jg_spark2-master
>>>> Creating security group jg_spark2-slaves
>>>> Searching for existing cluster jg_spark2 in region us-east-1...
>>>> Spark AMI: ami-5bb18832
>>>> Launching instances...
>>>> Launched 1 slave in us-east-1a, regid = r-e7d97944
>>>> Launched master in us-east-1a, regid = r-d3d87870
>>>> Waiting for AWS to propagate instance metadata...
>>>> Waiting for cluster to enter 'ssh-ready' state
>>>>
>>>> Warning: SSH connection error. (This could be temporary.)
>>>> Host: ec2-54-88-249-255.compute-1.amazonaws.com
>>>> SSH return code: 255
>>>> SSH output: b'ssh: connect to host
>>>> ec2-54-88-249-255.compute-1.amazonaws.com
>>>> port 22: Connection refused'
>>>>
>>>>
>>>> ./Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/connection.py:190:
>>>> ResourceWarning: unclosed >>> family=AddressFamily.AF_INET,
>>>> type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.66', 55580),
>>>> raddr=('54.239.20.1', 443)>
>>>>   self.queue.pop(0)
>>>>
>>>>
>>>> Warning: SSH connection error. (This could be temporary.)
>>>> Host: ec2-54-88-249-255.compute-1.amazonaws.com
>>>> SSH return code: 255
>>>> SSH output: b'ssh: connect to host
>>>> ec2-54-88-249-255.compute-1.amazonaws.com
>>>> port 22: Connection refused'
>>>

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
BTW this was the original thread:

http://search-hadoop.com/m/q3RTt0Oxul0W6Ak

The link for spark-1.6.1-bin-hadoop2.7 is
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.7.tgz
<https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz>

On Sat, Apr 16, 2016 at 2:14 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> From the output you posted:
> ---
> Unpacking Spark
>
> gzip: stdin: not in gzip format
> tar: Child returned status 1
> tar: Error is not recoverable: exiting now
> ---
>
> The artifact for spark-1.6.1-bin-hadoop2.6 is corrupt.
>
> This problem has been reported in other threads.
>
> Try spark-1.6.1-bin-hadoop2.7 - the artifact should be good.
>
> On Sat, Apr 16, 2016 at 2:09 PM, YaoPau <jonrgr...@gmail.com> wrote:
>
>> I launched a cluster with: "./spark-ec2 --key-pair my_pem --identity-file
>> ../../ssh/my_pem.pem launch jg_spark2" and I got the "Spark standalone
>> cluster started at http://ec2-54-88-249-255.compute-1.amazonaws.com:8080;
>> and "Done!" success confirmations at the end.  I confirmed on EC2 that 1
>> Master and 1 Slave were both launched and passed their status checks.
>>
>> But none of the Spark commands seem to work (spark-shell, pyspark, etc),
>> and
>> port 8080 isn't being used.  The full output from launching the cluster is
>> below.  Any ideas what the issue is?
>>
>> >>>>>>>>>>>>>>>>>>>>>>
>> launch
>>
>> jg_spark2/Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/plugin.py:40:
>> PendingDeprecationWarning: the imp module is deprecated in favour of
>> importlib; see the module's documentation for alternative uses
>>   import imp
>>
>> /Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/provider.py:197:
>> ResourceWarning: unclosed file <_io.TextIOWrapper
>> name='/Users/jg/.aws/credentials' mode='r' encoding='UTF-8'>
>>   self.shared_credentials.load_from_path(shared_path)
>> Setting up security groups...
>> Creating security group jg_spark2-master
>> Creating security group jg_spark2-slaves
>> Searching for existing cluster jg_spark2 in region us-east-1...
>> Spark AMI: ami-5bb18832
>> Launching instances...
>> Launched 1 slave in us-east-1a, regid = r-e7d97944
>> Launched master in us-east-1a, regid = r-d3d87870
>> Waiting for AWS to propagate instance metadata...
>> Waiting for cluster to enter 'ssh-ready' state
>>
>> Warning: SSH connection error. (This could be temporary.)
>> Host: ec2-54-88-249-255.compute-1.amazonaws.com
>> SSH return code: 255
>> SSH output: b'ssh: connect to host
>> ec2-54-88-249-255.compute-1.amazonaws.com
>> port 22: Connection refused'
>>
>>
>> ./Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/connection.py:190:
>> ResourceWarning: unclosed > family=AddressFamily.AF_INET,
>> type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.66', 55580),
>> raddr=('54.239.20.1', 443)>
>>   self.queue.pop(0)
>>
>>
>> Warning: SSH connection error. (This could be temporary.)
>> Host: ec2-54-88-249-255.compute-1.amazonaws.com
>> SSH return code: 255
>> SSH output: b'ssh: connect to host
>> ec2-54-88-249-255.compute-1.amazonaws.com
>> port 22: Connection refused'
>>
>>
>> ./Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/connection.py:190:
>> ResourceWarning: unclosed > family=AddressFamily.AF_INET,
>> type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.66', 55760),
>> raddr=('54.239.26.182', 443)>
>>   self.queue.pop(0)
>>
>>
>> Warning: SSH connection error. (This could be temporary.)
>> Host: ec2-54-88-249-255.compute-1.amazonaws.com
>> SSH return code: 255
>> SSH output: b'ssh: connect to host
>> ec2-54-88-249-255.compute-1.amazonaws.com
>> port 22: Connection refused'
>>
>>
>> ./Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/connection.py:190:
>> ResourceWarning: unclosed > family=AddressFamily.AF_INET,
>> type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.66', 55827),
>> raddr=('54.239.20.1', 443)>
>>   self.queue.pop(0)
>>
>>
>> Warning: SSH connection error. (This could be temporary.)
>> Host: ec2-54-88-249-255.compute-1.amazonaws.com
>> SSH return code: 255
>> SSH output: b'ssh: connect to host
>> ec2-54-88-249-255.compute-1.amazonaws.com
>> port 22: Connection refused'
>>
>>
>> ./Users/jg/dev/spark-1.6.1-bin-h

Re: A number of issues when running spark-ec2

2016-04-16 Thread Ted Yu
From the output you posted:
---
Unpacking Spark

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
---

The artifact for spark-1.6.1-bin-hadoop2.6 is corrupt.

This problem has been reported in other threads.

Try spark-1.6.1-bin-hadoop2.7 - the artifact should be good.

On Sat, Apr 16, 2016 at 2:09 PM, YaoPau  wrote:

> I launched a cluster with: "./spark-ec2 --key-pair my_pem --identity-file
> ../../ssh/my_pem.pem launch jg_spark2" and I got the "Spark standalone
> cluster started at http://ec2-54-88-249-255.compute-1.amazonaws.com:8080;
> and "Done!" success confirmations at the end.  I confirmed on EC2 that 1
> Master and 1 Slave were both launched and passed their status checks.
>
> But none of the Spark commands seem to work (spark-shell, pyspark, etc),
> and
> port 8080 isn't being used.  The full output from launching the cluster is
> below.  Any ideas what the issue is?
>
> >>
> launch
>
> jg_spark2/Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/plugin.py:40:
> PendingDeprecationWarning: the imp module is deprecated in favour of
> importlib; see the module's documentation for alternative uses
>   import imp
>
> /Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/provider.py:197:
> ResourceWarning: unclosed file <_io.TextIOWrapper
> name='/Users/jg/.aws/credentials' mode='r' encoding='UTF-8'>
>   self.shared_credentials.load_from_path(shared_path)
> Setting up security groups...
> Creating security group jg_spark2-master
> Creating security group jg_spark2-slaves
> Searching for existing cluster jg_spark2 in region us-east-1...
> Spark AMI: ami-5bb18832
> Launching instances...
> Launched 1 slave in us-east-1a, regid = r-e7d97944
> Launched master in us-east-1a, regid = r-d3d87870
> Waiting for AWS to propagate instance metadata...
> Waiting for cluster to enter 'ssh-ready' state
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-54-88-249-255.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: b'ssh: connect to host
> ec2-54-88-249-255.compute-1.amazonaws.com
> port 22: Connection refused'
>
>
> ./Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/connection.py:190:
> ResourceWarning: unclosed  family=AddressFamily.AF_INET,
> type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.66', 55580),
> raddr=('54.239.20.1', 443)>
>   self.queue.pop(0)
>
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-54-88-249-255.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: b'ssh: connect to host
> ec2-54-88-249-255.compute-1.amazonaws.com
> port 22: Connection refused'
>
>
> ./Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/connection.py:190:
> ResourceWarning: unclosed  family=AddressFamily.AF_INET,
> type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.66', 55760),
> raddr=('54.239.26.182', 443)>
>   self.queue.pop(0)
>
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-54-88-249-255.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: b'ssh: connect to host
> ec2-54-88-249-255.compute-1.amazonaws.com
> port 22: Connection refused'
>
>
> ./Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/connection.py:190:
> ResourceWarning: unclosed  family=AddressFamily.AF_INET,
> type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.66', 55827),
> raddr=('54.239.20.1', 443)>
>   self.queue.pop(0)
>
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-54-88-249-255.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: b'ssh: connect to host
> ec2-54-88-249-255.compute-1.amazonaws.com
> port 22: Connection refused'
>
>
> ./Users/jg/dev/spark-1.6.1-bin-hadoop2.6/ec2/lib/boto-2.34.0/boto/connection.py:190:
> ResourceWarning: unclosed  family=AddressFamily.AF_INET,
> type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.66', 55925),
> raddr=('207.171.162.181', 443)>
>   self.queue.pop(0)
>
> Cluster is now in 'ssh-ready' state. Waited 612 seconds.
> Generating cluster's SSH key on master...
> Warning: Permanently added
> 'ec2-54-88-249-255.compute-1.amazonaws.com,54.88.249.255' (ECDSA) to the
> list of known hosts.
> Connection to ec2-54-88-249-255.compute-1.amazonaws.com closed.
> Warning: Permanently added
> 'ec2-54-88-249-255.compute-1.amazonaws.com,54.88.249.255' (ECDSA) to the
> list of known hosts.
> Transferring cluster's SSH key to slaves...
> ec2-54-209-124-74.compute-1.amazonaws.com
> Warning: Permanently added
> 'ec2-54-209-124-74.compute-1.amazonaws.com,54.209.124.74' (ECDSA) to the
> list of known hosts.
> Cloning spark-ec2 scripts from
> https://github.com/amplab/spark-ec2/tree/branch-1.5 on master...
> Warning: Permanently added
> 'ec2-54-88-249-255.compute-1.amazonaws.com,54.88.249.255' (ECDSA) to the
> list of known hosts.
> Cloning into 'spark-ec2'...
> remote: Counting objects: 2072, done.
> 

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-16 Thread Ted Yu
Kevin:
Can you describe how you got over the Metadata fetch exception ?

> On Apr 16, 2016, at 9:41 AM, Kevin Eid  wrote:
> 
> One last email to announce that I've fixed all of the issues. Don't hesitate 
> to contact me if you encounter the same. I'd be happy to help.
> 
> Regards,
> Kevin
> 
>> On 14 Apr 2016 12:39 p.m., "Kevin Eid"  wrote:
>> Hi all, 
>> 
>> I managed to copy my .py files from local to the cluster using SCP, and I 
>> managed to run my Spark app on the cluster against a small dataset. 
>> 
>> However, when I iterate over a dataset of 5 GB I get the following: 
>> org.apache.spark.shuffle.MetadataFetchFailedException + please see the 
>> attached screenshots. 
>> 
>> I am deploying 3 * m3.xlarge and using the following parameters while 
>> submitting the app: --executor-memory 50g --driver-memory 20g 
>> --executor-cores 4 --num-executors 3. 
>> 
>> Can you recommend other configurations (driver/executor count and memory), or 
>> do I have to deploy more and larger instances in order to run my app on 5 GB? 
>> Or do I need to add more partitions while reading the file? 
>> 
>> Best, 
>> Kevin
>> 
>>> On 12 April 2016 at 12:19, Sun, Rui  wrote:
>>> Which py file is your main file (primary py file)? Zip the other two py 
>>> files. Leave the main py file alone. Don't copy them to S3 because it seems 
>>> that only local primary and additional py files are supported.
>>> 
>>> ./bin/spark-submit --master spark://... --py-files  
>>> 
>>> -Original Message-
>>> From: kevllino [mailto:kevin.e...@mail.dcu.ie]
>>> Sent: Tuesday, April 12, 2016 5:07 PM
>>> To: user@spark.apache.org
>>> Subject: Run a self-contained Spark app on a Spark standalone cluster
>>> 
>>> Hi,
>>> 
>>> I need to know how to run a self-contained Spark app  (3 python files) in a 
>>> Spark standalone cluster. Can I move the .py files to the cluster, or 
>>> should I store them locally, on HDFS or S3? I tried the following locally 
>>> and on S3 with a zip of my .py files as suggested  here 
>>>   :
>>> 
>>> ./bin/spark-submit --master
>>> spark://ec2-54-51-23-172.eu-west-1.compute.amazonaws.com:5080--py-files
>>> s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@mubucket//weather_predict.zip
>>> 
>>> But get: “Error: Must specify a primary resource (JAR or Python file)”
>>> 
>>> Best,
>>> Kevin
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Run-a-self-contained-Spark-app-on-a-Spark-standalone-cluster-tp26753.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
>> 
>> 
>> -- 
>> Kevin EID 
>> M.Sc. in Computing, Data Analytics
>> 
>> 
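As an aside, dependencies can also be shipped programmatically from the driver
instead of (or in addition to) --py-files. A minimal hypothetical sketch follows;
the zip name and its contents are placeholders, not files from this thread.

    from pyspark import SparkContext

    sc = SparkContext(appName="self-contained-app")

    # Ship the helper modules (zipped) to every executor; after this, code
    # running in tasks can import modules contained in the zip.
    sc.addPyFile("/home/ubuntu/helpers.zip")   # hypothetical zip of the other two modules

    sc.stop()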


Re: Apache Flink

2016-04-16 Thread Ted Yu
Looks like this question is more relevant on flink mailing list :-)

On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh 
wrote:

> Hi,
>
> Has anyone used Apache Flink instead of Spark, by any chance?
>
> I am interested in its set of libraries for Complex Event Processing.
>
> Frankly I don't know if it offers far more than Spark offers.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: ERROR [main] client.ConnectionManager$HConnectionImplementation: The node /hbase is not in ZooKeeper.

2016-04-16 Thread Ted Yu
Please send this query to user@hbase.

This is the default value:

  <property>
    <name>zookeeper.znode.parent</name>
    <value>/hbase</value>
  </property>

It looks like the hbase-site.xml accessible to your client didn't have an
up-to-date value for zookeeper.znode.parent.

Please make sure an hbase-site.xml with the proper config is on the classpath.

On Sat, Apr 16, 2016 at 2:31 AM, Eric Gao  wrote:

> Dear expert,
>   I have encountered a problem: when I run the hbase command "status", it shows:
>
> hbase(main):001:0> status
> 2016-04-16 13:03:02,333 ERROR [main]
> client.ConnectionManager$HConnectionImplementation: The node /hbase is not
> in ZooKeeper. It should have been written by the master. Check the value
> configured in 'zookeeper.znode.parent'. There could be a mismatch with the
> one configured in the master.
> 2016-04-16 13:03:02,538 ERROR [main]
> client.ConnectionManager$HConnectionImplementation: The node /hbase is not
> in ZooKeeper. It should have been written by the master. Check the value
> configured in 'zookeeper.znode.parent'. There could be a mismatch with the
> one configured in the master.
> 2016-04-16 13:03:02,843 ERROR [main]
> client.ConnectionManager$HConnectionImplementation: The node /hbase is not
> in ZooKeeper. It should have been written by the master. Check the value
> configured in 'zookeeper.znode.parent'. There could be a mismatch with the
> one configured in the master.
> 2016-04-16 13:03:03,348 ERROR [main]
> client.ConnectionManager$HConnectionImplementation: The node /hbase is not
> in ZooKeeper. It should have been written by the master. Check the value
> configured in 'zookeeper.znode.parent'. There could be a mismatch with the
> one configured in the master.
> 2016-04-16 13:03:04,355 ERROR [main]
> client.ConnectionManager$HConnectionImplementation: The node /hbase is not
> in ZooKeeper. It should have been written by the master. Check the value
> configured in 'zookeeper.znode.parent'. There could be a mismatch with the
> one configured in the master.
> 2016-04-16 13:03:06,369 ERROR [main]
> client.ConnectionManager$HConnectionImplementation: The node /hbase is not
> in ZooKeeper. It should have been written by the master. Check the value
> configured in 'zookeeper.znode.parent'. There could be a mismatch with the
> one configured in the master.
> 2016-04-16 13:03:10,414 ERROR [main]
> client.ConnectionManager$HConnectionImplementation: The node /hbase is not
> in ZooKeeper. It should have been written by the master. Check the value
> configured in 'zookeeper.znode.parent'. There could be a mismatch with the
> one configured in the master.
>
> How can I solve the problem?
> Thanks very much
>
>
>
>
>
> Eric Gao
> Keep on going never give up.
> Blog:
> http://gaoqiang.blog.chinaunix.net/
> http://gaoqiangdba.blog.163.com/
>
>
>


Re: Logging in executors

2016-04-15 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtsFrd61q291j1

On Fri, Apr 15, 2016 at 5:38 AM, Carlos Rojas Matas 
wrote:

> Hi guys,
>
> any clue on this? Clearly the
> spark.executor.extraJavaOpts=-Dlog4j.configuration is not working on the
> executors.
>
> Thanks,
> -carlos.
>
> On Wed, Apr 13, 2016 at 2:48 PM, Carlos Rojas Matas 
> wrote:
>
>> Hi Yong,
>>
>> thanks for your response. As I said in my first email, I've tried both
>> the reference to the classpath resource (env/dev/log4j-executor.properties)
>> as the file:// protocol. Also, the driver logging is working fine and I'm
>> using the same kind of reference.
>>
>> Below the content of my classpath:
>>
>> [image: Inline image 1]
>>
>> Plus this is the content of the exploded fat jar assembled with sbt
>> assembly plugin:
>>
>> [image: Inline image 2]
>>
>>
>> This folder is at the root level of the classpath.
>>
>> Thanks,
>> -carlos.
>>
>> On Wed, Apr 13, 2016 at 2:35 PM, Yong Zhang  wrote:
>>
>>> Is the env/dev/log4j-executor.properties file within your jar file? Is
>>> the path matching with what you specified as
>>> env/dev/log4j-executor.properties?
>>>
>>> If you read the log4j document here:
>>> https://logging.apache.org/log4j/1.2/manual.html
>>>
>>> When you specify the log4j.configuration=my_custom.properties, you have
>>> 2 option:
>>>
>>> 1) the my_custom.properties has to be in the jar (or in the classpath).
>>> In your case, since you specify the package path, you need to make sure
>>> they are matched in your jar file
>>> 2) use like log4j.configuration=file:///tmp/my_custom.properties. In
>>> this way, you need to make sure file my_custom.properties exists in /tmp
>>> folder on ALL of your worker nodes.
>>>
>>> Yong
>>>
>>> --
>>> Date: Wed, 13 Apr 2016 14:18:24 -0300
>>> Subject: Re: Logging in executors
>>> From: cma...@despegar.com
>>> To: yuzhih...@gmail.com
>>> CC: user@spark.apache.org
>>>
>>>
>>> Thanks for your response Ted. You're right, there was a typo. I changed
>>> it, now I'm executing:
>>>
>>> bin/spark-submit --master spark://localhost:7077 --conf
>>> "spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
>>> --conf
>>> "spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-executor.properties"
>>> --class
>>>
>>> The content of this file is:
>>>
>>> # Set everything to be logged to the console
>>> log4j.rootCategory=INFO, FILE
>>> log4j.appender.console=org.apache.log4j.ConsoleAppender
>>> log4j.appender.console.target=System.err
>>> log4j.appender.console.layout=org.apache.log4j.PatternLayout
>>> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
>>> %c{1}: %m%n
>>>
>>> log4j.appender.FILE=org.apache.log4j.RollingFileAppender
>>> log4j.appender.FILE.File=/tmp/executor.log
>>> log4j.appender.FILE.ImmediateFlush=true
>>> log4j.appender.FILE.Threshold=debug
>>> log4j.appender.FILE.Append=true
>>> log4j.appender.FILE.MaxFileSize=100MB
>>> log4j.appender.FILE.MaxBackupIndex=5
>>> log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
>>> log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
>>> %c{1}: %m%n
>>>
>>> # Settings to quiet third party logs that are too verbose
>>> log4j.logger.org.spark-project.jetty=WARN
>>>
>>> log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
>>> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
>>> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
>>> log4j.logger.org.apache.parquet=ERROR
>>> log4j.logger.parquet=ERROR
>>> log4j.logger.com.despegar.p13n=DEBUG
>>>
>>> # SPARK-9183: Settings to avoid annoying messages when looking up
>>> nonexistent UDFs in SparkSQL with Hive support
>>> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
>>> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
>>>
>>>
>>> Finally, the code on which I'm using logging in the executor is:
>>>
>>> def groupAndCount(keys: DStream[(String, List[String])])(handler: 
>>> ResultHandler) = {
>>>
>>>   val result = keys.reduceByKey((prior, current) => {
>>> (prior ::: current)
>>>   }).flatMap {
>>> case (date, keys) =>
>>>   val rs = keys.groupBy(x => x).map(
>>>   obs =>{
>>> val (d,t) = date.split("@") match {
>>>   case Array(d,t) => (d,t)
>>> }
>>> import org.apache.log4j.Logger
>>> import scala.collection.JavaConverters._
>>> val logger: Logger = Logger.getRootLogger
>>> logger.info(s"Metric retrieved $d")
>>> Metric("PV", d, obs._1, t, obs._2.size)
>>> }
>>>   )
>>>   rs
>>>   }
>>>
>>>   result.foreachRDD((rdd: RDD[Metric], time: Time) => {
>>> handler(rdd, time)
>>>   })
>>>
>>> }
>>>
>>>
>>> Originally the import and logger object was outside the map function.
>>> I'm also using the root logger 

Re: How to stop hivecontext

2016-04-15 Thread Ted Yu
You can call the stop() method. 

> On Apr 15, 2016, at 5:21 AM, ram kumar  wrote:
> 
> Hi,
> I started hivecontext as,
> 
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
> 
> I want to stop this sql context
> 
> Thanks

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
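For reference, a minimal PySpark sketch of the shutdown path (the thread itself is
Scala, but the idea is the same): in Spark 1.x the Hive-enabled SQL context is built
on top of a SparkContext, and stopping that SparkContext shuts everything down.
This is an illustrative sketch, not a quote from the thread.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hive-context-demo")
    sqlContext = HiveContext(sc)

    sqlContext.sql("SHOW TABLES").show()   # example query

    # Stopping the underlying SparkContext tears the Hive context down with it.
    sc.stop()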



Re: When did Spark started supporting ORC and Parquet?

2016-04-14 Thread Ted Yu
For Parquet, please take a look at SPARK-1251

For ORC, not sure.
Looking at git history, I found ORC mentioned by SPARK-1368

FYI

On Thu, Apr 14, 2016 at 6:53 PM, Edmon Begoli  wrote:

> I need this fact for the research paper I am writing right now.
>
> When did Spark start supporting Parquet and when ORC?
> (what release)
>
> I appreciate any info you can offer.
>
> Thank you,
> Edmon
>


Re: Error with --files

2016-04-14 Thread Ted Yu
bq. localtest.txt#appSees.txt

Which file did you want to pass ?

Thanks

On Thu, Apr 14, 2016 at 2:14 PM, Benjamin Zaitlen 
wrote:

> Hi All,
>
> I'm trying to use the --files option with yarn:
>
> spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files
>> /home/ubuntu/localtest.txt#appSees.txt
>
>
> I never see the file in HDFS or in the YARN containers.  Am I doing
> something incorrectly?
>
> I'm running spark 1.6.0
>
>
> Thanks,
> --Ben
>
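As a related aside (not a resolution of the issue above), files distributed through
--files or SparkContext.addFile() are typically read from task code via SparkFiles
rather than looked up at an HDFS path. A small hedged sketch, with the path as a
placeholder:

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="files-demo")
    sc.addFile("/home/ubuntu/localtest.txt")   # programmatic counterpart of --files

    def read_first_line(_):
        # SparkFiles.get() resolves the local copy on whichever node the task runs.
        with open(SparkFiles.get("localtest.txt")) as f:
            return [f.readline().rstrip()]

    print(sc.parallelize([0], 1).flatMap(read_first_line).collect())
    sc.stop()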


Re: Spark Yarn closing sparkContext

2016-04-14 Thread Ted Yu
Can you pastebin the failure message ?

Did you happen to take jstack during the close ?

Which Hadoop version do you use ?

Thanks 

> On Apr 14, 2016, at 5:53 AM, nihed mbarek  wrote:
> 
> Hi, 
> I have an issue with closing my application context: the process takes a long 
> time and fails at the end. On the other hand, my result was generated in the 
> output folder and the _SUCCESS file was created. 
> I'm using Spark 1.6 with YARN. 
> 
> any idea ? 
> 
> regards, 
> 
> -- 
> 
> MBAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
> 
> 
> 


Re: Streaming WriteAheadLogBasedBlockHandler disallows parellism via StorageLevel replication factor

2016-04-13 Thread Ted Yu
w.r.t. the effective storage level log, here is the JIRA which introduced
it:

[SPARK-4671][Streaming]Do not replicate streaming block when WAL is enabled

On Wed, Apr 13, 2016 at 7:43 AM, Patrick McGloin 
wrote:

> Hi all,
>
> If I am using a Custom Receiver with Storage Level set to StorageLevel.
> MEMORY_ONLY_SER_2 and the WAL enabled I get this Warning in the logs:
>
> 16/04/13 14:03:15 WARN WriteAheadLogBasedBlockHandler: Storage level 
> replication 2 is unnecessary when write ahead log is enabled, change to 
> replication 1
> 16/04/13 14:03:15 WARN WriteAheadLogBasedBlockHandler: User defined storage 
> level StorageLevel(false, true, false, false, 2) is changed to effective 
> storage level StorageLevel(false, true, false, false, 1) when write ahead log 
> is enabled
>
>
> My application is running on 4 Executors with 4 cores each, and 1
> Receiver.  Because the data is not replicated the processing runs on only
> one Executor:
>
> [image: Inline images 1]
>
> Instead of 16 cores processing the Streaming data only 4 are being used.
>
> We cannot repartition the DStream to distribute data to more Executors
> since if you call repartition on an RDD which is only located on one node,
> the new partitions are only created on that node, which doesn't help.  This
> theory that repartitioning doesn't help can be tested with this simple
> example, which tries to go from one partition on a single node to many on
> many nodes.  What you find when you look at the multiplePartitions RDD
> in the UI is that its 6 partitions are on the same Executor.
>
> scala> val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 4).cache.setName("rdd")
> rdd: org.apache.spark.rdd.RDD[Int] = rdd ParallelCollectionRDD[0] at 
> parallelize at :27
>
> scala> rdd.count()
> res0: Long = 6
>
> scala> val singlePartition = 
> rdd.repartition(1).cache.setName("singlePartition")
> singlePartition: org.apache.spark.rdd.RDD[Int] = singlePartition 
> MapPartitionsRDD[4] at repartition at :29
>
> scala> singlePartition.count()
> res1: Long = 6
>
> scala> val multiplePartitions = 
> singlePartition.repartition(6).cache.setName("multiplePartitions")
> multiplePartitions: org.apache.spark.rdd.RDD[Int] = multiplePartitions 
> MapPartitionsRDD[8] at repartition at :31
>
> scala> multiplePartitions.count()
> res2: Long = 6
>
> Am I correct in the use of repartition, that the data does not get shuffled if 
> it is all on one Executor?
>
> Shouldn't I be allowed to set the Receiver replication factor to two when the 
> WAL is enabled so that multiple Executors can work on the Streaming input 
> data?
>
> We will look into creating 4 Receivers so that the data gets distributed
> more evenly.  But won't that "waste" 4 cores in our example, where one
> would do?
>
> Best regards,
> Patrick
>
>
>
>
>
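For reference, a hedged PySpark sketch of the "several receivers, then union" idea
mentioned at the end of the message above. It uses socket text streams purely for
illustration, since the custom Scala receiver in this thread has no direct PySpark
equivalent; the host, ports, batch interval, and partition count are placeholders.

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 2)   # assumes an existing SparkContext `sc`; 2 s batches

    # One input DStream (and hence one receiver core) per source, then union them
    # so the processing that follows can spread across all executors.
    streams = [ssc.socketTextStream("localhost", 9990 + i) for i in range(4)]
    unioned = ssc.union(*streams).repartition(16)

    unioned.count().pprint()
    ssc.start()
    ssc.awaitTermination()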


Re: Logging in executors

2016-04-13 Thread Ted Yu
bq. --conf "spark.executor.extraJavaOptions=-Dlog4j.
configuration=env/dev/log4j-driver.properties"

I think the above may have a typo : you refer to log4j-driver.properties in
both arguments.

FYI

On Wed, Apr 13, 2016 at 8:09 AM, Carlos Rojas Matas 
wrote:

> Hi guys,
>
> I'm trying to enable logging in the executors but with no luck.
>
> According to the official documentation and several blogs, this should be
> done by passing
> "spark.executor.extraJavaOpts=-Dlog4j.configuration=[my-file]" to the
> spark-submit tool. I've tried both sending a reference to a classpath
> resource and using the "file:" protocol, but nothing happens. Of course in
> the latter case, I've used the --files option on the command line, although
> it is not clear where this file is uploaded on the worker machine.
>
> However, I was able to make it work by setting the properties in the
> spark-defaults.conf file, pointing to each one of the configurations on the
> machine. This approach has a big drawback though: if I change something in
> the log4j configuration I need to change it on every machine (and I'm not
> sure if restarting is required), which is not what I'm looking for.
>
> The complete command I'm using is as follows:
>
> bin/spark-submit --master spark://localhost:7077 --conf
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
> --conf
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties"
> --class [my-main-class] [my-jar].jar
>
>
> Both files are in the classpath and are reachable -- already tested with
> the driver.
>
> Any comments will be welcomed.
>
> Thanks in advance.
> -carlos.
>
>


Re: Old hostname pops up while running Spark app

2016-04-12 Thread Ted Yu
FYI

https://documentation.cpanel.net/display/CKB/How+To+Clear+Your+DNS+Cache#HowToClearYourDNSCache-MacOS
®10.10
https://www.whatsmydns.net/flush-dns.html#linux

On Tue, Apr 12, 2016 at 2:44 PM, Bibudh Lahiri 
wrote:

> Hi,
>
> I am trying to run a piece of code with logistic regression on
> PySpark. I've run it successfully on my laptop, and I have run it
> previously in standalone cluster mode, but the name of the server on
> which I am running it was changed in between (the old name was
> "IMPETUS-1466") by the admin. Now, when I try to run it, it
> throws the following error:
>
> File
> "/home/impadmin/Nikunj/spark-1.6.0/python/lib/pyspark.zip/pyspark/sql/utils.py",
> line 53, in deco
>
> raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
>
> pyspark.sql.utils.IllegalArgumentException:
> u'java.net.UnknownHostException: IMPETUS-1466.
>
>I have changed a few configuration files, and /etc/hosts, and
> regenerated the SSH keys, updated the files .ssh/known_hosts and 
> .ssh/authorized_keys,
> but still this is not getting resolved. Can someone please point out where
> this name is being picked up from?
>
> --
> Bibudh Lahiri
> Data Scientist, Impetus Technolgoies
> 5300 Stevens Creek Blvd
> San Jose, CA 95129
> http://knowthynumbers.blogspot.com/
>
>


Re: JavaRDD with custom class?

2016-04-12 Thread Ted Yu
You can find various examples involving Serializable Java POJO
e.g.
./examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java

Please pastebin some details on 'Task not serializable error'

Thanks

On Tue, Apr 12, 2016 at 12:44 PM, Daniel Valdivia 
wrote:

> Hi,
>
> I'm moving some code from Scala to Java and I just hit a wall: I'm
> trying to move an RDD with a custom data structure to Java, but I'm not
> able to do so:
>
> Scala Code:
>
> case class IncodentDoc(system_id: String, category: String, terms:
> Seq[String])
> var incTup = inc_filtered.map(record => {
>  //some logic
>   TermDoc(sys_id, category, termsSeq)
> })
>
> On Java I'm trying:
>
> class TermDoc implements Serializable  {
> public String system_id;
> public String category;
> public String[] terms;
>
> public TermDoc(String system_id, String category, String[] terms) {
> this.system_id = system_id;
> this.category = category;
> this.terms = terms;
> }
> }
>
> JavaRDD<TermDoc> incTup = inc_filtered.map(record -> {
> //some code
> return new TermDoc(sys_id, category, termsArr);
> });
>
>
> When I run my code, I get hit with a Task not serializable error, what am
> I missing so I can use custom classes inside the RDD just like in scala?
>
> Cheers
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [spark] build/sbt gen-idea error

2016-04-12 Thread Ted Yu
See
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup

On Tue, Apr 12, 2016 at 8:52 AM, ImMr.K <875061...@qq.com> wrote:

> But how do I import the spark repo into IDEA or Eclipse?
>
>
>
> -- Original Message ----------
> *From:* Ted Yu <yuzhih...@gmail.com>
> *Sent:* April 12, 2016, 23:38
> *To:* ImMr.K <875061...@qq.com>
> *Cc:* user <user@spark.apache.org>
> *Subject:* Re: build/sbt gen-idea error
>
> gen-idea doesn't seem to be a valid command:
>
> [warn] Ignoring load failure: no project loaded.
> [error] Not a valid command: gen-idea
> [error] gen-idea
>
> On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K <875061...@qq.com> wrote:
>
>> Hi,
>> I have cloned spark and ,
>> cd spark
>> build/sbt gen-idea
>>
>> got the following output:
>>
>>
>> Using /usr/java/jre1.7.0_09 as default JAVA_HOME.
>> Note, this will be overridden by -java-home if it is set.
>> [info] Loading project definition from
>> /home/king/github/spark/project/project
>> [info] Loading project definition from
>> /home/king/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
>> [warn] Multiple resolvers having different access mechanism configured
>> with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate
>> project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
>> [info] Loading project definition from /home/king/github/spark/project
>> org.apache.maven.model.building.ModelBuildingException: 1 problem was
>> encountered while building the effective model for
>> org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT
>> [FATAL] Non-resolvable parent POM: Could not transfer artifact
>> org.apache:apache:pom:14 from/to central (
>> http://repo.maven.apache.org/maven2): Error transferring file:
>> Connection timed out from
>> http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
>> and 'parent.relativePath' points at wrong local POM @ line 22, column 11
>>
>> at
>> org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
>> at
>> org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
>> at
>> org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
>> at
>> org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
>> at
>> org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
>> at
>> com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
>> at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
>> at
>> com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
>> at
>> com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
>> at
>> com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
>> at SparkBuild$.projectDefinitions(SparkBuild.scala:347)
>> at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:506)
>> at sbt.Load$$anonfun$27.apply(Load.scala:446)
>> at sbt.Load$$anonfun$27.apply(Load.scala:446)
>> at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
>> at sbt.Load$.loadUnit(Load.scala:446)
>> at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
>> at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
>> at
>> sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:91)
>> at
>> sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:90)
>> at sbt.BuildLoader.apply(BuildLoader.scala:140)
>> at sbt.Load$.loadAll(Load.scala:344)
>> at sbt.Load$.loadURI(Load.scala:299)
>> at sbt.Load$.load(Load.scala:295)
>> at sbt.Load$.load(Load.scala:286)
>> at sbt.Load$.apply(Load.scala:140)
>> at sbt.Load$.defaultLoad(Load.scala:36)
>> at sbt.BuiltinCommands$.liftedTree1$1(Main.scala:492)
>> at sbt.BuiltinCommands$.doLoadProject(Main.scala:492)
>> at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
>> at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
>> at
>> sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
>> at
>> sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
>> at
>> sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.sca

Re: build/sbt gen-idea error

2016-04-12 Thread Ted Yu
gen-idea doesn't seem to be a valid command:

[warn] Ignoring load failure: no project loaded.
[error] Not a valid command: gen-idea
[error] gen-idea

On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K <875061...@qq.com> wrote:

> Hi,
> I have cloned spark and ,
> cd spark
> build/sbt gen-idea
>
> got the following output:
>
>
> Using /usr/java/jre1.7.0_09 as default JAVA_HOME.
> Note, this will be overridden by -java-home if it is set.
> [info] Loading project definition from
> /home/king/github/spark/project/project
> [info] Loading project definition from
> /home/king/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
> [warn] Multiple resolvers having different access mechanism configured
> with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate
> project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
> [info] Loading project definition from /home/king/github/spark/project
> org.apache.maven.model.building.ModelBuildingException: 1 problem was
> encountered while building the effective model for
> org.apache.spark:spark-parent_2.11:2.0.0-SNAPSHOT
> [FATAL] Non-resolvable parent POM: Could not transfer artifact
> org.apache:apache:pom:14 from/to central (
> http://repo.maven.apache.org/maven2): Error transferring file: Connection
> timed out from
> http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
> and 'parent.relativePath' points at wrong local POM @ line 22, column 11
>
> at
> org.apache.maven.model.building.DefaultModelProblemCollector.newModelBuildingException(DefaultModelProblemCollector.java:195)
> at
> org.apache.maven.model.building.DefaultModelBuilder.readParentExternally(DefaultModelBuilder.java:841)
> at
> org.apache.maven.model.building.DefaultModelBuilder.readParent(DefaultModelBuilder.java:664)
> at
> org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:310)
> at
> org.apache.maven.model.building.DefaultModelBuilder.build(DefaultModelBuilder.java:232)
> at
> com.typesafe.sbt.pom.MvnPomResolver.loadEffectivePom(MavenPomResolver.scala:61)
> at com.typesafe.sbt.pom.package$.loadEffectivePom(package.scala:41)
> at
> com.typesafe.sbt.pom.MavenProjectHelper$.makeProjectTree(MavenProjectHelper.scala:128)
> at
> com.typesafe.sbt.pom.MavenProjectHelper$.makeReactorProject(MavenProjectHelper.scala:49)
> at
> com.typesafe.sbt.pom.PomBuild$class.projectDefinitions(PomBuild.scala:28)
> at SparkBuild$.projectDefinitions(SparkBuild.scala:347)
> at sbt.Load$.sbt$Load$$projectsFromBuild(Load.scala:506)
> at sbt.Load$$anonfun$27.apply(Load.scala:446)
> at sbt.Load$$anonfun$27.apply(Load.scala:446)
> at scala.collection.immutable.Stream.flatMap(Stream.scala:442)
> at sbt.Load$.loadUnit(Load.scala:446)
> at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
> at sbt.Load$$anonfun$18$$anonfun$apply$11.apply(Load.scala:291)
> at
> sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:91)
> at
> sbt.BuildLoader$$anonfun$componentLoader$1$$anonfun$apply$4$$anonfun$apply$5$$anonfun$apply$6.apply(BuildLoader.scala:90)
> at sbt.BuildLoader.apply(BuildLoader.scala:140)
> at sbt.Load$.loadAll(Load.scala:344)
> at sbt.Load$.loadURI(Load.scala:299)
> at sbt.Load$.load(Load.scala:295)
> at sbt.Load$.load(Load.scala:286)
> at sbt.Load$.apply(Load.scala:140)
> at sbt.Load$.defaultLoad(Load.scala:36)
> at sbt.BuiltinCommands$.liftedTree1$1(Main.scala:492)
> at sbt.BuiltinCommands$.doLoadProject(Main.scala:492)
> at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
> at sbt.BuiltinCommands$$anonfun$loadProjectImpl$2.apply(Main.scala:484)
> at
> sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
> at
> sbt.Command$$anonfun$applyEffect$1$$anonfun$apply$2.apply(Command.scala:59)
> at
> sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
> at
> sbt.Command$$anonfun$applyEffect$2$$anonfun$apply$3.apply(Command.scala:61)
> at sbt.Command$.process(Command.scala:93)
> at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
> at sbt.MainLoop$$anonfun$1$$anonfun$apply$1.apply(MainLoop.scala:96)
> at sbt.State$$anon$1.process(State.scala:184)
> at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
> at sbt.MainLoop$$anonfun$1.apply(MainLoop.scala:96)
> at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
> at sbt.MainLoop$.next(MainLoop.scala:96)
> at sbt.MainLoop$.run(MainLoop.scala:89)
> at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:68)
> at sbt.MainLoop$$anonfun$runWithNewLog$1.apply(MainLoop.scala:63)
> at sbt.Using.apply(Using.scala:24)
> at sbt.MainLoop$.runWithNewLog(MainLoop.scala:63)
> at sbt.MainLoop$.runAndClearLast(MainLoop.scala:46)
> at sbt.MainLoop$.runLoggedLoop(MainLoop.scala:30)
> at sbt.MainLoop$.runLogged(MainLoop.scala:22)
> at sbt.StandardMain$.runManaged(Main.scala:54)
> at sbt.xMain.run(Main.scala:29)
> at 

Re: Is storage resources counted during the scheduling

2016-04-11 Thread Ted Yu
See
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

On Mon, Apr 11, 2016 at 3:15 PM, Jialin Liu  wrote:

> Hi Spark users/experts,
>
> I’m wondering how does the Spark scheduler work?
> What kind of resources will be considered during the scheduling, does it
> include the disk resources or I/O resources, e.g., number of IO ports.
> Is network resources considered in that?
>
> My understanding is that only CPU is considered, right?
>
> Best,
> Jialin
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Read JSON in Dataframe and query

2016-04-11 Thread Ted Yu
Please take a look
at 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala

Cheers

On Mon, Apr 11, 2016 at 12:13 PM, Radhakrishnan Iyer <
radhakrishnan.i...@citiustech.com> wrote:

> Hi all,
>
>
>
> I am new to Spark.
>
> I have a JSON in the format below:
>
> Employee : {"id" : "001", "name" : "abc", "address" : [ {"state" :
> "maharashtra", "country" : "india" } , {"state" : "gujarat", "country" :
> "india" }]}
>
>
>
> I want to parse it in such a way that I can query the address array.
>
> For e.g. get the list of states where "country" = "india" for id=001.
>
>
>
> Can you please help me with a query for this scenario?
>
>
>
>
>
> Best Regards,
>
> Radhakrishnan Iyer
>
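For reference, a hedged PySpark 1.x sketch of one way to run that kind of query,
assuming the employee records are stored one JSON object per line in a file. The
file name and the use of explode() on the address array are illustrative
assumptions, not taken from the thread.

    from pyspark.sql import SQLContext
    from pyspark.sql.functions import explode

    sqlContext = SQLContext(sc)          # assumes an existing SparkContext `sc`
    df = sqlContext.read.json("employees.json")

    # One row per (employee, address) pair, then filter and project the state.
    states = (df.select(df["id"], explode(df["address"]).alias("addr"))
                .filter("id = '001' AND addr.country = 'india'")
                .select("addr.state"))
    states.show()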


Re: Hello !

2016-04-11 Thread Ted Yu
For SparkR, please refer to https://spark.apache.org/docs/latest/sparkr.html

bq. on Ubuntu or CentOS

Both platforms are supported.

On Mon, Apr 11, 2016 at 1:08 PM,  wrote:

> Dear Experts ,
>
> I am posting this for your information. I am a newbie to spark.
> I am interested in understanding Spark at the internal level.
>
> I need your opinion: which Unix flavor should I install Spark on, Ubuntu or
> CentOS? I have had enough trouble with the Windows version (1.6.1 with
> Hadoop 2.6 pre-built binaries; it keeps giving me exceptions).
>
> I have worked with R on Windows to date. Is there an R for Unix? I have
> not googled this either. Sorry about that. I just want to make sure SparkR has
> a smooth run.
>
> Thanks in advance.
> Harry
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Weird error while serialization

2016-04-10 Thread Ted Yu
Have you considered using PairRDDFunctions.aggregateByKey
or PairRDDFunctions.reduceByKey in place of the groupBy to achieve better
performance ?

Cheers

On Sat, Apr 9, 2016 at 2:00 PM, SURAJ SHETH <shet...@gmail.com> wrote:

> Hi,
> I am using Spark 1.5.2
>
> The file contains 900K rows each with twelve fields (tab separated):
> The first 11 fields are Strings with a maximum of 20 chars each. The last
> field is a comma separated array of floats with 8,192 values.
>
> It works perfectly if I change the below code for groupBy from
> "x[0].split('\t')[1]" to "x[0]".
> The reason seems to be a limit on the number of values for a
> particular key in groupBy. In the code below, I am expecting 500 keys, with
> tens of thousands of values for a few of them. The largest key-value
> pair (from groupByKey) has 53K values, each holding a numpy array of 8,192
> floats.
> In the changed version, i.e. "groupBy(lambda x : x[0]).mapValues(", we get
> 900K keys and one value for each of them which works flawlessly.
>
> Do we have any limit on the amount of data we get for a key in groupBy?
>
> The total file size is 16 GB.
>
> The snippet is :
>
> import hashlib,re, numpy as np
>
> def getRows(z):
> return np.asfortranarray([float(g) for g in z.split(',')])
>
> text1 = sc.textFile('/textFile.txt',480).filter(lambda x : len(x)>1000)\
> .map(lambda x : x.rsplit('\t',1)).map(lambda x :
> [x[0],getRows(x[1])]).cache()\
> .groupBy(lambda x : x[0].split('\t')[1]).mapValues(lambda x :
> list(x)).cache()
>
> text1.count()
>
> Thanks and Regards,
> Suraj Sheth
>
> On Sun, Apr 10, 2016 at 1:19 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> The value was out of the range of integer.
>>
>> Which Spark release are you using ?
>>
>> Can you post snippet of code which can reproduce the error ?
>>
>> Thanks
>>
>> On Sat, Apr 9, 2016 at 12:25 PM, SURAJ SHETH <shet...@gmail.com> wrote:
>>
>>> I am trying to perform some processing and cache and count the RDD.
>>> Any solutions?
>>>
>>> Seeing a weird error :
>>>
>>> File 
>>> "/mnt/yarn/usercache/hadoop/appcache/application_1456909219314_0014/container_1456909219314_0014_01_04/pyspark.zip/pyspark/serializers.py",
>>>  line 550, in write_int
>>> stream.write(struct.pack("!i", value))
>>> error: 'i' format requires -2147483648 <= number <= 2147483647
>>>
>>> at 
>>> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>>> at 
>>> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>>> at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>>> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>>>
>>>
>>> Thanks and Regards,
>>>
>>> Suraj Sheth
>>>
>>>
>>
>


Re: Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-10 Thread Ted Yu
Jasmine:
Let us know whether listening to more events gives you a better picture.

Thanks

On Thu, Apr 7, 2016 at 1:54 PM, Jasmine George <j.geo...@samsung.com> wrote:

> Hi Ted,
>
>
>
> Thanks for replying so fast.
>
>
>
> We are using spark 1.5.2.
>
> I was collecting only TaskEnd Events.
>
> I can do the event-wise summation for a couple of runs and get back to you.
>
>
>
> Thanks,
>
> Jasmine
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* Thursday, April 07, 2016 1:43 PM
> *To:* JasmineGeorge
> *Cc:* user
> *Subject:* Re: Only 60% of Total Spark Batch Application execution time
> spent in Task Processing
>
>
>
> Which Spark release are you using ?
>
>
>
> Have you registered to all the events provided by SparkListener ?
>
>
>
> If so, can you do event-wise summation of execution time ?
>
>
>
> Thanks
>
>
>
> On Thu, Apr 7, 2016 at 11:03 AM, JasmineGeorge <j.geo...@samsung.com>
> wrote:
>
> We are running a batch job with the following specifications
> •   Building RandomForest with config : maxbins=100, depth=19, num of
> trees =
> 20
> •   Multiple runs with different input data size 2.8 GB, 10 Million
> records
> •   We are running spark application on Yarn in cluster mode, with 3
> Node
> Managers(each with 16 virtual cores and 96G RAM)
> •   Spark config :
> o   spark.driver.cores = 2
> o   spark.driver.memory = 32 G
> o   spark.executor.instances = 5  and spark.executor.cores = 8 so 40
> cores in
> total.
> o   spark.executor.memory= 32G so total executor memory around 160 G.
>
> We are collecting execution times for the tasks using a SparkListener, and
> also the total execution time for the application from the Spark Web UI.
> Across all the tests we saw consistently that the sum total of the execution
> times of all the tasks accounts for about 60% of the total application
> run time.
> We are just wondering where the remaining 40% of the time is being spent.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Only-60-of-Total-Spark-Batch-Application-execution-time-spent-in-Task-Processing-tp26703.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>


Re: Datasets combineByKey

2016-04-10 Thread Ted Yu
Haven't found any JIRA w.r.t. combineByKey for Dataset.

What's your use case ?

Thanks

On Sat, Apr 9, 2016 at 7:38 PM, Amit Sela  wrote:

> Is there (planned ?) a combineByKey support for Dataset ?
> Is / Will there be a support for combiner lifting ?
>
> Thanks,
> Amit
>


Re: Graphframes pattern causing java heap space errors

2016-04-10 Thread Ted Yu
Looks like the exception occurred on driver.

Consider increasing the values for the following config:

conf.set("spark.driver.memory", "10240m")
conf.set("spark.driver.maxResultSize", "2g")

Cheers

On Sat, Apr 9, 2016 at 9:02 PM, Buntu Dev  wrote:

> I'm running it via pyspark against yarn in client deploy mode. I do notice
> in the spark web ui under Environment tab all the options I've set, so I'm
> guessing these are accepted.
>
> On Sat, Apr 9, 2016 at 5:52 PM, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> (I haven't played with GraphFrames)
>>
>> What's your `sc.master`? How do you run your application --
>> spark-submit or java -jar or sbt run or...? The reason I'm asking is
>> that few options might not be in use whatsoever, e.g.
>> spark.driver.memory and spark.executor.memory in local mode.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sat, Apr 9, 2016 at 7:51 PM, Buntu Dev  wrote:
>> > I'm running this motif pattern against 1.5M vertices (5.5mb) and 10M
>> (60mb)
>> > edges:
>> >
>> >  tgraph.find("(a)-[]->(b); (c)-[]->(b); (c)-[]->(d)")
>> >
>> > I keep running into Java heap space errors:
>> >
>> > ~
>> >
>> > ERROR actor.ActorSystemImpl: Uncaught fatal error from thread
>> > [sparkDriver-akka.actor.default-dispatcher-33] shutting down ActorSystem
>> > [sparkDriver]
>> > java.lang.OutOfMemoryError: Java heap space
>> > at scala.reflect.ManifestFactory$$anon$6.newArray(Manifest.scala:90)
>> > at scala.reflect.ManifestFactory$$anon$6.newArray(Manifest.scala:88)
>> > at scala.Array$.ofDim(Array.scala:218)
>> > at akka.util.ByteIterator.toArray(ByteIterator.scala:462)
>> > at akka.util.ByteString.toArray(ByteString.scala:321)
>> > at
>> >
>> akka.remote.transport.AkkaPduProtobufCodec$.decodePdu(AkkaPduCodec.scala:168)
>> > at
>> >
>> akka.remote.transport.ProtocolStateActor.akka$remote$transport$ProtocolStateActor$$decodePdu(AkkaProtocolTransport.scala:513)
>> > at
>> >
>> akka.remote.transport.ProtocolStateActor$$anonfun$5.applyOrElse(AkkaProtocolTransport.scala:357)
>> > at
>> >
>> akka.remote.transport.ProtocolStateActor$$anonfun$5.applyOrElse(AkkaProtocolTransport.scala:352)
>> > at
>> >
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>> > at akka.actor.FSM$class.processEvent(FSM.scala:595)
>> > at
>> >
>> akka.remote.transport.ProtocolStateActor.processEvent(AkkaProtocolTransport.scala:220)
>> > at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:589)
>> > at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:583)
>> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>> > at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>> > at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>> > at
>> >
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>> > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> > at
>> >
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> > at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> > at
>> >
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> >
>> > ~
>> >
>> >
>> > Here is my config:
>> >
>> > conf.set("spark.executor.memory", "8192m")
>> > conf.set("spark.executor.cores", 4)
>> > conf.set("spark.driver.memory", "10240m")
>> > conf.set("spark.driver.maxResultSize", "2g")
>> > conf.set("spark.kryoserializer.buffer.max", "1024mb")
>> >
>> >
>> > Wanted to know if there are any other configs to tweak?
>> >
>> >
>> > Thanks!
>>
>
>


Re: Weird error while serialization

2016-04-09 Thread Ted Yu
The value was out of the range of integer.

Which Spark release are you using ?

Can you post snippet of code which can reproduce the error ?

Thanks

On Sat, Apr 9, 2016 at 12:25 PM, SURAJ SHETH  wrote:

> I am trying to perform some processing and cache and count the RDD.
> Any solutions?
>
> Seeing a weird error :
>
> File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1456909219314_0014/container_1456909219314_0014_01_04/pyspark.zip/pyspark/serializers.py",
>  line 550, in write_int
> stream.write(struct.pack("!i", value))
> error: 'i' format requires -2147483648 <= number <= 2147483647
>
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>
>
> Thanks and Regards,
>
> Suraj Sheth
>
>


Re: Unable run Spark in YARN mode

2016-04-09 Thread Ted Yu
mahesh :

bq. :16: error: not found: value sqlContext

Please take a look at:

https://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext

for how the import should be used.

Please include the version of Spark and the command line you used in your reply.
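For reference, the 1.x starting point that guide describes boils down to the following (a minimal sketch for spark-shell, where sc already exists; it may not by itself explain why sqlContext is missing in YARN mode):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._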


Re: DataFrame job fails on parsing error, help?

2016-04-08 Thread Ted Yu
Not much.

So no chance of different snappy version ?

On Fri, Apr 8, 2016 at 1:26 PM, Nicolas Tilmans <ntilm...@gmail.com> wrote:

> Hi Ted,
>
> The Spark version is 1.6.1, a nearly identical set of operations does fine
> on smaller datasets. It's just a few joins then a groupBy and a count in
> pyspark.sql on a Spark DataFrame.
>
> Any ideas?
>
> Nicolas
>
> On Fri, Apr 8, 2016 at 1:13 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Did you encounter similar error on a smaller dataset ?
>>
>> Which release of Spark are you using ?
>>
>> Is it possible you have an incompatible snappy version somewhere in your
>> classpath ?
>>
>> Thanks
>>
>> On Fri, Apr 8, 2016 at 12:36 PM, entee <ntilm...@gmail.com> wrote:
>>
>>> I'm trying to do a relatively large join (0.5TB shuffle read/write) and
>>> just
>>> calling a count (or show) on the dataframe fails to complete, getting to
>>> the
>>> last task before failing:
>>>
>>> Py4JJavaError: An error occurred while calling o133.count.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task 5
>>> in stage 11.0 failed 4 times, most recent failure: Lost task 5.3 in stage
>>> 11.0 (TID 7836, .com): java.io.IOException: failed to uncompress the
>>> chunk: PARSING_ERROR(2)
>>>
>>> (full stack trace below)
>>>
>>> I'm not sure why this would happen, any ideas about whether our
>>> configuration is off or how to fix this?
>>>
>>> Nicolas
>>>
>>>
>>>
>>> Py4JJavaError: An error occurred while calling o133.count.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task 5
>>> in stage 11.0 failed 4 times, most recent failure: Lost task 5.3 in stage
>>> 11.0 (TID 7836, .com): java.io.IOException: failed to uncompress the
>>> chunk: PARSING_ERROR(2)
>>> at
>>>
>>> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:361)
>>> at
>>> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
>>> at
>>> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
>>> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>>> at
>>> java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>>> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>>> at java.io.DataInputStream.read(DataInputStream.java:149)
>>> at
>>> org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:899)
>>> at
>>> org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
>>> at
>>>
>>> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:119)
>>> at
>>>
>>> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:102)
>>> at scala.collection.Iterator$$anon$12.next(Iterator.scala:397)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>>> at
>>>
>>> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>>> at
>>>
>>> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>>> at
>>>
>>> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>>> at
>>> org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>>> at
>>> org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>>> at
>>>
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>>> at
>>>
>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at
>>>
>>> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>>> at
>>> org.apache.spark.rdd.RDD.comput

Re: DataFrame job fails on parsing error, help?

2016-04-08 Thread Ted Yu
Did you encounter similar error on a smaller dataset ?

Which release of Spark are you using ?

Is it possible you have an incompatible snappy version somewhere in your
classpath ?

Thanks

On Fri, Apr 8, 2016 at 12:36 PM, entee  wrote:

> I'm trying to do a relatively large join (0.5TB shuffle read/write) and
> just
> calling a count (or show) on the dataframe fails to complete, getting to
> the
> last task before failing:
>
> Py4JJavaError: An error occurred while calling o133.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5
> in stage 11.0 failed 4 times, most recent failure: Lost task 5.3 in stage
> 11.0 (TID 7836, .com): java.io.IOException: failed to uncompress the
> chunk: PARSING_ERROR(2)
>
> (full stack trace below)
>
> I'm not sure why this would happen, any ideas about whether our
> configuration is off or how to fix this?
>
> Nicolas
>
>
>
> Py4JJavaError: An error occurred while calling o133.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5
> in stage 11.0 failed 4 times, most recent failure: Lost task 5.3 in stage
> 11.0 (TID 7836, .com): java.io.IOException: failed to uncompress the
> chunk: PARSING_ERROR(2)
> at
>
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:361)
> at
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
> at
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at java.io.DataInputStream.read(DataInputStream.java:149)
> at
> org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:899)
> at
> org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
> at
>
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:119)
> at
>
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:102)
> at scala.collection.Iterator$$anon$12.next(Iterator.scala:397)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
> at
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
> at
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
> at
>
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
> at
> org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
> at
> org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
> at
>
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
> at
>
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
>
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
>
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:96)
> at
>
> org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:95)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at
>
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:163)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at
> 

Re: How to configure parquet.block.size on Spark 1.6

2016-04-08 Thread Ted Yu
I searched 1.6.1 code base but didn't find how this can be configured
(within Spark).

On Fri, Apr 8, 2016 at 9:01 AM, nihed mbarek  wrote:

> Hi
> How to configure parquet.block.size on Spark 1.6 ?
>
> Thank you
> Nihed MBAREK
>
>
> --
>
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
>
> 
>
>
>


Re: can not join dataset with itself

2016-04-08 Thread Ted Yu
Looks like you're using Spark 1.6.x

What error(s) did you get for the first two joins ?

Thanks

On Fri, Apr 8, 2016 at 3:53 AM, JH P  wrote:

> Hi. I want to join a Dataset with itself, so I tried the code below.
>
> 1. newGnsDS.joinWith(newGnsDS, $"dataType”)
>
> 2. newGnsDS.as("a").joinWith(newGnsDS.as("b"), $"a.dataType" === $
> "b.datatype”)
>
> 3. val a = newGnsDS.map(x => x).as("a")
>val b = newGnsDS.map(x => x).as("b")
>
>
>a.joinWith(b, $"a.dataType" === $"b.datatype")
>
> 1 and 2 don’t work, but 3 does. I don’t know why it works; if a better idea
> exists, please share it.
>


Re: Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-07 Thread Ted Yu
Which Spark release are you using ?

Have you registered to all the events provided by SparkListener ?

If so, can you do event-wise summation of execution time ?

Thanks
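A sketch of what such an event-wise summation could look like (an illustration only, not code from this thread; the metric names are from TaskMetrics in the 1.5/1.6 line):

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// sums a few per-task metrics so they can be compared with the web UI total
class TaskTimeListener extends SparkListener {
  val runTime   = new AtomicLong(0)  // executor time spent running the task body
  val deserTime = new AtomicLong(0)  // task deserialization on the executor
  val gcTime    = new AtomicLong(0)  // JVM GC time attributed to the task

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      runTime.addAndGet(m.executorRunTime)
      deserTime.addAndGet(m.executorDeserializeTime)
      gcTime.addAndGet(m.jvmGCTime)
    }
  }
}

// register before the job runs, e.g. sc.addSparkListener(new TaskTimeListener)
// note: scheduler delay and driver-side work between stages are not part of task metrics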

On Thu, Apr 7, 2016 at 11:03 AM, JasmineGeorge  wrote:

> We are running a batch job with the following specifications
> •   Building RandomForest with config : maxbins=100, depth=19, num of
> trees =
> 20
> •   Multiple runs with different input data size 2.8 GB, 10 Million
> records
> •   We are running spark application on Yarn in cluster mode, with 3
> Node
> Managers(each with 16 virtual cores and 96G RAM)
> •   Spark config :
> o   spark.driver.cores = 2
> o   spark.driver.memory = 32 G
> o   spark.executor.instances = 5  and spark.executor.cores = 8 so 40
> cores in
> total.
> o   spark.executor.memory= 32G so total executor memory around 160 G.
>
> We are collecting execution times for the tasks using a SparkListener, and
> also the total execution time for the application from the Spark Web UI.
> Across all the tests we saw consistently that the sum total of the execution
> times of all the tasks accounts for about 60% of the total application
> run time.
> We are just wondering where the remaining 40% of the time is being spent.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Only-60-of-Total-Spark-Batch-Application-execution-time-spent-in-Task-Processing-tp26703.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: building kafka project on intellij Help is much appreciated

2016-04-07 Thread Ted Yu
This is the version of Kafka Spark depends on:

[INFO] +- org.apache.kafka:kafka_2.10:jar:0.8.2.1:compile
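The unresolved org.apache.kafka#kafka_2.10;1.6.0 in the log therefore comes from using a Spark version number for a Kafka artifact. A hedged sketch of corrected dependencies (versions are assumptions to verify against your cluster, and the direct kafka dependency is only needed if Kafka APIs are used directly):

libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.0"
libraryDependencies += "org.apache.kafka" %% "kafka" % "0.8.2.1"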

On Thu, Apr 7, 2016 at 9:14 AM, Haroon Rasheed 
wrote:

> Try removing libraryDependencies += "org.apache.kafka" %% "kafka" % "1.6.0"
> compile. I guess the internal dependencies are automatically pulled when
> you add spark-streaming-kafka_2.10.
>
> Also try changing the version to 1.6.1 or lower. Just to see if the links
> are broken.
>
> Regards,
> Haroon Syed
>
> On 7 April 2016 at 09:08, Sudhanshu Janghel <
> sudhanshu.jang...@cloudwick.com> wrote:
>
>> Hello,
>>
>> I am new to building Kafka projects and wish to understand how to make fat jars in
>> IntelliJ.
>> The sbt-assembly setup seems confusing and I am unable to resolve the
>> dependencies.
>>
>> here is my build.sbt
>>
>> name := "twitter"
>>
>> version := "1.0"
>> scalaVersion := "2.10.4"
>>
>>
>>
>> //libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.7" % "provided"
>> //libraryDependencies += "org.slf4j" % "slf4j-log4j12" % "1.7.7" % "provided"
>> //libraryDependencies += "com.google.guava" % "guava" % "11.0.2" 
>> exclude("log4j", "log4j") exclude("org.slf4j","slf4j-log4j12") 
>> exclude("org.slf4j","slf4j-api")
>> libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
>> libraryDependencies +=   "org.apache.kafka" %% "kafka"  % "1.6.0"
>> libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % 
>> "1.6.0"
>>
>>
>>
>> adn here is my assembly.sbt
>>
>> addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
>>
>>
>> the error faced is
>>
>> Error:Error while importing SBT project:...[info] Resolving 
>> org.scala-sbt#tracking;0.13.8 ...
>> [info] Resolving org.scala-sbt#cache;0.13.8 ...
>> [info] Resolving org.scala-sbt#testing;0.13.8 ...
>> [info] Resolving org.scala-sbt#test-agent;0.13.8 ...
>> [info] Resolving org.scala-sbt#test-interface;1.0 ...
>> [info] Resolving org.scala-sbt#main-settings;0.13.8 ...
>> [info] Resolving org.scala-sbt#apply-macro;0.13.8 ...
>> [info] Resolving org.scala-sbt#command;0.13.8 ...
>> [info] Resolving org.scala-sbt#logic;0.13.8 ...
>> [info] Resolving org.scala-sbt#precompiled-2_8_2;0.13.8 ...
>> [info] Resolving org.scala-sbt#precompiled-2_9_2;0.13.8 ...
>> [info] Resolving org.scala-sbt#precompiled-2_9_3;0.13.8 ...
>> [trace] Stack trace suppressed: run 'last *:update' for the full output.
>> [trace] Stack trace suppressed: run 'last *:ssExtractDependencies' for the 
>> full output.
>> [error] (*:update) sbt.ResolveException: unresolved dependency: 
>> com.eed3si9n#sbt-assembly;0.14.3: not found
>> [error] unresolved dependency: org.apache.kafka#kafka_2.10;1.6.0: not found
>> [error] (*:ssExtractDependencies) sbt.ResolveException: unresolved 
>> dependency: com.eed3si9n#sbt-assembly;0.14.3: not found
>> [error] unresolved dependency: org.apache.kafka#kafka_2.10;1.6.0: not found
>> [error] Total time: 18 s, completed Apr 7, 2016 5:05:06 PM
>>
>>
>>
>>
>


Re: how to query the number of running executors?

2016-04-06 Thread Ted Yu
Have you looked at SparkListener ?

  /**
   * Called when the driver registers a new executor.
   */
  def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit

  /**
   * Called when the driver removes an executor.
   */
  def onExecutorRemoved(executorRemoved: SparkListenerExecutorRemoved): Unit

FYI
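A minimal sketch of wiring those callbacks into a running count (an illustration, not from the thread; addSparkListener is a developer API):

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

class ExecutorCountListener extends SparkListener {
  val running = new AtomicInteger(0)
  override def onExecutorAdded(added: SparkListenerExecutorAdded): Unit =
    running.incrementAndGet()
  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit =
    running.decrementAndGet()
}

val listener = new ExecutorCountListener
sc.addSparkListener(listener)

// later, on the driver:
println(s"currently running executors: ${listener.running.get()}")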

On Wed, Apr 6, 2016 at 11:50 AM, Cesar Flores  wrote:

>
> Hello:
>
> I wonder if there is a way to query the number of running executors (not
> the number of requested executors) inside a Spark job?
>
>
> Thanks
> --
> Cesar Flores
>


Re: [Yarn] Spark AMs dead lock

2016-04-06 Thread Ted Yu
Which hadoop release are you using ?

bq. yarn cluster with 2GB RAM

I assume 2GB is per node. Isn't this too low for your use case ?

Cheers

On Wed, Apr 6, 2016 at 9:19 AM, Peter Rudenko 
wrote:

> Hi, I have a situation: say I have a YARN cluster with 2GB RAM. I'm
> submitting 2 Spark jobs with "--driver-memory 1GB --num-executors 2
> --executor-memory 1GB". So I see 2 Spark AMs running, but they are unable to
> allocate worker containers and start the actual job, and they hang for
> a while. Is it possible to set some sort of timeout for acquiring executors and
> otherwise kill the application?
>
> Thanks,
> Peter Rudenko
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: dataframe sorting and find the index of the maximum element

2016-04-05 Thread Ted Yu
The error was due to REPL expecting an integer (index to the Array) whereas
"MAX(count)" was a String.

What do you want to achieve ?
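If the goal is the row position of the maximum value in a single-column DataFrame, one possible sketch (an assumption about the intent; it also assumes the DataFrame is called df and the column holds long values, so adjust getLong for the actual type; zipWithIndex gives 0-based positions in RDD order):

val c = df(df.columns.head)
val maxVal = df.agg(org.apache.spark.sql.functions.max(c)).first().getLong(0)

// pair each row with its position, then keep the position of the maximum
val maxIdx = df.rdd.zipWithIndex()
  .filter { case (row, _) => row.getLong(0) == maxVal }
  .map { case (_, idx) => idx }
  .first()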

On Tue, Apr 5, 2016 at 4:17 AM, Angel Angel  wrote:

> Hello,
>
> I am writing a Spark application in which I need the index of the maximum
> element.
>
> My table has only one column, and I want the index of its maximum element.
>
> MAX(count)
> 23
> 32
> 3
> Here is my code; the data type of the array is
> org.apache.spark.sql.DataFrame.
>
>
> Thanks in advance.
> Also, please suggest another way to do it.
>
> [image: Inline image 1]
>


Re: dataframe sorting and find the index of the maximum element

2016-04-05 Thread Ted Yu
Did you define idxmax() method yourself ?

Thanks

On Tue, Apr 5, 2016 at 4:17 AM, Angel Angel  wrote:

> Hello,
>
> I am writing a Spark application in which I need the index of the maximum
> element.
>
> My table has only one column, and I want the index of its maximum element.
>
> MAX(count)
> 23
> 32
> 3
> Here is my code; the data type of the array is
> org.apache.spark.sql.DataFrame.
>
>
> Thanks in advance.
> Also, please suggest another way to do it.
>
> [image: Inline image 1]
>


Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-04 Thread Ted Yu
bq. I'm on version 2.10 for spark

The above is the Scala version.
Can you give us the Spark version ?

Thanks
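One common workaround for the pattern described in the quoted message below (a hedged sketch, not a confirmed fix for this report) is to keep the mapping function in a standalone serializable object, so the eta-expanded closure captures that small object instead of the enclosing class and its DStream graph:

object Transforms extends Serializable {
  def transform(x: String): String = x + " foo "
}

def run(): Unit = {
  val newStream = stream.map(Transforms.transform)   // stream as in the quoted example
  // ...
}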

On Mon, Apr 4, 2016 at 2:36 PM, mpawashe  wrote:

> Hi all,
>
> I am using Spark Streaming API (I'm on version 2.10 for spark and
> streaming), and I am running into a function serialization issue that I do
> not run into when using Spark in batch (non-streaming) mode.
>
> If I wrote code like this:
>
> def run(): Unit = {
> val newStream = stream.map(x => { x + " foo " })
> // ...
> }
>
> everything works fine.. But if I try it like this:
>
> def transform(x: String): String = { x + " foo " }
>
> def run(): Unit = {
> val newStream = stream.map(transform)
> // ...
> }
>
> ..the program fails because it cannot serialize the closure (when
> passing a method where a function is expected, it should be
> auto-converted to a closure, to my understanding).
>
> However it works fine if I declare a closure inside run() and use that like
> so:
>
> val transform = (x: String) => { x + " foo " }
>
> If it's declared outside of run(), however, it will also crash.
>
> This is an example stack trace of the error I'm running into. This can be a
> hassle to debug so I hope I wouldn't have to get around this by having to
> use a local closure/function every time. Thanks for any help in advance.
>
> org.apache.spark.SparkException: Task not serializable
> at
>
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
> at
>
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
> at
> org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
> at
>
> org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
> at
>
> org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
> at
>
> org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:266)
> at
> org.apache.spark.streaming.dstream.DStream.map(DStream.scala:527)
> at com.my.cool.app.MyClass.run(MyClass.scala:90)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
> at
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: java.io.NotSerializableException: Graph is unexpectedly null
> when
> DStream is being serialized.
> Serialization stack:
>
> at
>
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at
>
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
> at
>
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
> at
>
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
> ... 20 more
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-NotSerializableException-Methods-Closures-tp26672.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Ted Yu
bq. the modifications do not touch the scheduler

If the changes can be ported over to 1.6.1, do you mind reproducing the
issue there ?

I ask because master branch changes very fast. It would be good to narrow
the scope where the behavior you observed started showing.

On Mon, Apr 4, 2016 at 6:12 AM, Mike Hynes <91m...@gmail.com> wrote:

> [ CC'ing dev list since nearly identical questions have occurred in
> user list recently w/o resolution;
> c.f.:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html
> ]
>
> Hello,
>
> In short, I'm reporting a problem concerning load imbalance of RDD
> partitions across a standalone cluster. Though there are 16 cores
> available per node, certain nodes will have >16 partitions, and some
> will correspondingly have <16 (and even 0).
>
> In more detail: I am running some scalability/performance tests for
> vector-type operations. The RDDs I'm considering are simple block
> vectors of type RDD[(Int,Vector)] for a Breeze vector type. The RDDs
> are generated with a fixed number of elements given by some multiple
> of the available cores, and subsequently hash-partitioned by their
> integer block index.
>
> I have verified that the hash partitioning key distribution, as well
> as the keys themselves, are both correct; the problem is truly that
> the partitions are *not* evenly distributed across the nodes.
>
> For instance, here is a representative output for some stages and
> tasks in an iterative program. This is a very simple test with 2
> nodes, 64 partitions, 32 cores (16 per node), and 2 executors. Two
> examples stages from the stderr log are stages 7 and 9:
> 7,mapPartitions at DummyVector.scala:113,64,1459771364404,1459771365272
> 9,mapPartitions at DummyVector.scala:113,64,1459771364431,1459771365639
>
> When counting the location of the partitions on the compute nodes from
> the stderr logs, however, you can clearly see the imbalance. Example
> lines are:
> 13627 task 0.0 in stage 7.0 (TID 196,
> himrod-2, partition 0,PROCESS_LOCAL, 3987 bytes)&
> 13628 task 1.0 in stage 7.0 (TID 197,
> himrod-2, partition 1,PROCESS_LOCAL, 3987 bytes)&
> 13629 task 2.0 in stage 7.0 (TID 198,
> himrod-2, partition 2,PROCESS_LOCAL, 3987 bytes)&
>
> Grep'ing the full set of above lines for each hostname, himrod-?,
> shows the problem occurs in each stage. Below is the output, where the
> number of partitions stored on each node is given alongside its
> hostname as in (himrod-?,num_partitions):
> Stage 7: (himrod-1,0) (himrod-2,64)
> Stage 9: (himrod-1,16) (himrod-2,48)
> Stage 12: (himrod-1,0) (himrod-2,64)
> Stage 14: (himrod-1,16) (himrod-2,48)
> The imbalance is also visible when the executor ID is used to count
> the partitions operated on by executors.
>
> I am working off a fairly recent modification of 2.0.0-SNAPSHOT branch
> (but the modifications do not touch the scheduler, and are irrelevant
> for these particular tests). Has something changed radically in 1.6+
> that would make a previously (<=1.5) correct configuration go haywire?
> Have new configuration settings been added of which I'm unaware that
> could lead to this problem?
>
> Please let me know if others in the community have observed this, and
> thank you for your time,
> Mike
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: multiple splits fails

2016-04-03 Thread Ted Yu
Since num is an Int, you can specify Integer.MAX_VALUE

FYI
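Putting the two options together (a sketch reusing the names from the quoted code; foreachRDD is an alternative not mentioned above, and collect() pulls every batch to the driver, so it only suits small results):

val counts = lines.filter(_.contains("ASE 15"))
                  .filter(_.contains("UPDATE INDEX STATISTICS"))
                  .flatMap(_.split("\n,"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print(Integer.MAX_VALUE)   // print every element of each batch

// or materialize each batch fully on the driver:
counts.foreachRDD(rdd => rdd.collect().foreach(println))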

On Sun, Apr 3, 2016 at 10:58 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Ted.
>
> As I see print() materializes the first 10 rows. On the other hand
> print(n) will materialise n rows.
>
> How about if I wanted to materialize all rows?
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 3 April 2016 at 18:05, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Mich:
>> See the following method of DStream:
>>
>>* Print the first num elements of each RDD generated in this DStream.
>> This is an output
>>* operator, so this DStream will be registered as an output stream and
>> there materialized.
>>*/
>>   def print(num: Int): Unit = ssc.withScope {
>>
>> On Sun, Apr 3, 2016 at 9:32 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> However this works. I am checking the logic to see if it does what I
>>> asked it to do
>>>
>>> val v = lines.filter(_.contains("ASE 15")).filter(_ contains("UPDATE
>>> INDEX STATISTICS")).flatMap(line => line.split("\n,")).map(word => (word,
>>> 1)).reduceByKey(_ + _).print
>>>
>>> scala> ssc.start()
>>>
>>> scala> ---
>>> Time: 1459701925000 ms
>>> ---
>>> (* Check that you have run UPDATE INDEX STATISTICS on all ASE 15
>>> databases,27)
>>> (o You can try UPDATE INDEX STATISTICS WITH SAMPLING in ASE 15 OR,27)
>>> (Once databases are loaded to ASE 15, then you will need to maintain
>>> them the way you maintain your PROD. For example run UPDATE INDEX
>>> STATISTICS and REORG COMPACT as necessary. One of the frequent mistakes
>>> that people do is NOT pruning data from daily log tables in ASE 15 etc as
>>> they do it in PROD. This normally results in slower performance on ASE 15
>>> databases as test cycles continue. Use MDA readings to measure daily DML
>>> activities on ASE 15 tables and compare them with those of PROD. A 24 hour
>>> cycle measurement should be good. If you notice that certain tables have
>>> different DML hits (insert/update/delete) compared to PROD you will know
>>> that either ASE 15 is not doing everything in terms of batch activity (some
>>> jobs are missing), or there is something inconsistent somewhere. ,27)
>>> (* Make sure that you have enough tempdb system segment space for UPDATE
>>> INDEX STATISTICS. It is always advisable to gauge the tempdb size required
>>> in ASE 15 QA and expand the tempdb database in production accordingly. The
>>> last thing you want is to blow up tempdb over the migration weekend.,27)
>>> (o In ASE 15 you can subdivide the task by running parallel UPDATE INDEX
>>> STATISTICS on different tables in the same database at the same time. Watch
>>> tempdb segment growth though! OR,27)
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 3 April 2016 at 17:06, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Ted
>>>>
>>>> This works
>>>>
>>>> scala> val messages = KafkaUtils.createDirectStream[String, String,
>>>> StringDecoder, StringDecoder](ssc, kafkaParams, topic)
>>>> messages: org.apache.spark.streaming.dstream.InputDStream[(String,
>>>> String)] = 
>>>> org.apache.spark.streaming.kafka.DirectKafkaInputDStream@3bfc0063
>>>>
>>>> scala> // Get the lines
>>>> scala> val lines = messages.map(_._2)
>>>> lines: org.apache.spark.streaming.dstream.DStream[String] =
>>>> org.apache.spark.streaming.dstream.MappedDStream@1e4afd64
>>>>
>>>> scala> val v = lines.filter(_.contains("ASE 15")).filter(_
>>>> contains(&quo

Re: multiple splits fails

2016-04-03 Thread Ted Yu
Mich:
See the following method of DStream:

   * Print the first num elements of each RDD generated in this DStream.
This is an output
   * operator, so this DStream will be registered as an output stream and
there materialized.
   */
  def print(num: Int): Unit = ssc.withScope {

On Sun, Apr 3, 2016 at 9:32 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> However this works. I am checking the logic to see if it does what I asked
> it to do
>
> val v = lines.filter(_.contains("ASE 15")).filter(_ contains("UPDATE INDEX
> STATISTICS")).flatMap(line => line.split("\n,")).map(word => (word,
> 1)).reduceByKey(_ + _).print
>
> scala> ssc.start()
>
> scala> ---
> Time: 1459701925000 ms
> ---
> (* Check that you have run UPDATE INDEX STATISTICS on all ASE 15
> databases,27)
> (o You can try UPDATE INDEX STATISTICS WITH SAMPLING in ASE 15 OR,27)
> (Once databases are loaded to ASE 15, then you will need to maintain them
> the way you maintain your PROD. For example run UPDATE INDEX STATISTICS and
> REORG COMPACT as necessary. One of the frequent mistakes that people do is
> NOT pruning data from daily log tables in ASE 15 etc as they do it in PROD.
> This normally results in slower performance on ASE 15 databases as test
> cycles continue. Use MDA readings to measure daily DML activities on ASE 15
> tables and compare them with those of PROD. A 24 hour cycle measurement
> should be good. If you notice that certain tables have different DML hits
> (insert/update/delete) compared to PROD you will know that either ASE 15 is
> not doing everything in terms of batch activity (some jobs are missing), or
> there is something inconsistent somewhere. ,27)
> (* Make sure that you have enough tempdb system segment space for UPDATE
> INDEX STATISTICS. It is always advisable to gauge the tempdb size required
> in ASE 15 QA and expand the tempdb database in production accordingly. The
> last thing you want is to blow up tempdb over the migration weekend.,27)
> (o In ASE 15 you can subdivide the task by running parallel UPDATE INDEX
> STATISTICS on different tables in the same database at the same time. Watch
> tempdb segment growth though! OR,27)
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 3 April 2016 at 17:06, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Thanks Ted
>>
>> This works
>>
>> scala> val messages = KafkaUtils.createDirectStream[String, String,
>> StringDecoder, StringDecoder](ssc, kafkaParams, topic)
>> messages: org.apache.spark.streaming.dstream.InputDStream[(String,
>> String)] = org.apache.spark.streaming.kafka.DirectKafkaInputDStream@3bfc0063
>>
>> scala> // Get the lines
>> scala> val lines = messages.map(_._2)
>> lines: org.apache.spark.streaming.dstream.DStream[String] =
>> org.apache.spark.streaming.dstream.MappedDStream@1e4afd64
>>
>> scala> val v = lines.filter(_.contains("ASE 15")).filter(_
>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ + _)
>> v: org.apache.spark.streaming.dstream.DStream[(String, Int)] =
>> org.apache.spark.streaming.dstream.ShuffledDStream@5aa09d
>>
>> However, this fails
>>
>> scala> val v = lines.filter(_.contains("ASE 15")).filter(_
>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
>> _).collect.foreach(println)
>> :43: error: value collect is not a member of
>> org.apache.spark.streaming.dstream.DStream[(String, Int)]
>>  val v = lines.filter(_.contains("ASE 15")).filter(_
>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
>> _).collect.foreach(println)
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 3 April 2016 at 16:01, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> 

Re: multiple splits fails

2016-04-03 Thread Ted Yu
bq. is not a member of (String, String)

As shown above, contains shouldn't be applied directly on a tuple.

Choose the element of the tuple and then apply contains on it.
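For example (a sketch against the (key, value) pairs produced by createDirectStream, where the second element is the message body):

val lines = messages.map(_._2)   // pick the value out of the (String, String) tuple
val hits  = lines.filter(line => line.contains("ASE 15") &&
                                 line.contains("UPDATE INDEX STATISTICS"))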

On Sun, Apr 3, 2016 at 7:54 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thank you gents.
>
> That should be "\n" as carriage return
>
> OK I am using spark streaming to analyse the message
>
> It does the streaming
>
> import _root_.kafka.serializer.StringDecoder
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming._
> import org.apache.spark.streaming.kafka.KafkaUtils
> //
> scala> val sparkConf = new SparkConf().
>  |  setAppName("StreamTest").
>  |  setMaster("local[12]").
>  |  set("spark.driver.allowMultipleContexts", "true").
>  |  set("spark.hadoop.validateOutputSpecs", "false")
> scala> val ssc = new StreamingContext(sparkConf, Seconds(55))
> scala>
> scala> val kafkaParams = Map[String, String]("bootstrap.servers" ->
> "rhes564:9092", "schema.registry.url" -> "http://rhes564:8081;,
> "zookeeper.connect" -> "rhes564:2181", "group.id" -> "StreamTest" )
> kafkaParams: scala.collection.immutable.Map[String,String] =
> Map(bootstrap.servers -> rhes564:9092, schema.registry.url ->
> http://rhes564:8081, zookeeper.connect -> rhes564:2181, group.id ->
> StreamTest)
> scala> val topic = Set("newtopic")
> topic: scala.collection.immutable.Set[String] = Set(newtopic)
> scala> val messages = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](ssc, kafkaParams, topic)
> messages: org.apache.spark.streaming.dstream.InputDStream[(String,
> String)] = org.apache.spark.streaming.kafka.DirectKafkaInputDStream@5d8ccb6c
>
> This part is tricky
>
> scala> val showlines = messages.filter(_ contains("ASE 15")).filter(_
> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
> _).collect.foreach(println)
> :47: error: value contains is not a member of (String, String)
>  val showlines = messages.filter(_ contains("ASE 15")).filter(_
> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
> _).collect.foreach(println)
>
>
> How does one refer to the content of the stream here?
>
> Thanks
>
>
>
>
>
>
>
>
>
>
> //
> // Now want to do some analysis on the same text file
> //
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 3 April 2016 at 15:32, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> bq. split"\t," splits the filter by carriage return
>>
>> Minor correction: "\t" denotes tab character.
>>
>> On Sun, Apr 3, 2016 at 7:24 AM, Eliran Bivas <elir...@iguaz.io> wrote:
>>
>>> Hi Mich,
>>>
>>> 1. The first underscore in your filter call is referring to a line in the
>>> file (as textFile() results in a collection of strings)
>>> 2. You're correct. No need for it.
>>> 3. Filter is expecting a Boolean result. So you can merge your contains
>>> filters to one with AND (&&) statement.
>>> 4. Correct. Each character in split() is used as a divider.
>>>
>>> Eliran Bivas
>>>
>>> *From:* Mich Talebzadeh <mich.talebza...@gmail.com>
>>> *Sent:* Apr 3, 2016 15:06
>>> *To:* Eliran Bivas
>>> *Cc:* user @spark
>>> *Subject:* Re: multiple splits fails
>>>
>>> Hi Eliran,
>>>
>>> Many thanks for your input on this.
>>>
>>> I thought about what I was trying to achieve so I rewrote the logic as
>>> follows:
>>>
>>>
>>>1. Read the text file in
>>>2. Filter out empty lines (well not really needed here)
>>>3. Search for lines that contain "ASE 15" and further have sentence
>>>"UPDATE INDEX STATISTICS" in the said line
>>>4. Split the text by "\t" and ","
>>>5. Print the outcome
>>>
>>>
>>> This was what I did with your suggestions included
>>>
>&

Re: multiple splits fails

2016-04-03 Thread Ted Yu
bq. split"\t," splits the filter by carriage return

Minor correction: "\t" denotes tab character.

On Sun, Apr 3, 2016 at 7:24 AM, Eliran Bivas  wrote:

> Hi Mich,
>
> 1. The first underscore in your filter call is referring to a line in the
> file (as textFile() results in a collection of strings)
> 2. You're correct. No need for it.
> 3. Filter is expecting a Boolean result. So you can merge your contains
> filters to one with AND (&&) statement.
> 4. Correct. Each character in split() is used as a divider.
>
> Eliran Bivas
>
> *From:* Mich Talebzadeh 
> *Sent:* Apr 3, 2016 15:06
> *To:* Eliran Bivas
> *Cc:* user @spark
> *Subject:* Re: multiple splits fails
>
> Hi Eliran,
>
> Many thanks for your input on this.
>
> I thought about what I was trying to achieve so I rewrote the logic as
> follows:
>
>
>1. Read the text file in
>2. Filter out empty lines (well not really needed here)
>3. Search for lines that contain "ASE 15" and further have sentence
>"UPDATE INDEX STATISTICS" in the said line
>4. Split the text by "\t" and ","
>5. Print the outcome
>
>
> This was what I did with your suggestions included
>
> val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
> f.cache()
>  f.filter(_.length > 0).filter(_ contains("ASE 15")).filter(_
> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
> line.split("\t,")).map(word => (word, 1)).reduceByKey(_ +
> _).collect.foreach(println)
>
>
> Couple of questions if I may
>
>
>1. I take that "_" refers to content of the file read in by default?
>2. _.length > 0 basically filters out blank lines (not really needed
>here)
>3. Multiple filters are needed for each *contains* logic
>4. split"\t," splits the filter by carriage return AND ,?
>
>
> Regards
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 3 April 2016 at 12:35, Eliran Bivas  wrote:
>
>> Hi Mich,
>>
>> Few comments:
>>
>> When doing .filter(_ > “”) you’re actually doing a lexicographic
>> comparison and not filtering for empty lines (which could be achieved with
>> _.nonEmpty or _.length > 0).
>> I think that filtering with _.contains should be sufficient and the
>> first filter can be omitted.
>>
>> As for line => line.split(“\t”).split(“,”):
>> You have to do a second map or (since split() requires a regex as input)
>> .split(“\t,”).
>> The problem is that your first split() call will generate an Array and
>> then your second call will result in an error.
>> e.g.
>>
>> val lines: Array[String] = line.split(“\t”)
>> lines.split(“,”) // Compilation error - no method split() exists for Array
>>
>> So either go with map(_.split(“\t”)).map(_.split(“,”)) or
>> map(_.split(“\t,”))
>>
>> Hope that helps.
>>
>> *Eliran Bivas*
>> Data Team | iguaz.io
>>
>>
>> On 3 Apr 2016, at 13:31, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> I am not sure this is the correct approach
>>
>> Read a text file in
>>
>> val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
>>
>>
>> Now I want to get rid of empty lines and filter only the lines that
>> contain "ASE15"
>>
>>  f.filter(_ > "").filter(_ contains("ASE15")).
>>
>> The above works but I am not sure whether I need two filter
>> transformation above? Can it be done in one?
>>
>> Now I want to map the above filter to lines with carriage return ans
>> split them by ","
>>
>> f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
>> (line.split("\t")))
>> res88: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[131] at
>> map at :30
>>
>> Now I want to split the output by ","
>>
>> scala> f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
>> (line.split("\t").split(",")))
>> :30: error: value split is not a member of Array[String]
>>   f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
>> (line.split("\t").split(",")))
>>
>> ^
>> Any advice will be appreciated
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>


Re: Multiple lookups; consolidate result and run further aggregations

2016-04-02 Thread Ted Yu
Looking at the implementation for lookup in PairRDDFunctions, I think your
understanding is correct.
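As a sketch of the broadcast alternative mentioned in the question (names reused from the quoted code; this filters on the executors instead of issuing one driver-side lookup per key, so downstream aggregation stays distributed):

// broadcast the (small) key set once, then filter the pair RDD on the executors
val keysB = sparkContext.broadcast(lookupKeys.toSet)
val matched = dataRdd.filter { case (k, _) => keysB.value.contains(k) }

// matched is still an RDD, so further aggregation runs on the cluster,
// e.g. matched.reduceByKey(...) or matched.aggregateByKey(...)(..., ...)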


On Sat, Apr 2, 2016 at 3:16 AM, Nirav Patel  wrote:

> I will start with a question: is the Spark lookup function on a pair RDD a driver
> action, i.e. is the result returned to the driver?
>
> I have a list of keys on the driver side and I want to perform multiple parallel
> lookups on a pair RDD which return Seq[V], consolidate the results, and perform
> further aggregation/transformation over the cluster.
>
> val seqVal = lookupKeys.flatMap(key => {
>
> dataRdd.lookup(key)
>
>   })
>
>
> Here's what I think will happen internally:
>
> Each lookup for Seq[V] returns its result to the driver.
>
> Consolidation of each Seq[V] has to happen on the driver due to the flatMap
> function.
>
> All subsequent operations will happen on the driver side unless I do
> sparkContext.parallelize(seqVal)
>
> Is this correct?
>
> Also, what I am trying to do is an efficient multiple lookup. Another option
> is to broadcast the lookup keys and perform a join.
>
> Please advice.
>
> Thanks
> Nirav
>
>
>
> [image: What's New with Xactly] 
>
>   [image: LinkedIn]
>   [image: Twitter]
>   [image: Facebook]
>   [image: YouTube]
> 


Re: Scala: Perform Unit Testing in spark

2016-04-02 Thread Ted Yu
I think you should specify dependencies in this way:

*"org.apache.spark" % "spark-core_2.10" % "1.6.0"* % "tests"

Please refer to http://www.scalatest.org/user_guide/using_scalatest_with_sbt
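A minimal sketch of such a suite with a local SparkContext (an illustration; the scalatest dependency, e.g. "org.scalatest" %% "scalatest" % "2.2.6" % "test", is an assumption to add to the build):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
  }

  test("reduceByKey counts words") {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") === 2)
  }
}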

On Fri, Apr 1, 2016 at 3:33 PM, Shishir Anshuman <shishiranshu...@gmail.com>
wrote:

> When I added *"org.apache.spark" % "spark-core_2.10" % "1.6.0",  *it
> should include spark-core_2.10-1.6.1-tests.jar.
> Why do I need to use the jar file explicitly?
>
> And how do I use the jars for compiling with *sbt* and running the tests
> on spark?
>
>
> On Sat, Apr 2, 2016 at 3:46 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> You need to include the following jars:
>>
>> jar tvf ./core/target/spark-core_2.10-1.6.1-tests.jar | grep SparkFunSuite
>>   1787 Thu Mar 03 09:06:14 PST 2016
>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class
>>   1780 Thu Mar 03 09:06:14 PST 2016
>> org/apache/spark/SparkFunSuite$$anonfun$withFixture$2.class
>>   3982 Thu Mar 03 09:06:14 PST 2016 org/apache/spark/SparkFunSuite.class
>>
>> jar tvf ./mllib/target/spark-mllib_2.10-1.6.1-tests.jar | grep
>> MLlibTestSparkContext
>>   1447 Thu Mar 03 09:53:54 PST 2016
>> org/apache/spark/mllib/util/MLlibTestSparkContext.class
>>   1704 Thu Mar 03 09:53:54 PST 2016
>> org/apache/spark/mllib/util/MLlibTestSparkContext$class.class
>>
>> On Fri, Apr 1, 2016 at 3:07 PM, Shishir Anshuman <
>> shishiranshu...@gmail.com> wrote:
>>
>>> I got the file ALSSuite.scala and trying to run it. I have copied the
>>> file under *src/test/scala *in my project folder. When I run *sbt test*,
>>> I get errors. I have attached the screenshot of the errors. Befor *sbt
>>> test*, I am building the package with *sbt package*.
>>>
>>> Dependencies of *simple.sbt*:
>>>
>>>>
>>>>
>>>>
>>>>
>>>> *libraryDependencies ++= Seq( "org.apache.spark" % "spark-core_2.10" %
>>>> "1.6.0", "org.apache.spark" % "spark-mllib_2.10" % "1.6.0" )*
>>>
>>>
>>>
>>>
>>> On Sat, Apr 2, 2016 at 2:21 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Assuming your code is written in Scala, I would suggest using
>>>> ScalaTest.
>>>>
>>>> Please take a look at the XXSuite.scala files under mllib/
>>>>
>>>> On Fri, Apr 1, 2016 at 1:31 PM, Shishir Anshuman <
>>>> shishiranshu...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a code written in scala using Mllib. I want to perform unit
>>>>> testing it. I cant decide between Junit 4 and ScalaTest.
>>>>> I am new to Spark. Please guide me how to proceed with the testing.
>>>>>
>>>>> Thank you.
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Problem with jackson lib running on spark

2016-04-01 Thread Ted Yu
Thanks for sharing the workaround.

Probably send a PR on tranquilizer github :-)

On Fri, Apr 1, 2016 at 12:50 PM, Marcelo Oikawa  wrote:

> Hi, list.
>
> Just to close the thread. Unfortunately, I didn't solve the Jackson lib
> problem, but I found a workaround that works fine for me. Perhaps this helps
> someone else.
>
> The problem arose from this line, when I tried to create the tranquilizer object
> (used to connect to Druid) using the *fromConfig* utility.
>
> tranquilizer = 
> DruidBeams.fromConfig(dataSourceConfig).buildTranquilizer(tranquilizerBuider);
>
> Instead, I build the tranquilizer object piece by piece, as shown below:
>
> DruidDimensions dimentions = getDimensions(props);
> List aggregators = getAggregations(props);
>
> TimestampSpec timestampSpec = new TimestampSpec("timestamp", "auto", null);
> Timestamper> timestamper = map -> new 
> DateTime(map.get("timestamp"));
> DruidLocation druidLocation = DruidLocation.create("overlord", 
> "druid:firehose:%s", dataSource);
> DruidRollup druidRollup = DruidRollup.create(dimentions, aggregators, 
> QueryGranularity.ALL);
>
>
> ClusteredBeamTuning clusteredBeamTuning = ClusteredBeamTuning.builder()
> 
> .segmentGranularity(Granularity.FIVE_MINUTE)
> .windowPeriod(new 
> Period("PT60m"))
> .partitions(1)
> .replicants(1)
> .build();
>
> tranquilizer = DruidBeams.builder(timestamper)
>.curator(buildCurator(props))
>.discoveryPath("/druid/discovery")
>.location(druidLocation)
>.timestampSpec(timestampSpec)
>.rollup(druidRollup)
>.tuning(clusteredBeamTuning)
>.buildTranquilizer();
>
> tranquilizer.start();
>
> That worked for me. Thank you Ted, Alonso and other users.
>
>
> On Thu, Mar 31, 2016 at 4:08 PM, Marcelo Oikawa <
> marcelo.oik...@webradar.com> wrote:
>
>>
>> Please exclude jackson-databind - that was where the AnnotationMap class
>>> comes from.
>>>
>>
>> I tried as you suggested but I am getting the same error. It seems strange
>> because when I look inside the generated jar there is nothing related to
>> AnnotationMap, but there is a databind entry in there.
>>
>>
>>
>>
>>>
>>> On Thu, Mar 31, 2016 at 11:37 AM, Marcelo Oikawa <
>>> marcelo.oik...@webradar.com> wrote:
>>>
 Hi, Alonso.

 As you can see jackson-core is provided by several libraries, try to
> exclude it from spark-core; I think the minor version is included within
> it.
>

 There is no more than one jackson-core provided by spark-core. There
 are jackson-core and jackson-core-asl, but those are different artifacts. BTW,
 I tried to exclude them but had no success. Same error:

 java.lang.IllegalAccessError: tried to access method
 com.fasterxml.jackson.databind.introspect.AnnotatedMember.getAllAnnotations()Lcom/fasterxml/jackson/databind/introspect/AnnotationMap;
 from class
 com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector
 at
 com.fasterxml.jackson.databind.introspect.GuiceAnnotationIntrospector.findInjectableValueId(GuiceAnnotationIntrospector.java:39)
 at
 com.fasterxml.jackson.databind.introspect.AnnotationIntrospectorPair.findInjectableValueId(AnnotationIntrospectorPair.java:269)
 at
 com.fasterxml.jackson.databind.deser.BasicDeserializerFactory._addDeserializerConstructors(BasicDeserializerFactory.java:433)
 ...

 I guess the problem is an incompatibility between the jackson artifacts that
 come from the tranquility dependency and the ones provided by Spark, but I
 also looked for the same jackson artifacts in different versions and there
 are none. What is missing?


 Use this guide to see how to do it:
>
>
> https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html
>
>
>
> Alonso Isidoro Roman.
>
> Mis citas preferidas (de hoy) :
> "Si depurar es el proceso de quitar los errores de software, entonces
> programar debe ser el proceso de introducirlos..."
>  -  Edsger Dijkstra
>
> My favorite quotes (today):
> "If debugging is the process of removing software bugs, then
> programming must be the process of putting ..."
>   - Edsger Dijkstra
>
> "If you pay peanuts you get monkeys"
>
>
> 2016-03-31 20:01 GMT+02:00 Marcelo Oikawa  >:
>
>> Hey, Alonso.
>>
>> here is the output:
>>
>> [INFO] spark-processor:spark-processor-druid:jar:1.0-SNAPSHOT
>> [INFO] +- org.apache.spark:spark-streaming_2.10:jar:1.6.1:provided
>> [INFO] |  +- 
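
A hedged aside on the exclusion approach discussed above: the Maven guide linked earlier covers the <exclusions> element; if the build were sbt instead, an equivalent exclusion would look roughly like this (the tranquility coordinates and version are assumptions, not taken from the thread):

libraryDependencies += ("io.druid" % "tranquility-core_2.10" % "0.7.4")
  .exclude("com.fasterxml.jackson.core", "jackson-databind")
  .exclude("com.fasterxml.jackson.core", "jackson-core")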

Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Ted Yu
You need to include the following jars:

jar tvf ./core/target/spark-core_2.10-1.6.1-tests.jar | grep SparkFunSuite
  1787 Thu Mar 03 09:06:14 PST 2016
org/apache/spark/SparkFunSuite$$anonfun$withFixture$1.class
  1780 Thu Mar 03 09:06:14 PST 2016
org/apache/spark/SparkFunSuite$$anonfun$withFixture$2.class
  3982 Thu Mar 03 09:06:14 PST 2016 org/apache/spark/SparkFunSuite.class

jar tvf ./mllib/target/spark-mllib_2.10-1.6.1-tests.jar | grep
MLlibTestSparkContext
  1447 Thu Mar 03 09:53:54 PST 2016
org/apache/spark/mllib/util/MLlibTestSparkContext.class
  1704 Thu Mar 03 09:53:54 PST 2016
org/apache/spark/mllib/util/MLlibTestSparkContext$class.class

On Fri, Apr 1, 2016 at 3:07 PM, Shishir Anshuman <shishiranshu...@gmail.com>
wrote:

> I got the file ALSSuite.scala and am trying to run it. I have copied the file
> under *src/test/scala* in my project folder. When I run *sbt test*, I get
> errors. I have attached the screenshot of the errors. Before *sbt test*, I
> am building the package with *sbt package*.
>
> Dependencies of *simple.sbt*:
>
>>
>>
>>
>>
>> *libraryDependencies ++= Seq( "org.apache.spark" % "spark-core_2.10" %
>> "1.6.0", "org.apache.spark" % "spark-mllib_2.10" % "1.6.0" )*
>
>
>
>
> On Sat, Apr 2, 2016 at 2:21 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Assuming your code is written in Scala, I would suggest using ScalaTest.
>>
>> Please take a look at the XXSuite.scala files under mllib/
>>
>> On Fri, Apr 1, 2016 at 1:31 PM, Shishir Anshuman <
>> shishiranshu...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have code written in Scala using MLlib. I want to perform unit
>>> testing on it. I can't decide between JUnit 4 and ScalaTest.
>>> I am new to Spark. Please guide me on how to proceed with the testing.
>>>
>>> Thank you.
>>>
>>
>>
>
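
To make the ScalaTest suggestion concrete, here is a minimal, self-contained suite sketch that starts a local SparkContext and exercises an MLlib algorithm; the algorithm, class name, and assertion are illustrative assumptions, not the ALSSuite discussed in the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class KMeansLocalSuite extends FunSuite with BeforeAndAfterAll {

  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    // local[2] keeps the test self-contained; no cluster required
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("KMeansLocalSuite"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
  }

  test("k-means separates two obvious clusters") {
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
    val model = KMeans.train(points, 2, 10) // k = 2, maxIterations = 10
    assert(model.clusterCenters.length == 2)
  }
}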


Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Ted Yu
Assuming your code is written in Scala, I would suggest using ScalaTest.

Please take a look at the XXSuite.scala files under mllib/

On Fri, Apr 1, 2016 at 1:31 PM, Shishir Anshuman 
wrote:

> Hello,
>
> I have code written in Scala using MLlib. I want to perform unit testing
> on it. I can't decide between JUnit 4 and ScalaTest.
> I am new to Spark. Please guide me on how to proceed with the testing.
>
> Thank you.
>


Re: Where to set properties for the retainedJobs/Stages?

2016-04-01 Thread Ted Yu
Please read
https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties
w.r.t. spark-defaults.conf

On Fri, Apr 1, 2016 at 12:06 PM, Max Schmidt <m...@datapath.io> wrote:

> Yes, but the doc doesn't say a word about which process the configs apply to,
> so do I have to set them for the history server? The daemon? The workers?
>
> And what if I use the Java API instead of spark-submit for the jobs?
>
> I guess spark-defaults.conf is not picked up at all when using the Java API?
>
>
> On 2016-04-01 18:58, Ted Yu wrote:
>
>> You can set them in spark-defaults.conf
>>
>> See also https://spark.apache.org/docs/latest/configuration.html#spark-ui
>> [1]
>>
>> On Fri, Apr 1, 2016 at 8:26 AM, Max Schmidt <m...@datapath.io> wrote:
>>
>> Can somebody tell me the interaction between the properties:
>>>
>>> spark.ui.retainedJobs
>>> spark.ui.retainedStages
>>> spark.history.retainedApplications
>>>
>>> I know from the bug tracker that the last one controls the number of
>>> applications the history server holds in memory.
>>>
>>> Can I set the properties in the spark-env.sh? And where?
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>>
>> Links:
>> --
>> [1] https://spark.apache.org/docs/latest/configuration.html#spark-ui
>>
>
>
>
>
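
If the job is launched programmatically rather than through spark-submit, the two spark.ui.* properties can be set directly on the application's SparkConf, since they apply to the driver's UI; spark.history.retainedApplications, by contrast, is read by the history server process (typically via its own spark-defaults.conf or SPARK_HISTORY_OPTS). A minimal sketch follows; the app name, master, and values are arbitrary assumptions:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("retained-ui-demo")        // arbitrary
  .setMaster("local[*]")                 // or your cluster master
  .set("spark.ui.retainedJobs", "200")
  .set("spark.ui.retainedStages", "500")
val sc = new SparkContext(conf)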


Re: OutOfMemory with wide (289 column) dataframe

2016-04-01 Thread Ted Yu
bq. This was a big help!

The email (maybe only addressed to you) didn't come with your latest reply.

Do you mind sharing it?

Thanks

On Fri, Apr 1, 2016 at 11:37 AM, ludflu  wrote:

> This was a big help! For the benefit of my fellow travelers running spark
> on
> EMR:
>
> I made a json file with the following:
>
> [ { "Classification": "yarn-site", "Properties": {
> "yarn.nodemanager.pmem-check-enabled": "false",
> "yarn.nodemanager.vmem-check-enabled": "false" } } ]
>
> and then I created my cluster like so:
>
> aws emr create-cluster --configurations
> file:///Users/jsnavely/project/frick/spark_config/nomem.json
> ...
>
> The other thing I noticed was that one of the dataframes I was joining
> against was actually coming from a gzip'd json file. gzip files are NOT
> splittable, so it wasn't properly parallelized, which means the join was
> causing a lot of memory pressure. I recompressed it with bzip2 and my job
> has been running with no errors.
>
> Thanks again!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-with-wide-289-column-dataframe-tp26651p26660.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
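
A hedged illustration of the gzip point above (the path and partition count are made up): a .json.gz file is read as a single partition because gzip is not splittable, so either recompress with a splittable codec such as bzip2, as the poster did, or repartition right after the read, before the join:

// assumes a spark-shell style sqlContext on Spark 1.6
val wide = sqlContext.read.json("s3://my-bucket/events.json.gz")
// gzip is not splittable, so 'wide' starts as one partition;
// spreading it out before the join relieves memory pressure on a single executor
val spread = wide.repartition(200)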


Re: Where to set properties for the retainedJobs/Stages?

2016-04-01 Thread Ted Yu
You can set them in spark-defaults.conf

See also https://spark.apache.org/docs/latest/configuration.html#spark-ui

On Fri, Apr 1, 2016 at 8:26 AM, Max Schmidt  wrote:

> Can somebody tell me the interaction between the properties:
>
> spark.ui.retainedJobs
> spark.ui.retainedStages
> spark.history.retainedApplications
>
> I know from the bug tracker that the last one controls the number of
> applications the history server holds in memory.
>
> Can I set the properties in the spark-env.sh? And where?
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

