Re: Ingesting data in elasticsearch from hdfs using spark , cluster setup and usage

2016-12-22 Thread Rohit Verma
The ingestion rate below is actually with a batch size of 10mb, 10 records. I have 
tried with 20-50 partitions; higher partition counts give bulk queue 
exceptions.

Anyway, thanks for the suggestion. I would appreciate more input, specifically on 
the cluster design.

Rohit
> On Dec 22, 2016, at 11:31 PM, genia...@gmail.com  wrote:
> 
> One thing I would look at is how many partitions your dataset has before 
> writing to ES using Spark, as that may be the limiting factor for your parallel 
> writes. 
> 
> You can also tune the batch size on ES writes...
> 
> One more thing, make sure you have enough network bandwidth...
> 
> Regards,
> 
> Yang
> 
> Sent from my iPhone
> 
>> On Dec 22, 2016, at 12:35 PM, Rohit Verma  wrote:
>> 
>> I am setting up a Spark cluster. I have HDFS data nodes and Spark master 
>> nodes on the same instances. To add Elasticsearch to this cluster, should I 
>> spawn ES on different machines or on the same machines? I have only 12 machines: 
>> 1 master (Spark and HDFS)
>> 8 Spark workers and HDFS data nodes
>> I can use 3 nodes for ES dedicatedly, or can use 11 nodes running all three.
>> 
>> All instances are the same, 16 GB dual core (unfortunately). 
>> 
>> Also, I am trying the es-hadoop / es-spark project, but ingestion felt very 
>> slow with 3 dedicated nodes, around 0.6 million records/minute. 
>> If anyone has experience with that project, can you please share your 
>> thoughts about tuning?
>> 
>> Regards
>> Rohit
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Thanks, Hyukjin.

I’ll try using the Parquet tools for 1.9

On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon  wrote:

Hi Benjamin,


As you might already know, I believe the Hadoop command does not actually
merge column-based formats such as ORC or Parquet but just simply
concatenates them.

I haven't tried this by myself but I remember I saw a JIRA in Parquet -
https://issues.apache.org/jira/browse/PARQUET-460

It seems parquet-tools allows merging small Parquet files into one.


Also, I believe there are command-line tools in Kite -
https://github.com/kite-sdk/kite

This might be useful.


Thanks!

2016-12-23 7:01 GMT+09:00 Benjamin Kim :

Has anyone tried to merge *.gz.parquet files before? I'm trying to merge
them into 1 file after they are output from Spark. Doing a coalesce(1) on
the Spark cluster will not work. It just does not have the resources to do
it. I'm trying to do it using the commandline and not use Spark. I will use
this command in shell script. I tried "hdfs dfs -getmerge", but the file
becomes unreadable by Spark with gzip footer error.





Thanks,


Ben


-


To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Best Practice for Spark Job Jar Generation

2016-12-22 Thread Chetan Khatri
Hello Spark Community,

For Spark Job Creation I use SBT Assembly to build Uber("Super") Jar and
then submit to spark-submit.

Example,

bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob
/home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar

But other folks have argued for a leaner (non-uber) jar. Can you please
explain to me the industry-standard best practice for this?

Thanks,

Chetan Khatri.
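One common middle ground, for what it's worth, is to keep building an assembly but mark Spark 
itself as "provided" so it is not bundled, which keeps the jar small; the merge strategy then 
mostly has to deal with conflicting META-INF entries. A minimal build.sbt sketch along these 
lines, assuming sbt-assembly 0.14.x (versions and merge rules are illustrative only, not a 
definitive setup):

// build.sbt: minimal sketch, adjust versions to your cluster
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
)

// sbt-assembly: discard META-INF conflicts that commonly break uber jars
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

With "provided" scoping, spark-submit supplies the Spark classes at runtime, so the assembly 
only carries your own code and third-party libraries.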


Can't access the data in Kafka Spark Streaming globally

2016-12-22 Thread Sree Eedupuganti
I am trying to stream the data from Kafka to Spark.

JavaPairInputDStream<String, String> directKafkaStream =
    KafkaUtils.createDirectStream(ssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams, topics);

Here I am iterating over the JavaPairInputDStream to process the RDDs.

directKafkaStream.foreachRDD(rdd ->{
rdd.foreachPartition(items ->{
while (items.hasNext()) {
String[] State = items.next()._2.split("\\,");
System.out.println(State[2]+","+State[3]+","+State[4]+"--");
};
});
});


Inside this I am able to access the String array, but when I try to access the
String array data globally I can't access it. My requirement is to access this
data globally because I have another lookup table in Hive, and I am trying to
perform an operation against it. Any suggestions please? Thanks.

-- 
Best Regards,
Sreeharsha Eedupuganti
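One way to use the Hive lookup table is to do the join inside foreachRDD rather than trying to 
share the array outside the stream. A rough Scala sketch of that idea (the same approach works 
from Java); ssc, kafkaParams, topics and hiveContext are assumed to already exist, and the 
"lookup" table and "state_code" column are made-up names for illustration:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  import hiveContext.implicits._
  // turn each CSV record into a row with the columns of interest
  val eventsDF = rdd.map(_._2.split(","))
    .map(a => (a(2), a(3), a(4)))
    .toDF("state_code", "col3", "col4")
  // join against the Hive lookup table here, instead of accessing the array globally
  eventsDF.join(hiveContext.table("lookup"), Seq("state_code")).show()
}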


Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Thanks, Hyukjin.

I’ll try using the Parquet tools for 1.9 based on the jira. If that doesn’t 
work, I’ll try Kite.

Cheers,
Ben


> On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon  wrote:
> 
> Hi Benjamin,
> 
> 
> As you might already know, I believe the Hadoop command does not actually 
> merge column-based formats such as ORC or Parquet but just simply 
> concatenates them.
> 
> I haven't tried this by myself but I remember I saw a JIRA in Parquet - 
> https://issues.apache.org/jira/browse/PARQUET-460 
> 
> 
> It seems parquet-tools allows merging small Parquet files into one. 
> 
> 
> Also, I believe there are command-line tools in Kite - 
> https://github.com/kite-sdk/kite 
> 
> This might be useful.
> 
> 
> Thanks!
> 
> 2016-12-23 7:01 GMT+09:00 Benjamin Kim  >:
> Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them 
> into 1 file after they are output from Spark. Doing a coalesce(1) on the 
> Spark cluster will not work. It just does not have the resources to do it. 
> I'm trying to do it using the commandline and not use Spark. I will use this 
> command in shell script. I tried "hdfs dfs -getmerge", but the file becomes 
> unreadable by Spark with gzip footer error.
> 
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 



Re: Merging Parquet Files

2016-12-22 Thread Hyukjin Kwon
Hi Benjamin,


As you might already know, I believe the Hadoop command does not actually
merge column-based formats such as ORC or Parquet but just simply
concatenates them.

I haven't tried this by myself but I remember I saw a JIRA in Parquet -
https://issues.apache.org/jira/browse/PARQUET-460

It seems parquet-tools allows merging small Parquet files into one.
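For reference, the merge command added by that JIRA is invoked roughly as follows; this is a 
sketch only, since the exact jar name and argument handling depend on the parquet-tools version 
you build or download:

# merge the small part files under /tmp/df into a single Parquet file
hadoop jar parquet-tools-1.9.0.jar merge hdfs:///tmp/df hdfs:///tmp/df-merged.parquet

Note that merge only places the existing row groups one after another rather than rewriting 
them, so it reduces the file count but not the number of row groups.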


Also, I believe there are command-line tools in Kite -
https://github.com/kite-sdk/kite

This might be useful.


Thanks!

2016-12-23 7:01 GMT+09:00 Benjamin Kim :

> Has anyone tried to merge *.gz.parquet files before? I'm trying to merge
> them into 1 file after they are output from Spark. Doing a coalesce(1) on
> the Spark cluster will not work. It just does not have the resources to do
> it. I'm trying to do it using the commandline and not use Spark. I will use
> this command in shell script. I tried "hdfs dfs -getmerge", but the file
> becomes unreadable by Spark with gzip footer error.
>
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: submit spark task on yarn asynchronously via java?

2016-12-22 Thread Linyuxin
Hi,
Could Anybody help?

From: Linyuxin
Sent: 22 December 2016 14:18
To: user 
Subject: submit spark task on yarn asynchronously via java?

Hi All,

Version:
Spark 1.5.1
Hadoop 2.7.2

Is there any way to submit and monitor a Spark task on YARN via Java 
asynchronously?
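One option is the SparkLauncher builder in org.apache.spark.launcher (available since Spark
1.4), which submits the job as a child process without blocking the caller. A sketch in Scala,
though the same API is callable from Java; the paths, class name and Spark home below are
placeholders:

import org.apache.spark.launcher.SparkLauncher

val sparkProcess: Process = new SparkLauncher()
  .setSparkHome("/opt/spark")              // placeholder
  .setAppResource("/path/to/my-job.jar")   // placeholder
  .setMainClass("com.example.MyJob")       // placeholder
  .setMaster("yarn-cluster")
  .launch()                                // returns immediately with a process handle

// the caller can keep doing other work and poll sparkProcess for completion,
// or query YARN (e.g. the ResourceManager REST API) for the application status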




Re: streaming performance

2016-12-22 Thread Tathagata Das
From what I understand looking at the code on Stack Overflow, I think you
are "simulating" the streaming version of your calculation incorrectly. You
are repeatedly unioning batch dataframes to simulate streaming and then
applying the aggregation on the unioned DF. That is not going to compute the
aggregates incrementally; it will just process the whole data every time,
reprocessing the oldest batch DFs again and again and causing increasing
resource usage. That's not how streaming works, so this is not simulating the
right thing.

With Structured Streaming's streaming dataframes, it is actually done
incrementally. The best way to try that is to generate a file per "bucket",
and then create a streaming dataframe on the files such that they are picked
up one by one. See this notebook for the maxFilesPerTrigger option:
http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Structured%20Streaming%20using%20Scala%20DataFrames%20API.html

This would process each file one by one, maintain internal state to
continuously update the aggregates, and never require reprocessing the old
data.
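A rough sketch of that pattern (Spark 2.x Structured Streaming; the schema, path, aggregation
and sink below are placeholders for the real computation):

import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("bucket", StringType)
  .add("value", DoubleType)

val streamingDF = spark.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", "1")   // pick up the per-bucket files one at a time
  .csv("/path/to/bucket-files")        // placeholder path

val query = streamingDF
  .groupBy("bucket")
  .agg(avg("value"))                   // state is kept and updated incrementally
  .writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("bucket_averages")
  .start()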

Hope this helps



On Wed, Dec 21, 2016 at 7:58 AM, Mendelson, Assaf 
wrote:

> I am having trouble with streaming performance. My main problem is how to do
> a sliding window calculation where the ratio between the window size and
> the step size is relatively large (hundreds) without recalculating
> everything all the time.
>
> I created a simple example of what I am aiming at with what I have so far
> which is detailed in http://stackoverflow.com/questions/41266956/apache-
> spark-streaming-performance
>
> I was hoping someone can point me to what I am doing wrong.
>
> Thanks,
>
> Assaf.
>


Re: Why does Spark 2.0 change number or partitions when reading a parquet file?

2016-12-22 Thread Daniel Siegmann
Spark 2.0.0 introduced "Automatic file coalescing for native data sources" (
http://spark.apache.org/releases/spark-release-2-0-0.html#performance-and-runtime).
Perhaps that is the cause?

I'm not sure if this feature is mentioned anywhere in the documentation or
if there's any way to disable it.
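If it helps, two things to try (a sketch, not verified against 2.0.2 specifically): the
coalescing is influenced by the spark.sql.files.* settings, and you can always force the
partition count back up after the read, though repartition shuffles, so rows will not land in
the same partitions they originally had:

// make each file look "expensive" to open, which discourages packing many files per partition
spark.conf.set("spark.sql.files.openCostInBytes", "134217728")

// or simply restore the partition count explicitly after reading
val newdf = spark.read.parquet("/tmp/df").repartition(200)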


--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001


On Thu, Dec 22, 2016 at 11:09 AM, Kristina Rogale Plazonic  wrote:

> Hi,
>
> I write a randomly generated 30,000-row dataframe to parquet. I verify
> that it has 200 partitions (both in Spark and inspecting the parquet file
> in hdfs).
>
> When I read it back in, it has 23 partitions?! Is there some optimization
> going on? (This doesn't happen in Spark 1.5)
>
> How can I force it to read back the same partitions, i.e. 200? I'm
> trying to reproduce a problem that depends on partitioning and can't
> because the number of partitions goes way down.
>
> Thanks for any insights!
> Kristina
>
> Here is the code and output:
>
> scala> spark.version
> res13: String = 2.0.2
>
> scala> df.show(2)
> +---+-----------+----------+----------+----------+----------+--------+--------+
> | id|        id2|  strfeat0|  strfeat1|  strfeat2|  strfeat3|binfeat0|binfeat1|
> +---+-----------+----------+----------+----------+----------+--------+--------+
> |  0|12345678901|fcvEmHTZte|      null|fnuAQdnBkJ|aU3puFMq5h|       1|       1|
> |  1|12345678902|      null|rtcrPaAVNX|fnuAQdnBkJ|x6NyoX662X|       0|       0|
> +---+-----------+----------+----------+----------+----------+--------+--------+
> only showing top 2 rows
>
>
> scala> df.count
> res15: Long = 30001
>
> scala> df.rdd.partitions.size
> res16: Int = 200
>
> scala> df.write.parquet("/tmp/df")
>
>
> scala> val newdf = spark.read.parquet("/tmp/df")
> newdf: org.apache.spark.sql.DataFrame = [id: int, id2: bigint ... 6 more
> fields]
>
> scala> newdf.rdd.partitions.size
> res18: Int = 23
>
>
> [kris@airisdata195 ~]$ hdfs dfs -ls /tmp/df
> Found 201 items
> -rw-r--r--   3 kris supergroup  0 2016-12-22 11:01 /tmp/df/_SUCCESS
> -rw-r--r--   3 kris supergroup   4974 2016-12-22 11:01
> /tmp/df/part-r-0-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
> -rw-r--r--   3 kris supergroup   4914 2016-12-22 11:01
> /tmp/df/part-r-1-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
> .
> . (omitted output)
> .
> -rw-r--r--   3 kris supergroup   4893 2016-12-22 11:01
> /tmp/df/part-r-00198-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
> -rw-r--r--   3 kris supergroup   4981 2016-12-22 11:01
> /tmp/df/part-r-00199-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
>
>


Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them 
into 1 file after they are output from Spark. Doing a coalesce(1) on the Spark 
cluster will not work. It just does not have the resources to do it. I'm trying 
to do it using the commandline and not use Spark. I will use this command in 
shell script. I tried "hdfs dfs -getmerge", but the file becomes unreadable by 
Spark with gzip footer error.

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-22 Thread Jörn Franke
Why not upgrade to ojdbc7 - this one is for java 7+8? Keep in mind that the 
jdbc driver is updated constantly (although simply called ojdbc7). I would be 
surprised if this does not work with cloudera as it runs on the oracle big data 
appliance.

> On 22 Dec 2016, at 21:44, Mich Talebzadeh  wrote:
> 
> Thanks all sorted. The admin had updated these lines incorrectly in 
> $SPARK_HOME/conf/spark-defaults.conf by updating one of the parameters only 
> for Oracle ojdbc6.jar and the other one for Sybase jconn4.jar!
> 
>  
> spark.driver.extraClassPath  /home/hduser/jars/ojdbc6.jar
> spark.executor.extraClassPath/home/hduser/jars/jconn4.jar
> 
> Also noticed that although we could see Oracle table metadata once connected 
> no action such as collecting data could be done. It was crashing.
> 
> So we decided to downgrade ojdbc6.jar to ojdbc5.jar and that worked with Spark. 
> Interestingly, Sqoop has no problem using ojdbc6.jar. FYI the Oracle database 
> version accessed is 11g R2.
> 
> Also, it is a challenge in a multi-tenanted cluster to maintain multiple 
> versions of jars for the same database type through 
> $SPARK_HOME/conf/spark-defaults.conf!
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 22 December 2016 at 10:44, Alexander Kapustin  wrote:
>> Hello,
>> 
>>  
>> 
>> For Spark 1.5 (and for 1.6) we use Oracle jdbc via spark-submit.sh --jars 
>> /path/to/ojdbc6.jar
>> 
>> Also we use additional oracle driver properties via --driver-java-options
>> 
>>  
>> 
>> Sent from Mail for Windows 10
>> 
>>  
>> 
>> From: Mich Talebzadeh
>> Sent: 22 декабря 2016 г. 0:40
>> To: user @spark
>> Subject: Has anyone managed to connect to Oracle via JDBC from Spark CDH 
>> 5.5.2
>> 
>>  
>> 
>> This works with Spark 2 with Oracle jar file added to
>> 
>> $SPARK_HOME/conf/spark-defaults.conf 
>>  
>>  
>> spark.driver.extraClassPath  /home/hduser/jars/ojdbc6.jar
>> spark.executor.extraClassPath/home/hduser/jars/ojdbc6.jar
>>  
>> and you get
>> scala> val s = HiveContext.read.format("jdbc").options(
>>  | Map("url" -> _ORACLEserver,
>>  | "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS 
>> CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS 
>> RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>  | "partitionColumn" -> "ID",
>>  | "lowerBound" -> "1",
>>  | "upperBound" -> "1",
>>  | "numPartitions" -> "10",
>>  | "user" -> _username,
>>  | "password" -> _password)).load
>> s: org.apache.spark.sql.DataFrame = [ID: string, CLUSTERED: string ... 5 
>> more fields]
>> 
>> that works.
>> However, with CDH 5.5.2 (Spark 1.5) it fails with error
>> 
>> java.sql.SQLException: No suitable driver
>>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>>   at 
>> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
>>   at 
>> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
>>   at scala.Option.getOrElse(Option.scala:121)
>>   at 
>> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:53)
>>   at 
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:123)
>>   at 
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
>>   at 
>> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)
>>   at 
>> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315)
>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>> 
>> 
>> Any ideas?
>> 
>> Thanks
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
> 


Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-22 Thread Mich Talebzadeh
Thanks all sorted. The admin had updated these lines incorrectly in
$SPARK_HOME/conf/spark-defaults.conf by updating one of the parameters only
for Oracle ojdbc6.jar and the other one for Sybase jconn4.jar!


spark.driver.extraClassPath  /home/hduser/jars/ojdbc6.jar

spark.executor.extraClassPath/home/hduser/jars/jconn4.jar


Also noticed that although we could see Oracle table metadata once
connected no action such as collecting data could be done. It was crashing.


So we decided to downgrade ojdbc6.jar to ojdbc5.jar and that worked with
Spark. Interestingly, Sqoop has no problem using ojdbc6.jar. FYI the Oracle
database version accessed is 11g R2.


Also, it is a challenge in a multi-tenanted cluster to maintain multiple
versions of jars for the same database type through
$SPARK_HOME/conf/spark-defaults.conf!


HTH

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 22 December 2016 at 10:44, Alexander Kapustin  wrote:

> Hello,
>
>
>
> For Spark 1.5 (and for 1.6) we use Oracle jdbc via spark-submit.sh --jars
> /path/to/ojdbc6.jar
>
> Also we use additional oracle driver properties via --driver-java-options
>
>
>
> Sent from Mail  for
> Windows 10
>
>
>
> From: Mich Talebzadeh 
> Sent: 22 December 2016 0:40
> To: user @spark 
> Subject: Has anyone managed to connect to Oracle via JDBC from Spark
> CDH 5.5.2
>
>
> This works with Spark 2 with Oracle jar file added to
>
> $SPARK_HOME/conf/spark-defaults.conf
>
>
>
>
> spark.driver.extraClassPath  /home/hduser/jars/ojdbc6.jar
>
> spark.executor.extraClassPath/home/hduser/jars/ojdbc6.jar
>
>
> and you get
>
> scala> val s = HiveContext.read.format("jdbc").options(
>  | Map("url" -> _ORACLEserver,
>  | "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS
> CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS
> RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>  | "partitionColumn" -> "ID",
>  | "lowerBound" -> "1",
>  | "upperBound" -> "1",
>  | "numPartitions" -> "10",
>  | "user" -> _username,
>  | "password" -> _password)).load
> s: org.apache.spark.sql.DataFrame = [ID: string, CLUSTERED: string ... 5
> more fields]
> that works.
> However, with CDH 5.5.2 (Spark 1.5) it fails with error
>
> java.sql.SQLException: No suitable driver
>
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>
>   at org.apache.spark.sql.execution.datasources.jdbc.
> JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
>
>   at org.apache.spark.sql.execution.datasources.jdbc.
> JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
>
>   at scala.Option.getOrElse(Option.scala:121)
>
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.
> createConnectionFactory(JdbcUtils.scala:53)
>
>   at org.apache.spark.sql.execution.datasources.jdbc.
> JDBCRDD$.resolveTable(JDBCRDD.scala:123)
>
>   at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(
> JDBCRelation.scala:117)
>
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.
> createRelation(JdbcRelationProvider.scala:53)
>
>   at org.apache.spark.sql.execution.datasources.
> DataSource.resolveRelation(DataSource.scala:315)
>
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>
>
> Any ideas?
>
> Thanks
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Spark subscribe

2016-12-22 Thread pradeep s
Hi ,
Can you please add me to spark subscription list.
Regards
Pradeep S


Re: Ingesting data in elasticsearch from hdfs using spark , cluster setup and usage

2016-12-22 Thread genia...@gmail.com
One thing I would look at is how many partitions your dataset has before writing 
to ES using Spark, as that may be the limiting factor for your parallel writes. 

You can also tune the batch size on ES writes...

One more thing, make sure you have enough network bandwidth...
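For example, with the es-spark connector both knobs can be set roughly like this (a sketch
assuming a DataFrame named df; the index name, partition count and values are only
illustrative):

import org.elasticsearch.spark.sql._

val esConf = Map(
  "es.batch.size.bytes"    -> "10mb",
  "es.batch.size.entries"  -> "10000",
  "es.batch.write.refresh" -> "false"   // skip the index refresh after each bulk request
)

df.repartition(24)                      // roughly: ES data nodes x concurrent bulk slots you want
  .saveToEs("myindex/mytype", esConf)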

Regards,

Yang

Sent from my iPhone

> On Dec 22, 2016, at 12:35 PM, Rohit Verma  wrote:
> 
> I am setting up a Spark cluster. I have HDFS data nodes and Spark master 
> nodes on the same instances. To add Elasticsearch to this cluster, should I spawn 
> ES on different machines or on the same machines? I have only 12 machines: 
> 1 master (Spark and HDFS)
> 8 Spark workers and HDFS data nodes
> I can use 3 nodes for ES dedicatedly, or can use 11 nodes running all three.
> 
> All instances are the same, 16 GB dual core (unfortunately). 
> 
> Also, I am trying the es-hadoop / es-spark project, but ingestion felt very 
> slow with 3 dedicated nodes, around 0.6 million records/minute. 
> If anyone has experience with that project, can you please share your 
> thoughts about tuning?
> 
> Regards
> Rohit
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Ingesting data in elasticsearch from hdfs using spark , cluster setup and usage

2016-12-22 Thread Rohit Verma
I am setting up a Spark cluster. I have HDFS data nodes and Spark master nodes 
on the same instances. To add Elasticsearch to this cluster, should I spawn ES on 
different machines or on the same machines? I have only 12 machines: 
1 master (Spark and HDFS)
8 Spark workers and HDFS data nodes
I can use 3 nodes for ES dedicatedly, or can use 11 nodes running all three.

All instances are the same, 16 GB dual core (unfortunately). 

Also, I am trying the es-hadoop / es-spark project, but ingestion felt very 
slow with 3 dedicated nodes, around 0.6 million records/minute. 
If anyone has experience with that project, can you please share your thoughts 
about tuning?

Regards
Rohit
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: parsing embedded json in spark

2016-12-22 Thread Shaw Liu
Hi, I guess you can use the 'get_json_object' function.
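For example (get_json_object has been available since Spark 1.6; the column name and JSON
paths below are made up):

import org.apache.spark.sql.functions.{col, get_json_object}

// extract individual values from the embedded JSON column without a custom UDF
val parsed = df.select(
  get_json_object(col("payload"), "$.user.id").alias("user_id"),
  get_json_object(col("payload"), "$.user.name").alias("user_name"))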

Get Outlook for iOS




On Thu, Dec 22, 2016 at 9:52 PM +0800, "Irving Duran"  
wrote:










Is it an option to parse that field prior to creating the dataframe? If so, 
that's what I would do.
In terms of only your master node working, you will have to share more about your 
setup: are you using Spark standalone, YARN, or Mesos?

Thank You,

Irving Duran

On Thu, Dec 22, 2016 at 1:42 AM, Tal Grynbaum  wrote:
Hi, 

I have a dataframe that contains an embedded JSON string in one of the fields. I 
tried to write a UDF function that will parse it using lift-json, but it seems 
to take a very long time to process, and it seems that only the master node is 
working.
Has anyone dealt with such a scenario before and can give me some hints?
Thanks,
Tal









Why does Spark 2.0 change number or partitions when reading a parquet file?

2016-12-22 Thread Kristina Rogale Plazonic
Hi,

I write a randomly generated 30,000-row dataframe to parquet. I verify that
it has 200 partitions (both in Spark and inspecting the parquet file in
hdfs).

When I read it back in, it has 23 partitions?! Is there some optimization
going on? (This doesn't happen in Spark 1.5)

How can I force it to read back the same partitions, i.e. 200? I'm trying
to reproduce a problem that depends on partitioning and can't because the
number of partitions goes way down.

Thanks for any insights!
Kristina

Here is the code and output:

scala> spark.version
res13: String = 2.0.2

scala> df.show(2)
+---+-----------+----------+----------+----------+----------+--------+--------+
| id|        id2|  strfeat0|  strfeat1|  strfeat2|  strfeat3|binfeat0|binfeat1|
+---+-----------+----------+----------+----------+----------+--------+--------+
|  0|12345678901|fcvEmHTZte|      null|fnuAQdnBkJ|aU3puFMq5h|       1|       1|
|  1|12345678902|      null|rtcrPaAVNX|fnuAQdnBkJ|x6NyoX662X|       0|       0|
+---+-----------+----------+----------+----------+----------+--------+--------+
only showing top 2 rows


scala> df.count
res15: Long = 30001

scala> df.rdd.partitions.size
res16: Int = 200

scala> df.write.parquet("/tmp/df")


scala> val newdf = spark.read.parquet("/tmp/df")
newdf: org.apache.spark.sql.DataFrame = [id: int, id2: bigint ... 6 more
fields]

scala> newdf.rdd.partitions.size
res18: Int = 23


[kris@airisdata195 ~]$ hdfs dfs -ls /tmp/df
Found 201 items
-rw-r--r--   3 kris supergroup  0 2016-12-22 11:01 /tmp/df/_SUCCESS
-rw-r--r--   3 kris supergroup   4974 2016-12-22 11:01
/tmp/df/part-r-0-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
-rw-r--r--   3 kris supergroup   4914 2016-12-22 11:01
/tmp/df/part-r-1-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
.
. (omitted output)
.
-rw-r--r--   3 kris supergroup   4893 2016-12-22 11:01
/tmp/df/part-r-00198-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet
-rw-r--r--   3 kris supergroup   4981 2016-12-22 11:01
/tmp/df/part-r-00199-84584688-612f-49a3-a023-4a5c6d784d96.snappy.parquet


Re: parsing embedded json in spark

2016-12-22 Thread Irving Duran
Is it an option to parse that field prior to creating the dataframe? If so,
that's what I would do.

In terms of only your master node working, you will have to share more about
your setup: are you using Spark standalone, YARN, or Mesos?


Thank You,

Irving Duran

On Thu, Dec 22, 2016 at 1:42 AM, Tal Grynbaum 
wrote:

> Hi,
>
> I have a dataframe that contains an embedded JSON string in one of the
> fields.
> I tried to write a UDF function that will parse it using lift-json, but
> it seems to take a very long time to process, and it seems that only the
> master node is working.
>
> Has anyone dealt with such a scenario before and can give me some hints?
>
> Thanks
> Tal
>


RE: spark-shell fails to redefine values

2016-12-22 Thread Spencer, Alex (Santander)
Can you ask for eee in between each reassignment? The memory address at the end, 
1ec5bf62, != 2c6beb3e or 66cb003 – so what's going on there?


From: Yang [mailto:tedd...@gmail.com]
Sent: 21 December 2016 18:37
To: user 
Subject: spark-shell fails to redefine values

Summary: spark-shell fails to redefine values in some cases. This is at least 
seen in a case where "implicit" is involved, but is not limited to such cases.

Run the following in spark-shell; you can see that the last redefinition does not 
take effect. The same code runs in the plain Scala REPL without problems.

scala> class Useless{}
defined class Useless

scala> class Useless1 {}
defined class Useless1

scala> implicit val eee :Useless = new Useless()
eee: Useless = Useless@2c6beb3e

scala> implicit val eee:Useless1 = new Useless1()
eee: Useless1 = Useless1@66cb003

scala> eee
res24: Useless = Useless@1ec5bf62


RE: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-22 Thread Alexander Kapustin
Hello,

For Spark 1.5 (and for 1.6) we use Oracle jdbc via spark-submit.sh --jars 
/path/to/ojdbc6.jar
Also we use additional oracle driver properties via --driver-java-options
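Another thing that sometimes resolves the "No suitable driver" error on Spark 1.5, besides
--jars, is naming the driver class explicitly in the JDBC options. A sketch, with a placeholder
URL:

val s = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:oracle:thin:@//dbhost:1521/service",    // placeholder
  "driver"   -> "oracle.jdbc.OracleDriver",                   // forces the Oracle driver to be registered
  "dbtable"  -> "scratchpad.dummy",
  "user"     -> _username,
  "password" -> _password)).load()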

Sent from Mail for Windows 10

From: Mich Talebzadeh
Sent: 22 December 2016 0:40
To: user @spark
Subject: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

This works with Spark 2 with Oracle jar file added to


$SPARK_HOME/conf/spark-defaults.conf





spark.driver.extraClassPath  /home/hduser/jars/ojdbc6.jar

spark.executor.extraClassPath/home/hduser/jars/ojdbc6.jar



and you get

 scala> val s = HiveContext.read.format("jdbc").options(
 | Map("url" -> _ORACLEserver,
 | "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS 
CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, 
RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
 | "partitionColumn" -> "ID",
 | "lowerBound" -> "1",
 | "upperBound" -> "1",
 | "numPartitions" -> "10",
 | "user" -> _username,
 | "password" -> _password)).load
s: org.apache.spark.sql.DataFrame = [ID: string, CLUSTERED: string ... 5 more 
fields]

that works.
However, with CDH 5.5.2 (Spark 1.5) it fails with error


java.sql.SQLException: No suitable driver

  at java.sql.DriverManager.getDriver(DriverManager.java:315)

  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)

  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)

  at scala.Option.getOrElse(Option.scala:121)

  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:53)

  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:123)

  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)

  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)

  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315)

  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)

  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)


Any ideas?

Thanks





Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.