Re: Write to Cassandra table from pyspark fails with scala reflect error

2016-09-14 Thread Russell Spitzer
Spark 2.0 defaults to Scala 2.11, so if you didn't build it yourself you
need the 2.11 artifact for the Spark Cassandra Connector.
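
For reference, a minimal sketch of pulling in the 2.11 build of the connector; the 2.0.0-M3 version number is taken from the thread below, and the exact coordinates should be double-checked against the connector's own documentation:

```
// sbt (Scala 2.11 build of Spark 2.0):
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-M3"

// For pyspark / spark-submit, the equivalent _2.11 artifact would be passed as:
//   --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3
```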

On Wed, Sep 14, 2016 at 7:44 PM Trivedi Amit 
wrote:

> Hi,
>
>
>
> I am testing a pyspark program that will read from a csv file and write
> data into Cassandra table. I am using pyspark with
> spark-cassandra-connector 2.10:2.0.0-M3. I am using Spark v2.0.0.
>
> While executing below command
>
> ```df.write.format("org.apache.spark.sql.cassandra").mode('append').options(
> table="test_table", keyspace="test").save()```
>
> I am getting
> ```
> py4j.protocol.Py4JJavaError: An error occurred while calling o47.save.
> : java.lang.NoSuchMethodError:
> scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
>
> at
> com.datastax.spark.connector.types.TypeConverter$.(TypeConverter.scala:116)
>
> at
> com.datastax.spark.connector.types.TypeConverter$.(TypeConverter.scala)
>
> at
> com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:50)
>
> at
> com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:46)
>
> at
> com.datastax.spark.connector.writer.SqlRowWriter$$anonfun$3.apply(SqlRowWriter.scala:18)
>
> at
> com.datastax.spark.connector.writer.SqlRowWriter$$anonfun$3.apply(SqlRowWriter.scala:18)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at
> com.datastax.spark.connector.writer.SqlRowWriter.(SqlRowWriter.scala:18)
>
> at
> com.datastax.spark.connector.writer.SqlRowWriter$Factory$.rowWriter(SqlRowWriter.scala:36)
>
> at
> com.datastax.spark.connector.writer.SqlRowWriter$Factory$.rowWriter(SqlRowWriter.scala:34)
>
> at
> com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:271)
>
> at
> com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:36)
>
> at
> org.apache.spark.sql.cassandra.CassandraSourceRelation.insert(CassandraSourceRelation.scala:66)
>
> at
> org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:85)
>
> at
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:429)
>
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> ```
>
> Google search for this error led to threads where folks were talking
> about using Scala version 2.10 instead of 2.11 for this issue. However, I
> am not using Scala, and I am assuming that the 2.10 in spark-cassandra-connector
> is the Scala version.
>
> Don't know how to fix or get around this issue. Appreciate any help.
>
> Thanks
> AT
>
>
>


Re: Write to Cassandra table from pyspark fails with scala reflect error

2016-09-14 Thread Trivedi Amit
Hi,



I am testing a pyspark program that will read from a csv file and write data 
into Cassandra table. I am using pyspark with spark-cassandra-connector 
2.10:2.0.0-M3. I am using Spark v2.0.0.

While executing below command

```df.write.format("org.apache.spark.sql.cassandra").mode('append').options( 
table="test_table", keyspace="test").save()```

I am getting
```
py4j.protocol.Py4JJavaError: An error occurred while calling o47.save.
: java.lang.NoSuchMethodError: 
scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
        at 
com.datastax.spark.connector.types.TypeConverter$.(TypeConverter.scala:116)
        at 
com.datastax.spark.connector.types.TypeConverter$.(TypeConverter.scala)
        at 
com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:50)
        at 
com.datastax.spark.connector.types.BigIntType$.converterToCassandra(PrimitiveColumnType.scala:46)
        at 
com.datastax.spark.connector.writer.SqlRowWriter$$anonfun$3.apply(SqlRowWriter.scala:18)
        at 
com.datastax.spark.connector.writer.SqlRowWriter$$anonfun$3.apply(SqlRowWriter.scala:18)
        at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at 
com.datastax.spark.connector.writer.SqlRowWriter.(SqlRowWriter.scala:18)
        at 
com.datastax.spark.connector.writer.SqlRowWriter$Factory$.rowWriter(SqlRowWriter.scala:36)
        at 
com.datastax.spark.connector.writer.SqlRowWriter$Factory$.rowWriter(SqlRowWriter.scala:34)
        at 
com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:271)
        at 
com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:36)
        at 
org.apache.spark.sql.cassandra.CassandraSourceRelation.insert(CassandraSourceRelation.scala:66)
        at 
org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:85)
        at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:429)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:211)
        at java.lang.Thread.run(Thread.java:745)
```

Google search for this error led to threads where folks were talking about 
using Scala version 2.10 instead of 2.11 for this issue. However, I am not 
using Scala, and I am assuming that the 2.10 in spark-cassandra-connector is the 
Scala version. 

Don't know how to fix or get around this issue. Appreciate any help.
Thanks
AT


   

Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
The performance I mentioned here is all local (my laptop).
I have tried the same thing on a cluster (Elastic MapReduce) and have seen
even worse results.

Is there a way this can be done efficiently? Has any of you tried it?


On Wednesday, September 14, 2016, Jörn Franke  wrote:

> It could be that by using the rdd it converts the data from the internal
> format to Java objects (-> much more memory is needed), which may lead to
> spill over to disk. This conversion takes a lot of time. Then, you need to
> transfer these Java objects via network to one single node (repartition
> ...), which takes on a 1 gbit network for 3 gb (since it may transfer Java
> objects this might be even more for 3 gb) under optimal conditions ca 25
> seconds (if no other transfers happening at the same time, jumbo frames
> activated etc). On the destination node we may have again spill over to
> disk. Then you store them to a single disk (potentially multiple if you
> have and use HDFS) which takes also time (assuming that no other process
> uses this disk).
>
> Btw spark-csv can be used with different dataframes.
> As said, other options are compression, avoid repartitioning (to avoid
> network transfer), avoid spilling to disk (provide memory in yarn etc),
> increase network bandwidth ...
>
> On 14 Sep 2016, at 14:22, sanat kumar Patnaik  > wrote:
>
> These are not CSV files; they are UTF-8 files with a specific delimiter.
> I tried this out with a file(3 GB):
>
> myDF.write.json("output/myJson")
> Time taken- 60 secs approximately.
>
> myDF.rdd.repartition(1).saveAsTextFile("output/text")
> Time taken 160 secs
>
> That is what concerns me: the time to write a text file, compared to
> JSON, grows exponentially.
>
> On Wednesday, September 14, 2016, Mich Talebzadeh <
> mich.talebza...@gmail.com
> > wrote:
>
>> These intermediate files, what sort of files are they? Are they CSV-type
>> files?
>>
>> I agree that a DF is more efficient than an RDD as it follows a tabular
>> format (I assume that is what you mean by "columnar" format). So if you
>> read these files in a batch process you may not worry too much about
>> execution time?
>>
>> A textFile saving is simply a one to one mapping from your DF to HDFS. I
>> think it is pretty efficient.
>>
>> For myself, I would do something like below
>>
>> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 14 September 2016 at 12:46, sanat kumar Patnaik <
>> patnaik.sa...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>>
>>>- I am writing a batch application using Spark SQL and Dataframes.
>>>This application has a bunch of file joins and there are intermediate
>>>points where I need to drop a file for downstream applications to 
>>> consume.
>>>- The problem is all these downstream applications are still on
>>>legacy, so they still require us to drop them a text file. As you all
>>>probably know, a DataFrame stores its data in a columnar format internally.
>>>
>>> Only way I found out how to do this and which looks awfully slow is this:
>>>
>>> myDF=sc.textFile("inputpath").toDF()
>>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>>
>>> Is there any better way to do this?
>>>
>>> *P.S: *The other workaround would be to use RDDs for all my operations.
>>> But I am wary of using them as the documentation says Dataframes are way
>>> faster because of the Catalyst engine running behind the scene.
>>>
>>> Please suggest if any of you might have tried something similar.
>>>
>>
>>
>
> --
> Regards,
> Sanat Patnaik
> Cell->804-882-6424
>
>

-- 
Regards,
Sanat Patnaik
Cell->804-882-6424
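
For reference, a minimal sketch of the spark-csv route Jörn mentions above, assuming Spark 1.6.1 with the com.databricks:spark-csv package on the classpath; coalesce(1) avoids the extra shuffle that repartition(1) triggers, at the cost of computing the final stage in a single task:

```
// Write the DataFrame directly as delimited text without converting to an RDD first.
myDF.coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")   // tab-separated output; adjust to the legacy delimiter
  .option("header", "false")
  .save("mypath/output")
```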


Job Opportunity

2016-09-14 Thread Data Junkie
Looking for seasoned Apache Spark/Scala developers in the US - east coast or
west coast. Please mail to apachesparkrecruitm...@gmail.com if interested
with your resume.

No headhunters/outsourcing.


Job Opportunity

2016-09-14 Thread datajunkie
Looking for seasoned Apache Spark/Scala developers in the US - east coast or
west coast. Please mail to apachesparkrecruitm...@gmail.com if interested
with your resume.

No headhunters/outsourcing. 






Re: Please assist: migrating RandomForestExample from MLLib to ML

2016-09-14 Thread Marco Mistroni
many thanks Sean!
kr
 marco

On Wed, Sep 14, 2016 at 10:33 PM, Sean Owen  wrote:

> If it helps, I've already updated that code for the 2nd edition, which
> will be based on ~Spark 2.1:
>
> https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/
> scala/com/cloudera/datascience/rdf/RunRDF.scala#L220
>
> This should be an equivalent working example that deals with
> categoricals via VectorIndexer.
>
> You're right that you must use it because it adds the metadata that
> says it's categorical. I'm not sure of another way to do it?
>
> Sean
>
>
> On Wed, Sep 14, 2016 at 10:18 PM, Marco Mistroni 
> wrote:
> > hi all
> >  i have been toying around with this well known RandomForestExample code
> >
> > val forest = RandomForest.trainClassifier(
> >   trainData, 7, Map(10 -> 4, 11 -> 40), 20,
> >   "auto", "entropy", 30, 300)
> >
> > This comes from this link
> > (https://www.safaribooksonline.com/library/view/advanced-analytics-with/
> 9781491912751/ch04.html),
> > and also Sean Owen's presentation
> >
> > (https://www.youtube.com/watch?v=ObiCMJ24ezs)
> >
> >
> >
> > and now i want to migrate it to use ML Libraries.
> > The problem i have is that the MLLib  example has categorical features,
> and
> > i cannot find
> > a way to use categorical features with ML
> > Apparently i should use VectorIndexer, but VectorIndexer assumes only one
> > input
> > column for features.
> > I am at the moment using Vectorassembler instead, but i cannot find a
> way to
> > achieve the
> > same
> > I have checed spark samples, but all i can see is RandomForestClassifier
> > using VectorIndexer for 1 feature
> >
> >
> >
> > Could anyone assist?
> > This is my current codewhat do i need to add to take into account
> > categorical features?
> >
> > val labelIndexer = new StringIndexer()
> >   .setInputCol("Col0")
> >   .setOutputCol("indexedLabel")
> >   .fit(data)
> >
> > val features = new VectorAssembler()
> >   .setInputCols(Array(
> > "Col1", "Col2", "Col3", "Col4", "Col5",
> > "Col6", "Col7", "Col8", "Col9", "Col10"))
> >   .setOutputCol("features")
> >
> > val labelConverter = new IndexToString()
> >   .setInputCol("prediction")
> >   .setOutputCol("predictedLabel")
> >   .setLabels(labelIndexer.labels)
> >
> > val rf = new RandomForestClassifier()
> >   .setLabelCol("indexedLabel")
> >   .setFeaturesCol("features")
> >   .setNumTrees(20)
> >   .setMaxDepth(30)
> >   .setMaxBins(300)
> >   .setImpurity("entropy")
> >
> > println("Kicking off pipeline..")
> >
> > val pipeline = new Pipeline()
> >   .setStages(Array(labelIndexer, features, rf, labelConverter))
> >
> > thanks in advance and regards
> >  Marco
> >
>


Re: Please assist: migrating RandomForestExample from MLLib to ML

2016-09-14 Thread Sean Owen
If it helps, I've already updated that code for the 2nd edition, which
will be based on ~Spark 2.1:

https://github.com/sryza/aas/blob/master/ch04-rdf/src/main/scala/com/cloudera/datascience/rdf/RunRDF.scala#L220

This should be an equivalent working example that deals with
categoricals via VectorIndexer.

You're right that you must use it because it adds the metadata that
says it's categorical. I'm not sure of another way to do it?

Sean
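
For reference, a minimal sketch of that approach adapted to Marco's column names below; `data` (the input DataFrame) and the exact maxCategories value are assumptions:

```
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}

val labelIndexer = new StringIndexer()
  .setInputCol("Col0")
  .setOutputCol("indexedLabel")
  .fit(data)

val assembler = new VectorAssembler()
  .setInputCols(Array("Col1", "Col2", "Col3", "Col4", "Col5",
                      "Col6", "Col7", "Col8", "Col9", "Col10"))
  .setOutputCol("rawFeatures")

// VectorIndexer scans the assembled vector and marks every column with at most
// maxCategories distinct values as categorical in the column metadata, which is
// what RandomForestClassifier reads. Note that any genuinely continuous column
// with few distinct values would also be flagged as categorical.
val featureIndexer = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMaxCategories(40)   // at least the largest categorical arity (40 in the MLlib example)

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(20)
  .setMaxDepth(30)
  .setMaxBins(300)        // maxBins must cover the largest number of categories
  .setImpurity("entropy")

val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, assembler, featureIndexer, rf, labelConverter))
```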


On Wed, Sep 14, 2016 at 10:18 PM, Marco Mistroni  wrote:
> hi all
>  i have been toying around with this well known RandomForestExample code
>
> val forest = RandomForest.trainClassifier(
>   trainData, 7, Map(10 -> 4, 11 -> 40), 20,
>   "auto", "entropy", 30, 300)
>
> This comes from this link
> (https://www.safaribooksonline.com/library/view/advanced-analytics-with/9781491912751/ch04.html),
> and also Sean Owen's presentation
>
> (https://www.youtube.com/watch?v=ObiCMJ24ezs)
>
>
>
> and now i want to migrate it to use ML Libraries.
> The problem i have is that the MLLib  example has categorical features, and
> i cannot find
> a way to use categorical features with ML
> Apparently i should use VectorIndexer, but VectorIndexer assumes only one
> input
> column for features.
> I am at the moment using Vectorassembler instead, but i cannot find a way to
> achieve the
> same
> I have checed spark samples, but all i can see is RandomForestClassifier
> using VectorIndexer for 1 feature
>
>
>
> Could anyone assist?
> This is my current codewhat do i need to add to take into account
> categorical features?
>
> val labelIndexer = new StringIndexer()
>   .setInputCol("Col0")
>   .setOutputCol("indexedLabel")
>   .fit(data)
>
> val features = new VectorAssembler()
>   .setInputCols(Array(
> "Col1", "Col2", "Col3", "Col4", "Col5",
> "Col6", "Col7", "Col8", "Col9", "Col10"))
>   .setOutputCol("features")
>
> val labelConverter = new IndexToString()
>   .setInputCol("prediction")
>   .setOutputCol("predictedLabel")
>   .setLabels(labelIndexer.labels)
>
> val rf = new RandomForestClassifier()
>   .setLabelCol("indexedLabel")
>   .setFeaturesCol("features")
>   .setNumTrees(20)
>   .setMaxDepth(30)
>   .setMaxBins(300)
>   .setImpurity("entropy")
>
> println("Kicking off pipeline..")
>
> val pipeline = new Pipeline()
>   .setStages(Array(labelIndexer, features, rf, labelConverter))
>
> thanks in advance and regards
>  Marco
>




Please assist: migrating RandomForestExample from MLLib to ML

2016-09-14 Thread Marco Mistroni
hi all
 i have been toying around with this well known RandomForestExample code

val forest = RandomForest.trainClassifier(
  trainData, 7, Map(10 -> 4, 11 -> 40), 20,
  "auto", "entropy", 30, 300)

This comes from this link (
https://www.safaribooksonline.com/library/view/advanced-analytics-with/9781491912751/ch04.html),
and also Sean Owen's presentation

(https://www.youtube.com/watch?v=ObiCMJ24ezs)



and now I want to migrate it to use the ML libraries.
The problem I have is that the MLlib example has categorical features, and
I cannot find a way to use categorical features with ML.
Apparently I should use VectorIndexer, but VectorIndexer assumes only one
input column for features.
I am at the moment using VectorAssembler instead, but I cannot find a way
to achieve the same.
I have checked the Spark samples, but all I can see is RandomForestClassifier
using VectorIndexer for 1 feature.



Could anyone assist?
This is my current code... what do I need to add to take into account
categorical features?

val labelIndexer = new StringIndexer()
  .setInputCol("Col0")
  .setOutputCol("indexedLabel")
  .fit(data)

val features = new VectorAssembler()
  .setInputCols(Array(
"Col1", "Col2", "Col3", "Col4", "Col5",
"Col6", "Col7", "Col8", "Col9", "Col10"))
  .setOutputCol("features")

val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(20)
  .setMaxDepth(30)
  .setMaxBins(300)
  .setImpurity("entropy")

println("Kicking off pipeline..")

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, features, rf, labelConverter))

thanks in advance and regards
 Marco


Re: Streaming - lookup against reference data

2016-09-14 Thread Jörn Franke
Hmm, is it just a lookup and the values are small? I do not think that in this 
case Redis needs to be installed on each worker node. Redis has a rather 
efficient protocol, so one or a few dedicated Redis nodes are probably more than 
enough for your purpose. Just try to reuse connections and do not establish a 
new one for each lookup from the same node.

Additionally Redis has a lot of interesting data structures such as 
hyperloglogs.

Hbase - you can design here where to store which part of the reference data set 
and partition in Spark accordingly. Depends on the data and is tricky.

About the other options I am a bit skeptical, especially since you need to 
include updated data, which might have side effects.

Nevertheless, you mention all the options that are possible. I guess for a true 
evaluation you have to check your use case, the envisioned future architecture 
for other use cases, required performance, maintainability etc.
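
For reference, a minimal sketch of option 1 from the list quoted below (a small reference set re-broadcast periodically from the driver); loadReferenceData() is a hypothetical stand-in for however the reference data is actually read (JDBC, HDFS, Redis, ...):

```
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object ReferenceData {
  private val refreshIntervalMs = 30 * 60 * 1000L   // 30 min of staleness is acceptable here
  @volatile private var cached: Broadcast[Map[String, String]] = null
  @volatile private var lastLoad = 0L

  // Hypothetical loader; replace with the real source of the reference set.
  private def loadReferenceData(): Map[String, String] = Map.empty

  // Called on the driver (e.g. at the top of foreachRDD); re-broadcasts when stale.
  def get(sc: SparkContext): Broadcast[Map[String, String]] = synchronized {
    val now = System.currentTimeMillis()
    if (cached == null || now - lastLoad > refreshIntervalMs) {
      if (cached != null) cached.unpersist()
      cached = sc.broadcast(loadReferenceData())
      lastLoad = now
    }
    cached
  }
}

// Usage inside the streaming job: the lookup stays a local map access on the executors.
// dstream.foreachRDD { rdd =>
//   val ref = ReferenceData.get(rdd.sparkContext)
//   rdd.map(event => (event, ref.value.get(event)))
// }
```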

> On 14 Sep 2016, at 20:44, Tom Davis  wrote:
> 
> Hi all,
> 
> Interested in patterns people use in the wild for lookup against reference 
> data sets from a Spark streaming job. The reference dataset will be updated 
> during the life of the job (although being 30mins out of date wouldn't be an 
> issue, for example). 
> 
> So far I have come up with a few options, all of which have advantages and 
> disadvantages:
> 
> 1. For small reference datasets, distribute the data as an in memory Map() 
> from the driver, refreshing it inside the foreachRDD() loop. 
> 
> Obviously the limitation here is size. 
> 
> 2. Run a Redis (or similar) cache on each worker node, perform lookups 
> against this. 
> 
> There's some complexity to managing this, probably outside of the Spark job.
> 
> 3. Load the reference data into an RDD, again inside the foreachRDD() loop on 
> the driver. Perform a join of the reference and stream batch RDDs. Perhaps 
> keep the reference RDD in memory. 
> 
> I suspect that this will scale, but I also suspect there's going to be the 
> potential for a lot of data shuffling across the network which will slow 
> things down. 
> 
> 4. Similar to the Redis option, but use Hbase. Scales well and makes data 
> available to other services but is a call out over the network, albeit within 
> the cluster.
> 
> I guess there's no solution that fits all, but interested in other people's 
> experience and whether I've missed anything obvious. 
> 
> Thanks,
> 
> Tom


Re: RMSE in ALS

2016-09-14 Thread Sean Owen
Yes, that's what TF-IDF is, but it's just a statistic and not a
ranking. If you're using that to fill in a user-item matrix then that
is your model; you don't need ALS. Building an ALS model on this is
kind of like building a model on a model. Applying RMSE in this case
is a little funny, given the distribution of TF-IDF values. It's hard
to say what's normal but you're saying the test error is both 2.3 and
32.5. Regardless of which is really the test error it indicates
something is wrong with the modeling process. These ought not be too
different.
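
For reference, a minimal sketch of the ranking-metrics route (mean average precision) that Sean suggests elsewhere in this thread, assuming the MLlib ALS API (Spark 1.6) with a fitted MatrixFactorizationModel and a held-out RDD[Rating]:

```
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

def meanAveragePrecisionAtK(model: MatrixFactorizationModel,
                            heldOut: RDD[Rating],
                            k: Int = 10): Double = {
  // Ground truth: the products each user actually interacted with in the held-out set.
  val actual: RDD[(Int, Array[Int])] =
    heldOut.map(r => (r.user, r.product)).groupByKey().mapValues(_.toArray)

  // Top-k recommendations per user from the fitted model.
  val predicted: RDD[(Int, Array[Int])] =
    model.recommendProductsForUsers(k).mapValues(_.map(_.product))

  val predictionAndLabels = predicted.join(actual).values
  new RankingMetrics(predictionAndLabels).meanAveragePrecision
}
```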

On Wed, Sep 14, 2016 at 9:22 PM, Pasquinell Urbani
 wrote:
> The implicit rankings are the output of TF-IDF, i.e.:
> Each ranking = frequency of an item * log(total number of customers / number of
> customers buying the item)
>
>
>> On 14 Sept 2016 at 17:14, "Sean Owen"  wrote:
>>
>> What are implicit rankings here?
>> RMSE would not be an appropriate measure for comparing rankings. There are
>> ranking metrics like mean average precision that would be appropriate
>> instead.
>>
>> On Wed, Sep 14, 2016 at 9:11 PM, Pasquinell Urbani
>>  wrote:
>>>
>>> It was a typo mistake, both are rmse.
>>>
>>> The frecency distribution of rankings is the following
>>>
>>>
>>>
>>> As you can see, I have heavy tail, but the majority of the observations
>>> rely near ranking  5.
>>>
>>> I'm working with implicit rankings (generated by TF-IDF), can this affect
>>> the error? (I'm currently using trainImplicit in ALS, spark 1.6.2)
>>>
>>> Thank you.
>>>
>>>
>>>
>>> 2016-09-14 16:49 GMT-03:00 Sean Owen :

 There is no way to answer this without knowing what your inputs are
 like. If they're on the scale of thousands, that's small (good). If
 they're on the scale of 1-5, that's extremely poor.

 What's RMS vs RMSE?

 On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani
  wrote:
 > Hi Community
 >
 > I'm performing an ALS for retail product recommendation. Right now I'm
 > reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
 > experience? Does the transformation of the ranking values important
 > for
 > having good errors?
 >
 > Thank you all.
 >
 > Pasquinell Urbani
>>>
>>>
>>
>




Re: RMSE in ALS

2016-09-14 Thread Pasquinell Urbani
The implicit rankings are the output of TF-IDF, i.e.:
Each ranking = frequency of an item * log(total number of customers / number
of customers buying the item)

On 14 Sept 2016 at 17:14, "Sean Owen"  wrote:

> What are implicit rankings here?
> RMSE would not be an appropriate measure for comparing rankings. There are
> ranking metrics like mean average precision that would be appropriate
> instead.
>
> On Wed, Sep 14, 2016 at 9:11 PM, Pasquinell Urbani <
> pasquinell.urb...@exalitica.com> wrote:
>
>> It was a typo mistake, both are rmse.
>>
>> The frecency distribution of rankings is the following
>>
>> [image: inline image 2]
>>
>> As you can see, I have heavy tail, but the majority of the observations
>> rely near ranking  5.
>>
>> I'm working with implicit rankings (generated by TF-IDF), can this affect
>> the error? (I'm currently using trainImplicit in ALS, spark 1.6.2)
>>
>> Thank you.
>>
>>
>>
>> 2016-09-14 16:49 GMT-03:00 Sean Owen :
>>
>>> There is no way to answer this without knowing what your inputs are
>>> like. If they're on the scale of thousands, that's small (good). If
>>> they're on the scale of 1-5, that's extremely poor.
>>>
>>> What's RMS vs RMSE?
>>>
>>> On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani
>>>  wrote:
>>> > Hi Community
>>> >
>>> > I'm performing an ALS for retail product recommendation. Right now I'm
>>> > reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
>>> > experience? Does the transformation of the ranking values important for
>>> > having good errors?
>>> >
>>> > Thank you all.
>>> >
>>> > Pasquinell Urbani
>>>
>>
>>
>


Re: Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Jeff Nadler
Sure the partitions exist, but is there data in all partitions?   Try the
kafka offset checker:

kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zookeeper
localhost:2181 -group  -topic 

On Wed, Sep 14, 2016 at 1:00 PM, 
wrote:

> Sure, thanks, I have removed dev. I do see all the partitions created
> correctly on the Kafka side.
>
>
>
> Topic:CEQReceiver   PartitionCount:5ReplicationFactor:1
> Configs:
>
> Topic: CEQReceiver  Partition: 0Leader: 90  Replicas:
> 90Isr: 90
>
> Topic: CEQReceiver  Partition: 1Leader: 86  Replicas:
> 86Isr: 86
>
> Topic: CEQReceiver  Partition: 2Leader: 87  Replicas:
> 87Isr: 87
>
> Topic: CEQReceiver  Partition: 3Leader: 88  Replicas:
> 88Isr: 88
>
> Topic: CEQReceiver  Partition: 4Leader: 89  Replicas:
> 89Isr: 89
>
>
>
> *From:* Jeff Nadler [mailto:jnad...@srcginc.com]
> *Sent:* Wednesday, September 14, 2016 12:46 PM
> *To:* Rachana Srivastava
> *Cc:* user@spark.apache.org; d...@spark.apache.org
> *Subject:* Re: Not all KafkaReceivers processing the data Why?
>
>
>
> Have you checked your Kafka brokers to be certain that data is going to
> all 5 partitions?We use something very similar (but in Scala) and have
> no problems.
>
>
>
> Also you might not get the best response blasting both user+dev lists like
> this.   Normally you'd want to use 'user' only.
>
>
>
> -Jeff
>
>
>
>
>
> On Wed, Sep 14, 2016 at 12:33 PM, Rachana Srivastava  markmonitor.com> wrote:
>
> Hello all,
>
>
>
> I have created a Kafka topic with 5 partitions.  And I am using
> createStream receiver API like following.   But somehow only one receiver
> is getting the input data. Rest of receivers are not processign anything.
> Can you please help?
>
>
>
> JavaPairDStream messages = null;
>
>
>
> if(sparkStreamCount > 0){
>
> // We create an input DStream for each partition of the
> topic, unify those streams, and then repartition the unified stream.
>
> List> kafkaStreams = new
> ArrayList>(sparkStreamCount);
>
> for (int i = 0; i < sparkStreamCount; i++) {
>
> kafkaStreams.add(
> KafkaUtils.createStream(jssc, contextVal.getString(KAFKA_ZOOKEEPER),
> contextVal.getString(KAFKA_GROUP_ID), kafkaTopicMap));
>
> }
>
> messages = jssc.union(kafkaStreams.get(0),
> kafkaStreams.subList(1, kafkaStreams.size()));
>
> }
>
> else{
>
> messages =  KafkaUtils.createStream(jssc,
> contextVal.getString(KAFKA_ZOOKEEPER), contextVal.getString(KAFKA_GROUP_ID),
> kafkaTopicMap);
>
> }
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: RMSE in ALS

2016-09-14 Thread Sean Owen
What are implicit rankings here?
RMSE would not be an appropriate measure for comparing rankings. There are
ranking metrics like mean average precision that would be appropriate
instead.

On Wed, Sep 14, 2016 at 9:11 PM, Pasquinell Urbani <
pasquinell.urb...@exalitica.com> wrote:

> It was a typo mistake, both are rmse.
>
> The frecency distribution of rankings is the following
>
> [image: inline image 2]
>
> As you can see, I have heavy tail, but the majority of the observations
> rely near ranking  5.
>
> I'm working with implicit rankings (generated by TF-IDF), can this affect
> the error? (I'm currently using trainImplicit in ALS, spark 1.6.2)
>
> Thank you.
>
>
>
> 2016-09-14 16:49 GMT-03:00 Sean Owen :
>
>> There is no way to answer this without knowing what your inputs are
>> like. If they're on the scale of thousands, that's small (good). If
>> they're on the scale of 1-5, that's extremely poor.
>>
>> What's RMS vs RMSE?
>>
>> On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani
>>  wrote:
>> > Hi Community
>> >
>> > I'm performing an ALS for retail product recommendation. Right now I'm
>> > reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
>> > experience? Does the transformation of the ranking values important for
>> > having good errors?
>> >
>> > Thank you all.
>> >
>> > Pasquinell Urbani
>>
>
>


Re: RMSE in ALS

2016-09-14 Thread Pasquinell Urbani
It was a typo mistake; both are RMSE.

The frequency distribution of rankings is the following:

[image: inline image 2]

As you can see, I have a heavy tail, but the majority of the observations
lie near ranking 5.

I'm working with implicit rankings (generated by TF-IDF); can this affect
the error? (I'm currently using trainImplicit in ALS, Spark 1.6.2.)

Thank you.



2016-09-14 16:49 GMT-03:00 Sean Owen :

> There is no way to answer this without knowing what your inputs are
> like. If they're on the scale of thousands, that's small (good). If
> they're on the scale of 1-5, that's extremely poor.
>
> What's RMS vs RMSE?
>
> On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani
>  wrote:
> > Hi Community
> >
> > I'm performing an ALS for retail product recommendation. Right now I'm
> > reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
> > experience? Does the transformation of the ranking values important for
> > having good errors?
> >
> > Thank you all.
> >
> > Pasquinell Urbani
>


Re: Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Jeremy Smith
Take a look at how the messages are actually distributed across the
partitions. If the message keys have a low cardinality, you might get poor
distribution (i.e. all the messages are actually only in two of the five
partitions, leading to what you see in Spark).

If you take a look at the Kafka data directories, you can probably get an
idea of the distribution by just examining the sizes of each partition.

Jeremy
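
As a rough illustration of Jeremy's point (this is not Kafka's exact partitioner code, just the hash-mod-partitions idea the old producer's default partitioner uses for keyed messages):

```
val numPartitions = 5
val keys = Seq("priceFeed", "tradeFeed", "priceFeed", "tradeFeed")   // hypothetical low-cardinality keys
val partitionsUsed = keys.map(k => math.abs(k.hashCode) % numPartitions).distinct
println(s"partitions actually receiving data: $partitionsUsed")      // at most 2 of the 5
```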

On Wed, Sep 14, 2016 at 12:33 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> Hello all,
>
>
>
> I have created a Kafka topic with 5 partitions.  And I am using
> createStream receiver API like following.   But somehow only one receiver
> is getting the input data. Rest of receivers are not processign anything.
> Can you please help?
>
>
>
> JavaPairDStream messages = null;
>
>
>
> if(sparkStreamCount > 0){
>
> // We create an input DStream for each partition of the
> topic, unify those streams, and then repartition the unified stream.
>
> List> kafkaStreams = new
> ArrayList>(sparkStreamCount);
>
> for (int i = 0; i < sparkStreamCount; i++) {
>
> kafkaStreams.add(
> KafkaUtils.createStream(jssc, contextVal.getString(KAFKA_ZOOKEEPER),
> contextVal.getString(KAFKA_GROUP_ID), kafkaTopicMap));
>
> }
>
> messages = jssc.union(kafkaStreams.get(0),
> kafkaStreams.subList(1, kafkaStreams.size()));
>
> }
>
> else{
>
> messages =  KafkaUtils.createStream(jssc,
> contextVal.getString(KAFKA_ZOOKEEPER), contextVal.getString(KAFKA_GROUP_ID),
> kafkaTopicMap);
>
> }
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: RMSE in ALS

2016-09-14 Thread Sean Owen
There is no way to answer this without knowing what your inputs are
like. If they're on the scale of thousands, that's small (good). If
they're on the scale of 1-5, that's extremely poor.

What's RMS vs RMSE?

On Wed, Sep 14, 2016 at 8:33 PM, Pasquinell Urbani
 wrote:
> Hi Community
>
> I'm performing an ALS for retail product recommendation. Right now I'm
> reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
> experience? Does the transformation of the ranking values important for
> having good errors?
>
> Thank you all.
>
> Pasquinell Urbani




Re: Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Jeff Nadler
Have you checked your Kafka brokers to be certain that data is going to all
5 partitions?We use something very similar (but in Scala) and have no
problems.

Also you might not get the best response blasting both user+dev lists like
this.   Normally you'd want to use 'user' only.

-Jeff


On Wed, Sep 14, 2016 at 12:33 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> Hello all,
>
>
>
> I have created a Kafka topic with 5 partitions.  And I am using
> createStream receiver API like following.   But somehow only one receiver
> is getting the input data. Rest of receivers are not processign anything.
> Can you please help?
>
>
>
> JavaPairDStream messages = null;
>
>
>
> if(sparkStreamCount > 0){
>
> // We create an input DStream for each partition of the
> topic, unify those streams, and then repartition the unified stream.
>
> List> kafkaStreams = new
> ArrayList>(sparkStreamCount);
>
> for (int i = 0; i < sparkStreamCount; i++) {
>
> kafkaStreams.add(
> KafkaUtils.createStream(jssc, contextVal.getString(KAFKA_ZOOKEEPER),
> contextVal.getString(KAFKA_GROUP_ID), kafkaTopicMap));
>
> }
>
> messages = jssc.union(kafkaStreams.get(0),
> kafkaStreams.subList(1, kafkaStreams.size()));
>
> }
>
> else{
>
> messages =  KafkaUtils.createStream(jssc,
> contextVal.getString(KAFKA_ZOOKEEPER), contextVal.getString(KAFKA_GROUP_ID),
> kafkaTopicMap);
>
> }
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


Not all KafkaReceivers processing the data Why?

2016-09-14 Thread Rachana Srivastava
Hello all,

I have created a Kafka topic with 5 partitions, and I am using the createStream 
receiver API as follows. But somehow only one receiver is getting the 
input data; the rest of the receivers are not processing anything. Can you please help?

JavaPairDStream<String, String> messages = null;

if (sparkStreamCount > 0) {
    // We create an input DStream for each partition of the topic,
    // unify those streams, and then repartition the unified stream.
    List<JavaPairDStream<String, String>> kafkaStreams =
            new ArrayList<JavaPairDStream<String, String>>(sparkStreamCount);
    for (int i = 0; i < sparkStreamCount; i++) {
        kafkaStreams.add(KafkaUtils.createStream(jssc,
                contextVal.getString(KAFKA_ZOOKEEPER),
                contextVal.getString(KAFKA_GROUP_ID), kafkaTopicMap));
    }
    messages = jssc.union(kafkaStreams.get(0),
            kafkaStreams.subList(1, kafkaStreams.size()));
} else {
    messages = KafkaUtils.createStream(jssc,
            contextVal.getString(KAFKA_ZOOKEEPER),
            contextVal.getString(KAFKA_GROUP_ID), kafkaTopicMap);
}









RMSE in ALS

2016-09-14 Thread Pasquinell Urbani
Hi Community

I'm performing an ALS for retail product recommendation. Right now I'm
reaching rms_test = 2.3 and rmse_test = 32.5. Is this too much in your
experience? Is the transformation of the ranking values important for
having good errors?

Thank you all.

Pasquinell Urbani


LIVY VS Spark Job Server

2016-09-14 Thread SamyaMaiti
Hi Team,

I am evaluating different ways to submit & monitor Spark jobs using REST
interfaces.

When to use Livy vs Spark Job Server?

Regards,
Sam






CPU Consumption of spark process

2016-09-14 Thread شجاع الرحمن بیگ
Hi All,

I have a system with 6 physical cores and each core has 8 hardware threads
resulting in 48 virtual cores. Following are the setting in configuration
files.

*spark-env.sh*
export SPARK_WORKER_CORES=1

*spark-defaults.conf*
spark.driver.cores 1
spark.executor.cores 1
spark.cores.max 1

So it should only use 1 virtual core, but if we look at the output of the top
command, it sometimes shows huge spikes, e.g. CPU consumption above 4000%:

  PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
22581 sbaig  20   0  0.278t  0.064t  37312 S  4728  6.4   7:11.30 java
...
22581 sbaig  20   0  0.278t  0.065t  37312 S  1502  6.5   8:22.75 java
...
22581 sbaig  20   0  0.278t  0.065t  37312 S  4035  6.6   9:51.64 java
...
22581 sbaig  20   0  0.278t  0.080t  37312 S  3445  8.1  15:06.26 java
...
22581 sbaig  20   0  0.278t  0.082t  37312 S  4178  8.2  17:37.59 java

This means that, instead of using 1 virtual core, Spark is using all available
cores in the system. So my question is: why is it behaving like this? Why is it
not using only the 1 core we set via the SPARK_WORKER_CORES property during
execution of the job?

I am using spark 1.6.1 with standalone mode.

Any help will be highly appreciated.
Thanks
Shuja

-- 
Regards
Shuja-ur-Rehman Baig



Streaming - lookup against reference data

2016-09-14 Thread Tom Davis
Hi all,

Interested in patterns people use in the wild for lookup against reference
data sets from a Spark streaming job. The reference dataset will be updated
during the life of the job (although being 30mins out of date wouldn't be
an issue, for example).

So far I have come up with a few options, all of which have advantages and
disadvantages:

1. For small reference datasets, distribute the data as an in memory Map()
from the driver, refreshing it inside the foreachRDD() loop.

Obviously the limitation here is size.

2. Run a Redis (or similar) cache on each worker node, perform lookups
against this.

There's some complexity to managing this, probably outside of the Spark job.

3. Load the reference data into an RDD, again inside the foreachRDD() loop
on the driver. Perform a join of the reference and stream batch RDDs.
Perhaps keep the reference RDD in memory.

I suspect that this will scale, but I also suspect there's going to be the
potential for a lot of data shuffling across the network which will slow
things down.

4. Similar to the Redis option, but use Hbase. Scales well and makes data
available to other services but is a call out over the network, albeit
within the cluster.

I guess there's no solution that fits all, but interested in other people's
experience and whether I've missed anything obvious.

Thanks,

Tom


Re: Spark Interview questions

2016-09-14 Thread Jacek Laskowski
Hi,

Doh, Mich, it's way too much to ask for "typical Spark interview
questions for Spark/Scala junior roles". There are plenty of such
questions and I don't think there's a way to have them all noted down.

Spark supports 5 languages, offers 4 modules + Core, and presents
itself differently to developers, admins and performance g33ks. With 3
supported cluster managers thrown in, you can see why I'm staying far from such
questions. Too much to handle.

Pass.

p.s. The more I'm with Spark, the more I'm overwhelmed by how complex it
is. So many sections with FIXMEs/TODOs in my Spark notes...

Regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Sep 14, 2016 at 4:09 PM, Mich Talebzadeh
 wrote:
> Hi Ashok,
>
> I am sure we all have some war stories some of which I recall:
>
> What is meant by RDD, DataFrame and Dataset
> What is meant by "All transformations in Spark are lazy"?
> What are the two types of operations supported by RDD?
> What is meant by Spark running under a certain mode?
> Explain the difference between Spark Running in a Standalone mode and Yarn
> cluster mode
> What is the difference between Spark running in Yarn client mode and Yarn
> cluster mode.
> What is the difference between persist and cache
> If you cache a DataFrame what does it do and where is the memory consumed
> come from. Can you give a place where you can see its measurements
> What is meant by DAG? A broad outline
> What is shuffling in Spark. How can you minimise its impact
> How would you specify your spark hardware in a medium size set-up say 8 node
> cluster.
> How could one minimise the network latency within Spark and the underlying
> storage (assuming HDFS here)
> How can you parallelize your JDBC connection to a database say any RDBMS?
> How does it work
> What is the use case for Spark Thrift Server.
> How would you typically read and process a tab separated file into Spark
> If you have an OOM message in Spark how would you go about diagnosing the
> problem
> What is meant by spark-submit. How would you use it
> What is a Spark driver? If you run Spark in Local mode how many executors
> can you start
> What is meant by Spark Streaming. What is a use case example
> In Spark Streaming what parameters are important
> What are the typical analytic functions in Spark SQL
> What is the difference between RANK and DENSE_RANK
>
>
> I am sure there are many other questions that one think of. For example,
> someone like Jacek Laskowski can provide more programming questions as he is
> a professional Spark trainer :)
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 14 September 2016 at 12:35, Ashok Kumar 
> wrote:
>>
>> Hi,
>>
>> As a learner I appreciate if you have typical Spark interview questions
>> for Spark/Scala junior roles that you can please forward to me.
>>
>> I will be very obliged
>
>




Best Practices for Spark-Python App Deployment

2016-09-14 Thread RK Aduri
Dear All:

We are trying to deploy (using Jenkins) a Spark-Python app on an edge 
node; the dilemma is whether to clone the git repo to all the nodes in 
the cluster. The reason is that if we choose cluster deploy mode with yarn 
as the master, the driver expects the same file structure wherever 
it runs in the cluster. Therefore, I am just wondering if there are any best 
practices for this, or whether plainly cloning the git repos to all the 
nodes in the cluster is the way to go. Any suggestion is welcome.

Thanks,
RK
-- 
Collective[i] dramatically improves sales and marketing performance using 
technology, applications and a revolutionary network designed to provide 
next generation analytics and decision-support directly to business users. 
Our goal is to maximize human potential and minimize mistakes. In most 
cases, the results are astounding. We cannot, however, stop emails from 
sometimes being sent to the wrong person. If you are not the intended 
recipient, please notify us by replying to this email's sender and deleting 
it (and any attachments) permanently from your system. If you are, please 
respect the confidentiality of this communication's contents.




Re: Spark kafka integration issues

2016-09-14 Thread Cody Koeninger
Yeah, an updated version of that blog post is available at

https://github.com/koeninger/kafka-exactly-once
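
For reference, a minimal sketch of point 1 from Cody's earlier reply quoted below (per-partition offsets via HasOffsetRanges on the direct stream); ssc, kafkaParams and topics are assumed to already exist, with the spark-streaming-kafka-0-8 artifact on the classpath:

```
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // Each KafkaRDD carries the topic/partition/offset range it was built from.
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}
```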

On Wed, Sep 14, 2016 at 11:35 AM, Mukesh Jha  wrote:
> Thanks for the reply Cody.
>
> I found the below article on the same, very helpful. Thanks for the details,
> much appreciated.
>
> http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/
>
> On Tue, Sep 13, 2016 at 8:14 PM, Cody Koeninger  wrote:
>>
>> 1.  see
>> http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
>>  look for HasOffsetRange.  If you really want the info per-message
>> rather than per-partition, createRDD has an overload that takes a
>> messageHandler from MessageAndMetadata to whatever you need
>>
>> 2. createRDD takes type parameters for the key and value decoder, so
>> specify them there
>>
>> 3. you can use spark-streaming-kafka-0-8 against 0.9 or 0.10 brokers.
>> There is a spark-streaming-kafka-0-10 package with additional features
>> that only works on brokers 0.10 or higher.  A pull request for
>> documenting it has been merged, but not deployed.
>>
>> On Tue, Sep 13, 2016 at 6:46 PM, Mukesh Jha 
>> wrote:
>> > Hello fellow sparkers,
>> >
>> > I'm using spark to consume messages from kafka in a non streaming
>> > fashion.
>> > I'm using spark-streaming-kafka-0-8_2.10 & Spark v2.0 to do the
>> > same.
>> >
>> > I have a few queries for the same, please get back if you guys have
>> > clues on
>> > the same.
>> >
>> > 1) Is there any way to get the topic, partition & offset
>> > information for each item from the KafkaRDD. I'm using the
>> > KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder] to
>> > create
>> > my kafka RDD.
>> > 2) How to pass my custom Decoder instead of using the String or Byte
>> > decoder
>> > are there any examples for the same?
>> > 3) is there a newer version to consumer from kafka-0.10 & kafka-0.9
>> > clusters
>> >
>> > --
>> > Thanks & Regards,
>> >
>> > Mukesh Jha
>
>
>
>
> --
>
>
> Thanks & Regards,
>
> Mukesh Jha




Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-14 Thread Fred Reiss
+1 to this request. I talked last week with a product group within IBM that
is struggling with the same issue. It's pretty common in data cleaning
applications for data in the early stages to have nested lists or sets with
inconsistent or incomplete schema information.

Fred

On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> I'm currently trying to create a generic transformation mechanism on a
> Dataframe to modify an arbitrary column regardless of the underlying the
> schema.
>
> It's "relatively" straightforward for complex types like struct>
> to apply an arbitrary UDF on the column and replace the data "inside" the
> struct, however I'm struggling to make it work for complex types containing
> arrays along the way like struct>>.
>
> Michael Armbrust seemed to allude on the mailing list/forum to a way of
> using Encoders to do that, I'd be interested in any pointers, especially
> considering that it's not possible to output any Row or
> GenericRowWithSchema from a UDF (thanks to https://github.com/apache/
> spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/
> apache/spark/sql/catalyst/ScalaReflection.scala#L657 it seems).
>
> To sum up, I'd like to find a way to apply a transformation on complex
> nested datatypes (arrays and struct) on a Dataframe updating the value
> itself.
>
> Regards,
>
> *Olivier Girardot*
>


The coming data on Spark Streaming

2016-09-14 Thread pcandido
Hi everyone,

I'm starting with Spark Streaming and would like to know some things about how
data arrives.

I know that SS uses micro-batches, which are received by workers and stored as
an RDD. The master, at defined intervals, receives a pointer to the micro-batch
RDD and can use it to process data with mappers and reducers.

1. But before the master is called, can I work on the data? Can I do something
with each object as it arrives on the workers?

2. A data stream is normally denoted by an ordered sequence of data, but
when it arrives in micro-batches, I receive a lot of objects at the same
time. How can I determine the order of objects inside a batch? Can I extract
the arrival timestamp or an ordered ID for each object?

Thanks






Re: Reading the most recent text files created by Spark streaming

2016-09-14 Thread Jörn Franke
Hi,

An alternative to Spark could be Flume to store data from Kafka to HDFS. It 
also provides some reliability mechanisms and has been explicitly designed and 
tested for import/export. I am not sure I would go for Spark Streaming if the 
use case is only storing, but I do not have the full picture of your use case.

Anyway, what you could do is create a directory per hour/day etc. (whatever you 
need) and put the corresponding files there. If there are a lot of small files, 
you can put them into a Hadoop Archive (HAR) to reduce the load on the namenode.
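
A minimal sketch of that layout, reusing the foreachRDD shape from Mich's code quoted below; the base path and date pattern are assumptions:

```
import java.text.SimpleDateFormat
import java.util.Date

dstream.foreachRDD { pricesRDD =>
  if (pricesRDD.count > 0) {
    // e.g. /data/prices/2016-09-14/15/prices_1473862284010
    val hourDir = new SimpleDateFormat("yyyy-MM-dd/HH").format(new Date)
    pricesRDD.repartition(1)
      .saveAsTextFile(s"/data/prices/$hourDir/prices_${System.currentTimeMillis}")
  }
}
// Zeppelin then only needs to read the directory for the hour of interest.
```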

Best  regards

> On 14 Sep 2016, at 17:28, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> I have a Spark streaming that reads messages/prices from Kafka and writes it 
> as text file to HDFS.
> 
> This is pretty efficient. Its only function is to persist the incoming 
> messages to HDFS.
> 
> This is what it does
>  dstream.foreachRDD { pricesRDD =>
>val x= pricesRDD.count
>// Check if any messages in
>if (x > 0)
>{
>// Combine each partition's results into a single RDD
>  val cachedRDD = pricesRDD.repartition(1).cache
>  cachedRDD.saveAsTextFile("/data/prices/prices_" + 
> System.currentTimeMillis.toString)
> 
> 
> So these are the files on HDFS directory
> 
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11 
> /data/prices/prices_1473862284010
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11 
> /data/prices/prices_1473862288010
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11 
> /data/prices/prices_1473862290010
> drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11 
> /data/prices/prices_1473862294010
> 
> Now I present these prices to Zeppelin. These files are produced every 2 
> seconds. However, when I get to plot them, I am only interested in, say, one 
> hour's data.
> I cater for this by using filter on prices (each has a TIMECREATED).
> 
> I don't think this is efficient as I don't want to load all these files. I 
> just want to read the prices created in the past hour or so.
> 
> One thing I considered was to load all prices by converting 
> System.currentTimeMillis into today's date and fetch the most recent ones. 
> However, this is looking cumbersome. I can create these files with any 
> timestamp extension when persisting but System.currentTimeMillis seems to be 
> most efficient.
> 
> Any alternatives you can think of?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  


Re: ACID transactions on data added from Spark not working

2016-09-14 Thread Mich Talebzadeh
Hi,

I believe this is an issue with Spark handling transactional tables in Hive.

When you add rows from Spark to an ORC transactional table, the Hive metadata
tables HIVE_LOCKS and TXNS are not updated. This does not happen
with Hive itself. As a result these new rows are left in an inconsistent
state.

Also, if you delete rows from the said tables using Hive, many delta files
will be produced. Hive will eventually compact them (roll them into the base
file itself) and can then read the data. However, Spark will not be
able to read the delta files until everything is compacted.
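
For reference, a minimal sketch of asking Hive for a major compaction explicitly rather than waiting for it, using standard Hive DDL over a HiveServer2 JDBC connection; the host/port and credentials are assumptions, and the Hive JDBC driver needs to be on the classpath:

```
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
try {
  val stmt = conn.createStatement()
  // Merges the accumulated delta files into the base file; progress shows up in SHOW COMPACTIONS.
  stmt.execute("ALTER TABLE testdb.test COMPACT 'major'")
  stmt.close()
} finally {
  conn.close()
}
```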

Anyway this is my experience.

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 17:47, Jack Wenger  wrote:

> Hi there,
>
> I'm trying to use ACID transactions in Hive but I have a problem when the
> data are added with Spark.
>
> First, I created a table with the following statement :
> 
> __
> CREATE TABLE testdb.test(id string, col1 string)
> CLUSTERED BY (id) INTO 4 BUCKETS
> STORED AS ORC TBLPROPERTIES('transactional'='true');
> 
> __
>
> Then I added data with those queries :
> 
> __
> INSERT INTO testdb.test VALUES("1", "A");
> INSERT INTO testdb.test VALUES("2", "B");
> INSERT INTO testdb.test VALUES("3", "C");
> 
> __
>
> And I've been able to delete rows with this query :
> 
> __
> DELETE FROM testdb.test WHERE id="1";
> 
> __
>
> All that worked perfectly, but a problem occurs when I try to delete rows
> that were added with Spark.
>
> What I do in Spark (iPython) :
> 
> __
> hc = HiveContext(sc)
> data = sc.parallelize([["1", "A"], ["2", "B"], ["3", "C"]])
> data_df = hc.createDataFrame(data)
> data_df.registerTempTable("data_df")
> hc.sql("INSERT INTO testdb.test SELECT * FROM data_df");
> 
> __
>
> Then, when I come back to Hive, I'm able to run a SELECT query on this the
> "test" table.
> However, when I try to run the exact same DELETE query as before, I have
> the following error (it happens after the reduce phase) :
>
> 
> __
>
> Error: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException:
> Hive Runtime Error while processing row (tag=0)
> {"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"r
> owid":0}},"value":null}
> at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecRed
> ucer.java:265)
> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGro
> upInformation.java:1671)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime
> Error while processing row (tag=0) {"key":{"reducesinkkey0":{"tra
> nsactionid":0,"bucketid":-1,"rowid":0}},"value":null}
> at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecRed
> ucer.java:253)
> ... 7 more
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
> at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(Fi
> leSinkOperator.java:723)
> at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
> at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(Sele
> ctOperator.java:84)
> at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecRed
> ucer.java:244)
> ... 7 more
>
> 
> __
>
> I have no idea where this is coming from, that is why I'm looking for
> advices on this mailing list.
>
> I'm 

ACID transactions on data added from Spark not working

2016-09-14 Thread Jack Wenger
Hi there,

I'm trying to use ACID transactions in Hive but I have a problem when the
data are added with Spark.

First, I created a table with the following statement :

__
CREATE TABLE testdb.test(id string, col1 string)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC TBLPROPERTIES('transactional'='true');

__

Then I added data with those queries :

__
INSERT INTO testdb.test VALUES("1", "A");
INSERT INTO testdb.test VALUES("2", "B");
INSERT INTO testdb.test VALUES("3", "C");

__

And I've been able to delete rows with this query :

__
DELETE FROM testdb.test WHERE id="1";

__

All that worked perfectly, but a problem occurs when I try to delete rows
that were added with Spark.

What I do in Spark (iPython) :

__
hc = HiveContext(sc)
data = sc.parallelize([["1", "A"], ["2", "B"], ["3", "C"]])
data_df = hc.createDataFrame(data)
data_df.registerTempTable("data_df")
hc.sql("INSERT INTO testdb.test SELECT * FROM data_df");

__

Then, when I come back to Hive, I'm able to run a SELECT query on this the
"test" table.
However, when I try to run the exact same DELETE query as before, I have
the following error (it happens after the reduce phase) :


__

Error: java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException:
Hive Runtime Error while processing row (tag=0)
{"key":{"reducesinkkey0":{"transactionid":0,"bucketid":-1,"
rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecRed
ucer.java:265)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGro
upInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime
Error while processing row (tag=0) {"key":{"reducesinkkey0":{"tra
nsactionid":0,"bucketid":-1,"rowid":0}},"value":null}
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecRed
ucer.java:253)
... 7 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(
FileSinkOperator.java:723)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(Sele
ctOperator.java:84)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecRed
ucer.java:244)
... 7 more


__

I have no idea where this is coming from, that is why I'm looking for
advices on this mailing list.

I'm using the Cloudera Quickstart VM (5.4.2).
Hive version : 1.1.0
Spark Version : 1.3.0

And here is the complete output of the Hive DELETE command :

__

hive> delete from testdb.test where id="1";

Query ID = cloudera_20160914090303_795e40b7-ab6a-45b0-8391-6d41d1cfe7bd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1473858545651_0036, Tracking URL =
http://quickstart.cloudera:8088/proxy/application_1473858545651_0036/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1473858545651_0036
Hadoop job information for Stage-1: number of mappers: 2; number of
reducers: 4
2016-09-14 09:03:55,571 Stage-1 map = 0%,  reduce = 0%
2016-09-14 09:04:14,898 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU
1.66 sec
2016-09-14 09:04:15,944 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
3.33 sec
2016-09-14 09:04:44,101 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU
4.21 sec
2016-09-14 09:04:46,523 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU
4.79 sec
2016-09-14 09:04:47,673 Stage-1 map = 100%,  reduce = 42%, Cumulative 

Re: Spark kafka integration issues

2016-09-14 Thread Mukesh Jha
Thanks for the reply Cody.

I found the below article on the same, very helpful. Thanks for the
details, much appreciated.

http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/
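For the record, the per-partition offset pattern Cody describes in point 1 below works out to roughly this (a sketch only; the broker, topic and offset values are placeholders and sc is an existing SparkContext):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// build a non-streaming RDD for an explicit set of offset ranges
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder broker list
val ranges = Array(OffsetRange("my-topic", 0, 0L, 100L))          // placeholder topic/partition/offsets
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, ranges)

// the RDD returned by createRDD/createDirectStream implements HasOffsetRanges,
// so topic, partition and offset info is available per partition
rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { o =>
  println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
}

For point 2, a custom decoder would simply replace StringDecoder in the type parameters above.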

On Tue, Sep 13, 2016 at 8:14 PM, Cody Koeninger  wrote:

> 1.  see http://spark.apache.org/docs/latest/streaming-kafka-
> integration.html#approach-2-direct-approach-no-receivers
>  look for HasOffsetRange.  If you really want the info per-message
> rather than per-partition, createRDD has an overload that takes a
> messageHandler from MessageAndMetadata to whatever you need
>
> 2. createRDD takes type parameters for the key and value decoder, so
> specify them there
>
> 3. you can use spark-streaming-kafka-0-8 against 0.9 or 0.10 brokers.
> There is a spark-streaming-kafka-0-10 package with additional features
> that only works on brokers 0.10 or higher.  A pull request for
> documenting it has been merged, but not deployed.
>
> On Tue, Sep 13, 2016 at 6:46 PM, Mukesh Jha 
> wrote:
> > Hello fellow sparkers,
> >
> > I'm using spark to consume messages from kafka in a non streaming
> fashion.
> > I'm using spark-streaming-kafka-0-8_2.10 & Spark v2.0 to do the
> > same.
> >
> > I have a few queries for the same, please get back if you guys have
> clues on
> > the same.
> >
> > 1) Is there any way to get the topic, partition & offset
> > information for each item from the KafkaRDD. I'm using the
> > KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder] to
> create
> > my kafka RDD.
> > 2) How to pass my custom Decoder instead of using the String or Byte
> decoder
> > are there any examples for the same?
> > 3) Is there a newer version to consume from kafka-0.10 & kafka-0.9
> clusters
> >
> > --
> > Thanks & Regards,
> >
> > Mukesh Jha
>



-- 


Thanks & Regards,

*Mukesh Jha *


Re: Add sqldriver.jar to Spark 1.6.0 executors

2016-09-14 Thread Marcelo Vanzin
Use:

spark-submit --jars /path/sqldriver.jar --conf
spark.driver.extraClassPath=sqldriver.jar --conf
spark.executor.extraClassPath=sqldriver.jar

In client mode the driver's classpath needs to point to the full path,
not just the name.
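In other words, something like this (paths are placeholders):

spark-submit --jars /opt/jars/sqldriver.jar \
  --conf spark.driver.extraClassPath=/opt/jars/sqldriver.jar \
  --conf spark.executor.extraClassPath=sqldriver.jar \
  ...

The executors can reference just the file name because --jars ships the jar into each container's working directory on YARN.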


On Wed, Sep 14, 2016 at 5:42 AM, Kevin Tran  wrote:
> Hi Everyone,
>
> I tried in cluster mode on YARN
>  * spark-submit  --jars /path/sqldriver.jar
>  * --driver-class-path
>  * spark-env.sh
> SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/path/*"
>  * spark-defaults.conf
> spark.driver.extraClassPath
> spark.executor.extraClassPath
>
> None of them works for me!
>
> Has anyone got a Spark app working with a driver jar on the executors before?
> Please give me your ideas. Thank you.
>
> Cheers,
> Kevin.



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Reading the most recent text files created by Spark streaming

2016-09-14 Thread Mich Talebzadeh
Hi,

I have a Spark streaming that reads messages/prices from Kafka and writes
it as text file to HDFS.

This is pretty efficient. Its only function is to persist the incoming
messages to HDFS.

This is what it does:

 dstream.foreachRDD { pricesRDD =>
   val x = pricesRDD.count
   // Check if any messages in
   if (x > 0)
   {
     // Combine each partition's results into a single RDD
     val cachedRDD = pricesRDD.repartition(1).cache
     cachedRDD.saveAsTextFile("/data/prices/prices_" +
       System.currentTimeMillis.toString)
   }
 }


So these are the files on HDFS directory

drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11
/data/prices/prices_1473862284010
drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11
/data/prices/prices_1473862288010
drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11
/data/prices/prices_1473862290010
drwxr-xr-x   - hduser supergroup  0 2016-09-14 15:11
/data/prices/prices_1473862294010

Now I present these prices to Zeppelin. These files are produced every 2
seconds. However, when I get to plot them, I am only interested in, say, one
hour's data.

I cater for this by using filter on prices (each has a TIMECREATED).


I don't think this is efficient as I don't want to load all these files. I
just want to read the prices created in the past hour or so.


One thing I considered was to load all prices by converting
System.currentTimeMillis into today's date and fetch the most recent ones.
However, this is looking cumbersome. I can create these files with any
timestamp extension when persisting but System.currentTimeMillis seems to
be most efficient.
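For illustration, the date-based layout I am considering would look something like this when persisting (the directory layout here is just an assumption):

import java.text.SimpleDateFormat
import java.util.Date

// write into a dated sub-directory so that only one day's files need to be listed later
val day = new SimpleDateFormat("yyyy-MM-dd").format(new Date())
cachedRDD.saveAsTextFile(s"/data/prices/$day/prices_${System.currentTimeMillis}")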


Any alternatives you can think of?


Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Re: t it does not stop at breakpoints which is in an anonymous function

2016-09-14 Thread chen yong
Dear Dirceu,


Thank you again.


Actually, I never saw it stop at the breakpoints no matter how long I waited.
It just skipped the whole anonymous function and reached the first
breakpoint immediately after the anonymous function body. Is that normal? I
suspect something is wrong in my debugging operations or settings. I am very
new to Spark and Scala.


Additionally, please give me some detailed instructions about "Some IDEs
provide you a place where you can execute the code to see its results".
Where is that place?


Your help is badly needed!



From: Dirceu Semighini Filho
Sent: 14 September 2016 23:07
To: chen yong
Subject: Re: Re: t it does not stop at breakpoints which is in an anonymous function

You can call a count in the IDE just to debug, or you can wait until it reaches
the code, so you can debug.
Some IDEs provide you a place where you can execute the code to see its
results.
Be careful not to add these operations to your production code, because they
can slow down its execution.



2016-09-14 11:43 GMT-03:00 chen yong 
>:


Thanks for your reply.

You mean I have to insert some code, such as x.count or x.collect,
between the original Spark code lines to invoke some operations, right?
But where are the right places to put these lines?

Felix


From: Dirceu Semighini Filho
Sent: 14 September 2016 22:33
To: chen yong
Cc: user@spark.apache.org
Subject: Re: t it does not stop at breakpoints which is in an anonymous function

Hello Felix,
Spark functions run lazily, and that's why it doesn't stop at those breakpoints.
They will be executed only when you call some methods of your dataframe/rdd,
like count, collect, ...

Regards,
Dirceu

2016-09-14 11:26 GMT-03:00 chen yong 
>:

Hi all,



I am a newbie to Spark. I am learning Spark by debugging the Spark code. It is
strange to me that it does not stop at breakpoints which are in an anonymous
function; it is normal in an ordinary function, though. Is that normal? How do I
observe variables in an anonymous function?


Please help me. Thanks in advance!


Felix




RE: how to specify cores and executor to run spark jobs simultaneously

2016-09-14 Thread 박경희
Hi Divya,

I think you tried to run Spark jobs on YARN, and that you would like to submit each job to a different queue. You may need to prepare queues on YARN for running Spark jobs by configuring the scheduler.
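For example, once the queues are defined in the scheduler configuration, each job can be pointed at its own queue at submit time (the queue name below is a placeholder):

spark-submit --master yarn --queue spark_queue_1 ...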
Best regards,
KyeongHee

- Original Message -
Sender : Divya Gehlot 
Date : 2016-09-14 15:08 (GMT+9)
Title : how to specify cores and executor to run spark jobs simultaneously

Hi,

I am on an EMR cluster and my cluster configuration is as below:
Number of nodes including master node - 3
Memory: 22.50 GB
VCores Total: 16
Active Nodes: 2
Spark version - 1.6.1

Parameter set in spark-default.conf

spark.executor.instances         2
spark.executor.cores             8
spark.driver.memory              10473M
spark.executor.memory            9658M
spark.default.parallelism        32

Would let me know if you need any other info regarding the cluster.

The current configuration for spark-submit is
--driver-memory 5G \
--executor-memory 2G \
--executor-cores 5 \
--num-executors 10 \

Currently with the above job configuration, if I try to run another Spark job it will be in accepted state till the first one finishes.
How do I optimize or update the above spark-submit configurations to run some more Spark jobs simultaneously?
Would really appreciate the help.

Thanks,
Divya

 


Re: Streaming Backpressure with Multiple Streams

2016-09-14 Thread Jeff Nadler
So as you were maybe thinking, it only happens with the combination:

Direct Stream only + backpressure = works as expected

4x Receiver on Topic A + Direct Stream on Topic B + backpressure = the
direct stream is throttled even in the absence of scheduling delay

This is using Spark 1.5.0 on CDH.

After it's been running for several minutes if I look at "Input Metadata" I
can see that the direct stream is consuming 1 record / partition / sec.  I
have maxrate set at 10,000 records / partition / sec.
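For reference, the settings in play on our side are along these lines (a sketch; values illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.receiver.maxRate", "10000")            // cap for the receiver-based stream
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")   // cap for the direct stream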

I'll file a bug today unless someone has any ideas?

Thanks!

Jeff


On Fri, Sep 9, 2016 at 5:54 PM, Jeff Nadler  wrote:

> Yes I'll test that next.
>
> On Sep 9, 2016 5:36 PM, "Cody Koeninger"  wrote:
>
>> Does the same thing happen if you're only using direct stream plus back
>> pressure, not the receiver stream?
>>
>> On Sep 9, 2016 6:41 PM, "Jeff Nadler"  wrote:
>>
>>> Maybe this is a pretty esoteric implementation, but I'm seeing some bad
>>> behavior with backpressure plus multiple Kafka streams / direct streams.
>>>
>>> Here's the scenario:
>>> We have 1 Kafka topic using the reliable receiver (4 receivers, union
>>> the result). In the same app, we consume another Kafka topic using a
>>> direct stream.
>>>
>>> This may seem strange, but it's necessary in my application to work
>>> around another problem: Maxrate is set globally in SparkConf. IMO it
>>> would be more flexible if we could set maxrate for each stream
>>> independently.   Since directstream uses a different config parameter for
>>> maxrate, we get the desired result.
>>>
>>> A bit hacky I know.
>>>
>>> Anyway, we recently turned on backpressure.   It works as expected for
>>> the receiver-based stream. For the direct stream, it starts out at the
>>> maxrate (as expected) on the first batch.Then it ratchets down the
>>> consumption until it is eventually consuming 1 record / second / partition.
>>>
>>> This happens even though there's no scheduling delay, and the
>>> receiver-based stream does not appear to be throttled.
>>>
>>> Anyone ever see anything like this?
>>>
>>> Thanks!
>>>
>>> Jeff Nadler
>>> Aerohive Networks
>>>
>>>


RE: Anyone got a good solid example of integrating Spark and Solr

2016-09-14 Thread Adline Dsilva
Hi
Take a look at https://github.com/lucidworks/spark-solr . It supports
authentication with kerberized Solr. Unfortunately this implementation only
supports Solr 5.x+, and CDH ships Solr 4.x. One option is to use Apache Solr
6.x with CDH.
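From my reading of that project's README, usage is roughly along these lines (treat the format name and option keys as assumptions; the zkhost and collection values are placeholders, and sqlContext is an existing SQLContext):

val options = Map(
  "zkhost" -> "zkhost1:2181/solr",     // placeholder ZooKeeper connect string
  "collection" -> "my_collection"      // placeholder collection name
)
val solrDF = sqlContext.read.format("solr").options(options).load()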

Regards,
Adline
Sent from Mail for Windows 10

From: Nkechi Achara
Sent: Wednesday, September 14, 2016 7:52 PM
To: user@spark.apache.org
Subject: Anyone got a good solid example of integrating Spark and Solr

Hi All,

I am trying to find some good examples on how to implement Spark and Solr and 
coming up blank. Basically the implementation of spark-solr does not seem to 
work correctly with CDH 552 (*1.5.x branch) where i am receiving various issues 
relating to dependencies, which I have not been fully able to unravel.

I have also attempted to implement a custom solution, where i copy the token 
and jaas to each executor, and set the necessary auth Properties, but this 
still is prone to failure due to serialization, and kerberos auth issues.

Has anyone got an example of an implementation of querying solr in a 
distributed way where kerberos is involved?

Thanks,

K


DISCLAIMER:


This e-mail (including any attachments) is for the addressee(s) only and may be 
confidential, especially as regards personal data. If you are not the intended 
recipient, please note that any dealing, review, distribution, printing, 
copying or use of this e-mail is strictly prohibited. If you have received this 
email in error, please notify the sender immediately and delete the original 
message (including any attachments).

MIMOS Berhad is a research and development institution under the purview of the 
Malaysian Ministry of Science, Technology and Innovation. Opinions, conclusions 
and other information in this e-mail that do not relate to the official 
business of MIMOS Berhad and/or its subsidiaries shall be understood as neither 
given nor endorsed by MIMOS Berhad and/or its subsidiaries and neither MIMOS 
Berhad nor its subsidiaries accepts responsibility for the same. All liability 
arising from or in connection with computer viruses and/or corrupted e-mails is 
excluded to the fullest extent permitted by law.


Re: Sqoop on Spark

2016-09-14 Thread Mich Talebzadeh
Sqoop is a standalone product (a utility) that is used to get data out of
JDBC compliant database tables into HDFS and Hive if specified.
Spark can also use JDBC to get data out from such tables. However, I have
not come across a situation where Sqoop is invoked from Spark.
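As a sketch of the Spark JDBC route (the URL, table and credentials are placeholders, and sqlContext is an existing SQLContext):

val jdbcDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/mydb")   // placeholder JDBC URL
  .option("dbtable", "myschema.mytable")                   // placeholder table
  .option("user", "myuser")
  .option("password", "mypassword")
  .load()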

Have a look at Sqoop doc



HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 15:31, KhajaAsmath Mohammed  wrote:

> Hi Experts,
>
> Good morning.
>
> I am looking for some references on how to use sqoop with spark. could you
> please let me know if there are any references on how to use it.
>
> Thanks,
> Asmath.
>


Re: t it does not stop at breakpoints which is in an anonymous function

2016-09-14 Thread chen yong

Thanks for your reply.

You mean I have to insert some code, such as x.count or x.collect,
between the original Spark code lines to invoke some operations, right?
But where are the right places to put these lines?

Felix


From: Dirceu Semighini Filho
Sent: 14 September 2016 22:33
To: chen yong
Cc: user@spark.apache.org
Subject: Re: t it does not stop at breakpoints which is in an anonymous function

Hello Felix,
Spark functions run lazily, and that's why it doesn't stop at those breakpoints.
They will be executed only when you call some methods of your dataframe/rdd,
like count, collect, ...

Regards,
Dirceu

2016-09-14 11:26 GMT-03:00 chen yong 
>:

Hi all,



I am a newbie to Spark. I am learning Spark by debugging the Spark code. It is
strange to me that it does not stop at breakpoints which are in an anonymous
function; it is normal in an ordinary function, though. Is that normal? How do I
observe variables in an anonymous function?


Please help me. Thanks in advance!


Felix



Re: unsubscribe

2016-09-14 Thread Daniel Lopes
Hi Chang,

just send a e-mail to user-unsubscr...@spark.apache.org

Best,

*Daniel Lopes*
Chief Data and Analytics Officer | OneMatch
c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes

www.onematch.com.br


On Tue, Sep 13, 2016 at 12:38 AM, ChangMingMin(常明敏) <
chang_ming...@founder.com> wrote:

> unsubscribe
>


Re: t it does not stop at breakpoints which is in an anonymous function

2016-09-14 Thread Dirceu Semighini Filho
Hello Felix,
Spark functions run lazily, and that's why it doesn't stop at those
breakpoints.
They will be executed only when you call some methods of your
dataframe/rdd, like count, collect, ...
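A minimal illustration of the point (assuming an existing SparkContext sc):

val doubled = sc.parallelize(1 to 10).map { x =>
  // a breakpoint here is NOT hit when this line is defined: map is lazy
  x * 2
}
// only an action such as count triggers the anonymous function,
// so this is the moment the debugger stops inside the map
val n = doubled.count()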

Regards,
Dirceu

2016-09-14 11:26 GMT-03:00 chen yong :

> Hi all,
>
>
>
> I am a newbie to Spark. I am learning Spark by debugging the Spark code. It
> is strange to me that it does not stop at breakpoints which are
> in an anonymous function; it is normal in an ordinary function, though. Is
> that normal? How do I observe variables in an anonymous function?
>
>
> Please help me. Thanks in advance!
>
>
> Felix
>


Sqoop on Spark

2016-09-14 Thread KhajaAsmath Mohammed
Hi Experts,

Good morning.

I am looking for some references on how to use sqoop with spark. could you
please let me know if there are any references on how to use it.

Thanks,
Asmath.


t it does not stop at breakpoints which is in an anonymous function

2016-09-14 Thread chen yong
Hi all,



I am a newbie to Spark. I am learning Spark by debugging the Spark code. It is
strange to me that it does not stop at breakpoints which are in an anonymous
function; it is normal in an ordinary function, though. Is that normal? How do I
observe variables in an anonymous function?


Please help me. Thanks in advance!


Felix


Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Jörn Franke
It could be that by going through the RDD the data is converted from the internal
format to Java objects (so much more memory is needed), which may lead to spill-over
to disk. This conversion takes a lot of time. Then you need to transfer these
Java objects over the network to one single node (repartition ...); on a
1 Gbit network, 3 GB takes about 25 seconds under optimal conditions (and since Java
objects are being transferred it may well be more than 3 GB, and that assumes no other
transfers at the same time, jumbo frames activated, etc.). On the destination
node we may again have spill-over to disk. Then you store the data to a single disk
(potentially multiple if you have and use HDFS), which also takes time (assuming
that no other process uses this disk).

Btw spark-csv can be used with different dataframes.
As said, other options are compression, avoid repartitioning (to avoid network 
transfer), avoid spilling to disk (provide memory in yarn etc), increase 
network bandwidth ...
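As a sketch of the spark-csv route without the repartition (assuming the spark-csv package is on the classpath and that the delimiter below is whatever the legacy feed expects):

myDF.write
  .format("com.databricks.spark.csv")
  .option("delimiter", "|")    // assumed delimiter
  .save("mypath/output")       // one part file per partition, no shuffle onto a single node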

> On 14 Sep 2016, at 14:22, sanat kumar Patnaik  wrote:
> 
> These are not csv files, utf8 files with a specific delimiter.
> I tried this out with a file(3 GB):
> 
> myDF.write.json("output/myJson")
> Time taken- 60 secs approximately.
> 
> myDF.rdd.repartition(1).saveAsTextFile("output/text")
> Time taken 160 secs
> 
> That is where I am concerned, the time to write a text file compared to json 
> grows exponentially.
> 
>> On Wednesday, September 14, 2016, Mich Talebzadeh 
>>  wrote:
>> These intermediate file what sort of files are there. Are there csv type 
>> files.
>> 
>> I agree that DF is more efficient than an RDD as it follows tabular format 
>> (I assume that is what you mean by "columnar" format). So if you read these 
>> files in a batch process you may not worry too much about execution time?
>> 
>> A textFile saving is simply a one to one mapping from your DF to HDFS. I 
>> think it is pretty efficient.
>> 
>> For myself, I would do something like below
>> 
>> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>>> On 14 September 2016 at 12:46, sanat kumar Patnaik 
>>>  wrote:
>>> Hi All,
>>> 
>>> I am writing a batch application using Spark SQL and Dataframes. This 
>>> application has a bunch of file joins and there are intermediate points 
>>> where I need to drop a file for downstream applications to consume.
>>> The problem is all these downstream applications are still on legacy, so 
>>> they still require us to drop them a text file.As you all must be knowing 
>>> Dataframe stores the data in columnar format internally.
>>> Only way I found out how to do this and which looks awfully slow is this:
>>> 
>>> myDF=sc.textFile("inputpath").toDF()
>>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>>  
>>> Is there any better way to do this?
>>> 
>>> P.S: The other workaround would be to use RDDs for all my operations. But I 
>>> am wary of using them as the documentation says Dataframes are way faster 
>>> because of the Catalyst engine running behind the scene.
>>> 
>>> Please suggest if any of you might have tried something similar.
>> 
> 
> 
> -- 
> Regards,
> Sanat Patnaik
> Cell->804-882-6424


Re: Spark Interview questions

2016-09-14 Thread Mich Talebzadeh
Hi Ashok,

I am sure we all have some war stories some of which I recall:


   1. What is meant by RDD, DataFrame and Dataset
   2. What is the meant by "All transformations in Spark are lazy"?
   3. What are the two types of operations supported by RDD?
   4. What is meant by Spark running under a certain mode?
   5. Explain the difference between Spark Running in a Standalone mode and
   Yarn cluster mode
   6. What is the difference between Spark running in Yarn client mode and
   Yarn cluster mode.
   7. What is the difference between persist and cache
   8. If you cache a DataFrame what does it do and where is the memory
   consumed come from. Can you give a place where you can see its measurements
   9. What is meant by DAG? A broad outline
   10. What is shuffling in Spark. How can you minimise its impact
   11. How would you specify your spark hardware in a medium size set-up
   say 8 node cluster.
   12. How could one minimise the network latency within Spark and the
   underlying storage (assuming HDFS here)
   13. How can you parallelize your JDBC connection to a database say any
   RDBMS? How does it work
   14. What is the use case for Spark Thrift Server.
   15. How would you typically read and process a tab separated file into
   Spark
   16. If you have an OOM message in Spark how would you go about
   diagnosing the problem
   17. What is meant by spark-submit. How would you use it
   18. What is a Spark driver? If you run Spark in Local mode how many
   executors can you start
   19. What is meant by Spark Streaming. What is a use case example
   20. In Spark Streaming what parameters are important
   21. What are the typical analytic functions in Spark SQL
   22. What is the difference between RANK and DENSE_RANK


- I am sure there are many other questions that one think of. For example,
someone like Jacek Laskowski can provide more programming questions as he
is a professional Spark trainer :)

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 12:35, Ashok Kumar 
wrote:

> Hi,
>
> As a learner I appreciate if you have typical Spark interview questions
> for Spark/Scala junior roles that you can please forward to me.
>
> I will be very obliged
>


Error casting from data frame to case class object

2016-09-14 Thread franz_butterbaum
I'm exploring features of Spark 2.0, and am trying to load a simple csv file
into a dataset.

These are the contents of my file named people.csv:

name,age,occupation
John,21,student
Mark,33,analyst
Susan,27,scientist

Below is my code: 

import org.apache.spark.sql._
val spark =
SparkSession.builder().appName("test").master("local").getOrCreate
val data = spark.read.option("header", true).csv("people.csv")
import spark.implicits._
case class People(name: String, age: Int, occupation: String)
val people = data.as[People]

Running the above gives me the following error:

org.apache.spark.sql.AnalysisException: Cannot up cast `age` from bigint to
int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "People"
You can either add an explicit cast to the input data or choose a higher
precision type of the field in the target object;

How do I fix that? Is this not the right way to directly load data from csv
into a dataset using the desired case class?
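From that message I gather two possible fixes, sketched below (with spark.implicits._ already imported as above), but I am not sure they are the intended approach:

// option 1: widen the field in the case class, as the message suggests:
//   case class People(name: String, age: Long, occupation: String)

// option 2: cast the column explicitly before converting to the Dataset
import org.apache.spark.sql.functions.col
val people = data.withColumn("age", col("age").cast("int")).as[People]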



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-casting-from-data-frame-to-case-class-object-tp27717.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Mich Talebzadeh
As I understand you cannot deliver json file downstream as they want text
format.

If it is batch processing, what is the window of delivery within the SLA?

Writing a 3GB file in 160 seconds means that it takes > 50 seconds to
write 1 Gig, which looks like a long time to me. Even taking one minute for json
looks excessive.

Is your Spark on the same sub-net as your HDFS if HDFS and Spark are not
sharing the same hardware?

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 13:22, sanat kumar Patnaik 
wrote:

> These are not csv files, utf8 files with a specific delimiter.
> I tried this out with a file(3 GB):
>
> myDF.write.json("output/myJson")
> Time taken- 60 secs approximately.
>
> myDF.rdd.repartition(1).saveAsTextFile("output/text")
> Time taken 160 secs
>
> That is where I am concerned, the time to write a text file compared to
> json grows exponentially.
>
> On Wednesday, September 14, 2016, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> These intermediate file what sort of files are there. Are there csv type
>> files.
>>
>> I agree that DF is more efficient than an RDD as it follows tabular
>> format (I assume that is what you mean by "columnar" format). So if you
>> read these files in a batch process you may not worry too much about
>> execution time?
>>
>> A textFile saving is simply a one to one mapping from your DF to HDFS. I
>> think it is pretty efficient.
>>
>> For myself, I would do something like below
>>
>> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 14 September 2016 at 12:46, sanat kumar Patnaik <
>> patnaik.sa...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>>
>>>- I am writing a batch application using Spark SQL and Dataframes.
>>>This application has a bunch of file joins and there are intermediate
>>>points where I need to drop a file for downstream applications to 
>>> consume.
>>>- The problem is all these downstream applications are still on
>>>legacy, so they still require us to drop them a text file.As you all must
>>>be knowing Dataframe stores the data in columnar format internally.
>>>
>>> Only way I found out how to do this and which looks awfully slow is this:
>>>
>>> myDF=sc.textFile("inputpath").toDF()
>>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>>
>>> Is there any better way to do this?
>>>
>>> *P.S: *The other workaround would be to use RDDs for all my operations.
>>> But I am wary of using them as the documentation says Dataframes are way
>>> faster because of the Catalyst engine running behind the scene.
>>>
>>> Please suggest if any of you might have tried something similar.
>>>
>>
>>
>
> --
> Regards,
> Sanat Patnaik
> Cell->804-882-6424
>


Add sqldriver.jar to Spark 1.6.0 executors

2016-09-14 Thread Kevin Tran
Hi Everyone,

I tried in cluster mode on YARN
 * spark-submit  --jars /path/sqldriver.jar
 * --driver-class-path
 * spark-env.sh
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/path/*"
 * spark-defaults.conf
spark.driver.extraClassPath
spark.executor.extraClassPath

None of them works for me!

Has anyone got a Spark app working with a driver jar on the executors before?
Please give me your ideas. Thank you.

Cheers,
Kevin.


Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Jörn Franke
Hi,

DataFrames are more efficient if you have Tungsten activated as the underlying
processing engine (normally on by default). However, this only speeds up
processing; saving is an IO-bound operation, so it does not necessarily get faster.

What exactly is slow? The write?
You could use myDF.write.save().write...

However, repartition (1) means that everything is dumped into one executor and 
if there is a lot of data this may lead to network congestion.
Better (if it is supported by the legacy application) is to write each 
partition individually in a file.

If your processing is slow then you need to provide more concrete examples.


Best regards

> On 14 Sep 2016, at 14:10, Mich Talebzadeh  wrote:
> 
> These intermediate file what sort of files are there. Are there csv type 
> files.
> 
> I agree that DF is more efficient than an RDD as it follows tabular format (I 
> assume that is what you mean by "columnar" format). So if you read these 
> files in a batch process you may not worry too much about execution time?
> 
> A textFile saving is simply a one to one mapping from your DF to HDFS. I 
> think it is pretty efficient.
> 
> For myself, I would do something like below
> 
> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 14 September 2016 at 12:46, sanat kumar Patnaik  
>> wrote:
>> Hi All,
>> 
>> I am writing a batch application using Spark SQL and Dataframes. This 
>> application has a bunch of file joins and there are intermediate points 
>> where I need to drop a file for downstream applications to consume.
>> The problem is all these downstream applications are still on legacy, so 
>> they still require us to drop them a text file.As you all must be knowing 
>> Dataframe stores the data in columnar format internally.
>> Only way I found out how to do this and which looks awfully slow is this:
>> 
>> myDF=sc.textFile("inputpath").toDF()
>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>  
>> Is there any better way to do this?
>> 
>> P.S: The other workaround would be to use RDDs for all my operations. But I 
>> am wary of using them as the documentation says Dataframes are way faster 
>> because of the Catalyst engine running behind the scene.
>> 
>> Please suggest if any of you might have tried something similar.
> 


Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
These are not csv files; they are utf8 files with a specific delimiter.
I tried this out with a file(3 GB):

myDF.write.json("output/myJson")
Time taken- 60 secs approximately.

myDF.rdd.repartition(1).saveAsTextFile("output/text")
Time taken 160 secs

That is where I am concerned, the time to write a text file compared to
json grows exponentially.

On Wednesday, September 14, 2016, Mich Talebzadeh 
wrote:

> These intermediate file what sort of files are there. Are there csv type
> files.
>
> I agree that DF is more efficient than an RDD as it follows tabular format
> (I assume that is what you mean by "columnar" format). So if you read these
> files in a batch process you may not worry too much about execution time?
>
> A textFile saving is simply a one to one mapping from your DF to HDFS. I
> think it is pretty efficient.
>
> For myself, I would do something like below
>
> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 14 September 2016 at 12:46, sanat kumar Patnaik <
> patnaik.sa...@gmail.com
> > wrote:
>
>> Hi All,
>>
>>
>>- I am writing a batch application using Spark SQL and Dataframes.
>>This application has a bunch of file joins and there are intermediate
>>points where I need to drop a file for downstream applications to consume.
>>- The problem is all these downstream applications are still on
>>legacy, so they still require us to drop them a text file.As you all must
>>be knowing Dataframe stores the data in columnar format internally.
>>
>> Only way I found out how to do this and which looks awfully slow is this:
>>
>> myDF=sc.textFile("inputpath").toDF()
>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>
>> Is there any better way to do this?
>>
>> *P.S: *The other workaround would be to use RDDs for all my operations.
>> But I am wary of using them as the documentation says Dataframes are way
>> faster because of the Catalyst engine running behind the scene.
>>
>> Please suggest if any of you might have tried something similar.
>>
>
>

-- 
Regards,
Sanat Patnaik
Cell->804-882-6424


Re: Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread Mich Talebzadeh
These intermediate files, what sort of files are they? Are they csv-type
files?

I agree that a DF is more efficient than an RDD as it follows a tabular format
(I assume that is what you mean by "columnar" format). So if you read these
files in a batch process you may not worry too much about execution time?

A textFile save is simply a one-to-one mapping from your DF to HDFS. I
think it is pretty efficient.

For myself, I would do something like below

myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 12:46, sanat kumar Patnaik 
wrote:

> Hi All,
>
>
>- I am writing a batch application using Spark SQL and Dataframes.
>This application has a bunch of file joins and there are intermediate
>points where I need to drop a file for downstream applications to consume.
>- The problem is all these downstream applications are still on
>legacy, so they still require us to drop them a text file.As you all must
>be knowing Dataframe stores the data in columnar format internally.
>
> Only way I found out how to do this and which looks awfully slow is this:
>
> myDF=sc.textFile("inputpath").toDF()
> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>
> Is there any better way to do this?
>
> *P.S: *The other workaround would be to use RDDs for all my operations.
> But I am wary of using them as the documentation says Dataframes are way
> faster because of the Catalyst engine running behind the scene.
>
> Please suggest if any of you might have tried something similar.
>


Anyone got a good solid example of integrating Spark and Solr

2016-09-14 Thread Nkechi Achara
Hi All,

I am trying to find some good examples on how to implement Spark and Solr
and coming up blank. Basically the implementation of spark-solr does not
seem to work correctly with CDH 552 (*1.5.x branch) where I am receiving
various issues relating to dependencies, which I have not been fully able
to unravel.

I have also attempted to implement a custom solution, where i copy the
token and jaas to each executor, and set the necessary auth Properties, but
this still is prone to failure due to serialization, and kerberos auth
issues.

Has anyone got an example of an implementation of querying solr in a
distributed way where kerberos is involved?

Thanks,

K


Efficiently write a Dataframe to Text file(Spark Version 1.6.1)

2016-09-14 Thread sanat kumar Patnaik
Hi All,


   - I am writing a batch application using Spark SQL and Dataframes. This
   application has a bunch of file joins and there are intermediate points
   where I need to drop a file for downstream applications to consume.
   - The problem is all these downstream applications are still on legacy,
   so they still require us to drop them a text file.As you all must be
   knowing Dataframe stores the data in columnar format internally.

Only way I found out how to do this and which looks awfully slow is this:

myDF=sc.textFile("inputpath").toDF()
myDF.rdd.repartition(1).saveAsTextFile("mypath/output")

Is there any better way to do this?

*P.S: *The other workaround would be to use RDDs for all my operations. But
I am wary of using them as the documentation says Dataframes are way faster
because of the Catalyst engine running behind the scene.

Please suggest if any of you might have tried something similar.


Spark Interview questions

2016-09-14 Thread Ashok Kumar
Hi,
As a learner I appreciate if you have typical Spark interview questions for 
Spark/Scala junior roles that you can please forward to me.
I will be very obliged

Re: Spark Streaming - dividing DStream into mini batches

2016-09-14 Thread Daan Debie
Thanks for the awesome explanation! It's super clear to me now :)
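For anyone else reading along, the per-batch partitioning is easy to inspect from the stream itself (a sketch; dstream stands for any input DStream):

// each micro batch is one RDD per batch interval; print how it is partitioned
dstream.foreachRDD { (rdd, time) =>
  println(s"batch at $time -> ${rdd.partitions.size} partitions")
}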

On Tue, Sep 13, 2016 at 4:42 PM, Cody Koeninger  wrote:

> The DStream implementation decides how to produce an RDD for a time
> (this is the compute method)
>
> The RDD implementation decides how to partition things (this is the
> getPartitions method)
>
> You can look at those methods in DirectKafkaInputDStream and KafkaRDD
> respectively if you want to see an example
>
> On Tue, Sep 13, 2016 at 9:37 AM, Daan Debie  wrote:
> > Ah, that makes it much clearer, thanks!
> >
> > It also brings up an additional question: who/what decides on the
> > partitioning? Does Spark Streaming decide to divide a micro batch/RDD
> into
> > more than 1 partition based on size? Or is it something that the "source"
> > (SocketStream, KafkaStream etc.) decides?
> >
> > On Tue, Sep 13, 2016 at 4:26 PM, Cody Koeninger 
> wrote:
> >>
> >> A micro batch is an RDD.
> >>
> >> An RDD has partitions, so different executors can work on different
> >> partitions concurrently.
> >>
> >> Don't think of that as multiple micro-batches within a time slot.
> >> It's one RDD within a time slot, with multiple partitions.
> >>
> >> On Tue, Sep 13, 2016 at 9:01 AM, Daan Debie 
> wrote:
> >> > Thanks, but that thread does not answer my questions, which are about
> >> > the
> >> > distributed nature of RDDs vs the small nature of "micro batches" and
> on
> >> > how
> >> > Spark Streaming distributes work.
> >> >
> >> > On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh
> >> > 
> >> > wrote:
> >> >>
> >> >> Hi Daan,
> >> >>
> >> >> You may find this link Re: Is "spark streaming" streaming or
> >> >> mini-batch?
> >> >> helpful. This was a thread in this forum not long ago.
> >> >>
> >> >> HTH
> >> >>
> >> >> Dr Mich Talebzadeh
> >> >>
> >> >>
> >> >>
> >> >> LinkedIn
> >> >>
> >> >> https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >> >>
> >> >>
> >> >>
> >> >> http://talebzadehmich.wordpress.com
> >> >>
> >> >>
> >> >> Disclaimer: Use it at your own risk. Any and all responsibility for
> any
> >> >> loss, damage or destruction of data or any other property which may
> >> >> arise
> >> >> from relying on this email's technical content is explicitly
> >> >> disclaimed. The
> >> >> author will in no case be liable for any monetary damages arising
> from
> >> >> such
> >> >> loss, damage or destruction.
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On 13 September 2016 at 14:25, DandyDev 
> wrote:
> >> >>>
> >> >>> Hi all!
> >> >>>
> >> >>> When reading about Spark Streaming and its execution model, I see
> >> >>> diagrams
> >> >>> like this a lot:
> >> >>>
> >> >>>
> >> >>>
> >> >>> [image link: lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg]
> >> >>>
> >> >>> It does a fine job explaining how DStreams consist of micro batches
> >> >>> that
> >> >>> are
> >> >>> basically RDDs. There are however some things I don't understand:
> >> >>>
> >> >>> - RDDs are distributed by design, but micro batches are conceptually
> >> >>> small.
> >> >>> How/why are these micro batches distributed so that they need to be
> >> >>> implemented as RDD?
> >> >>> - The above image doesn't explain how Spark Streaming parallelizes
> >> >>> data.
> >> >>> According to the image, a stream of events get broken into micro
> >> >>> batches
> >> >>> over the axis of time (time 0 to 1 is a micro batch, time 1 to 2 is
> a
> >> >>> micro
> >> >>> batch, etc.). How does parallelism come into play here? Is it that
> >> >>> even
> >> >>> within a "time slot" (eg. time 0 to 1) there can be so many events,
> >> >>> that
> >> >>> multiple micro batches for that time slot will be created and
> >> >>> distributed
> >> >>> across the executors?
> >> >>>
> >> >>> Clarification would be helpful!
> >> >>>
> >> >>> Daan
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> View this message in context:
> >> >>>
> >> >>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-
> Streaming-dividing-DStream-into-mini-batches-tp27699.html
> >> >>> Sent from the Apache Spark User List mailing list archive at
> >> >>> Nabble.com.
> >> >>>
> >> >>> 
> -
> >> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >> >>>
> >> >>
> >> >
> >
> >
>


Re: Spark stalling during shuffle (maybe a memory issue)

2016-09-14 Thread bogdanbaraila
Hello Jonathan

Did you found any working solution for your issue? If yes could you please
share it?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-stalling-during-shuffle-maybe-a-memory-issue-tp6067p27716.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark SQL Thriftserver

2016-09-14 Thread Mich Talebzadeh
Actually this is what it says

Connecting to jdbc:hive2://rhes564:10055
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive

So it uses Spark SQL. However, they do not seem to have upgraded Beeline
version from 1.2.1

HTH

It is a useful tool with Zeppelin.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 00:55, ayan guha  wrote:

> Hi
>
> AFAIK STS uses Spark SQL and not Map Reduce. Is that not correct?
>
> Best
> Ayan
>
> On Wed, Sep 14, 2016 at 8:51 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> STS will rely on Hive execution engine. My Hive uses Spark execution
>> engine so STS will pass the SQL to Hive and let it do the work and return
>> the result set
>>
>>  which beeline
>> /usr/lib/spark-2.0.0-bin-hadoop2.6/bin/beeline
>> ${SPARK_HOME}/bin/beeline -u jdbc:hive2://rhes564:10055 -n hduser -p
>> 
>> Connecting to jdbc:hive2://rhes564:10055
>> Connected to: Spark SQL (version 2.0.0)
>> Driver: Hive JDBC (version 1.2.1.spark2)
>> Transaction isolation: TRANSACTION_REPEATABLE_READ
>> Beeline version 1.2.1.spark2 by Apache Hive
>> 0: jdbc:hive2://rhes564:10055>
>>
>> jdbc:hive2://rhes564:10055> select count(1) from test.prices;
>> Ok I did a simple query in STS, You will this in hive.log
>>
>> 2016-09-13T23:44:50,996 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217
>> get_database: test
>> 2016-09-13T23:44:50,996 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_database: test
>> 2016-09-13T23:44:50,998 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:50,998 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>> 2016-09-13T23:44:51,007 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:51,007 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>> 2016-09-13T23:44:51,021 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217
>> get_database: test
>> 2016-09-13T23:44:51,021 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_database: test
>> 2016-09-13T23:44:51,023 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:51,023 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>> 2016-09-13T23:44:51,029 INFO  [pool-4-thread-4]: metastore.HiveMetaStore
>> (HiveMetaStore.java:logInfo(670)) - 4: source:50.140.197.217 get_table :
>> db=test tbl=prices
>> 2016-09-13T23:44:51,029 INFO  [pool-4-thread-4]: HiveMetaStore.audit
>> (HiveMetaStore.java:logAuditEvent(280)) - ugi=hduser
>> ip=50.140.197.217   cmd=source:50.140.197.217 get_table : db=test
>> tbl=prices
>>
>> I think it is a good idea to switch to Spark engine (as opposed to MR).
>> My tests proved that Hive on Spark using DAG and in-memory offering runs at
>> least by order of magnitude faster compared to map-reduce.
>>
>> You can either connect to beeline from $HIVE_HOME/... or beeline from
>> $SPARK_HOME
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this 

Re: how to specify cores and executor to run spark jobs simultaneously

2016-09-14 Thread Deepak Sharma
I am not sure about EMR, but it seems multi-tenancy is not enabled in your
case.
Multi-tenancy means all the applications have to be submitted to different
queues.

Thanks
Deepak

On Wed, Sep 14, 2016 at 11:37 AM, Divya Gehlot 
wrote:

> Hi,
>
> I am on EMR cluster and My cluster configuration is as below:
> Number of nodes including master node - 3
> Memory:22.50 GB
> VCores Total : 16
> Active Nodes : 2
> Spark version- 1.6.1
>
> Parameter set in spark-default.conf
>
> spark.executor.instances 2
>> spark.executor.cores 8
>> spark.driver.memory  10473M
>> spark.executor.memory9658M
>> spark.default.parallelism32
>
>
> Would let me know if need any other info regarding the cluster .
>
> The current configuration for spark-submit is
> --driver-memory 5G \
> --executor-memory 2G \
> --executor-cores 5 \
> --num-executors 10 \
>
>
> Currently  with the above job configuration if I try to run another spark
> job it will be in accepted state till the first one finishes .
> How do I optimize or update the above spark-submit configurations to run
> some more spark jobs simultaneously
>
> Would really appreciate the help.
>
> Thanks,
> Divya
>
>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


how to specify cores and executor to run spark jobs simultaneously

2016-09-14 Thread Divya Gehlot
Hi,

I am on EMR cluster and My cluster configuration is as below:
Number of nodes including master node - 3
Memory:22.50 GB
VCores Total : 16
Active Nodes : 2
Spark version- 1.6.1

Parameter set in spark-default.conf

spark.executor.instances 2
> spark.executor.cores 8
> spark.driver.memory  10473M
> spark.executor.memory9658M
> spark.default.parallelism32


Would let me know if you need any other info regarding the cluster.

The current configuration for spark-submit is
--driver-memory 5G \
--executor-memory 2G \
--executor-cores 5 \
--num-executors 10 \


Currently with the above job configuration, if I try to run another Spark
job it will be in accepted state till the first one finishes.
How do I optimize or update the above spark-submit configurations to run
some more Spark jobs simultaneously?

Would really appreciate the help.

Thanks,
Divya